About Lucene and the Indexing and Search Workflow

阿新 • Published: 2019-01-18

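Lucene owes its popularity and success to its simplicity.

You therefore don't need a deep understanding of how Lucene indexes and retrieves information in order to start using it.

Lucene provides a simple yet powerful core API for full-text indexing and search; you only need to learn a handful of classes to integrate Lucene into an application.

People new to Lucene may mistake it for a file search tool, a web crawler, or a web search engine. In fact, Lucene is a software library, not a full-featured search application. It concerns itself with full-text indexing and searching, and it does that very well. Lucene hides the complex operations behind indexing and searching from your application and exposes a simple API, so you can concentrate on your own problem domain and business rules. You can think of Lucene as a layer, with your application sitting on top of it.

Lucene lets you add indexing and search capabilities to an application. Lucene does not care where the data comes from; it can index and search any data that can be converted to text. This means you can use Lucene to index and search web pages on remote web servers, documents stored on the local file system, plain text files, Microsoft Word documents, HTML or PDF files, or any other format from which text can be extracted.

At the core of every search engine is the concept of an index: raw data is processed into an efficient cross-reference lookup structure so that it can be retrieved quickly. Let's look at how fast, efficient indexing and searching work.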

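1. What is an index, and why is it so important?

Suppose you need to search a large number of files and want to find those that contain a certain word or phrase. How would you write a program to do this? One approach is to scan every file sequentially and check whether it contains the given word or phrase. This approach has many drawbacks, the most obvious being that it becomes unacceptably slow when there are many files. This is where an index comes in. To search a large amount of text, you first store the text in a particular structure that lets you search it quickly and eliminates the slow sequential scan. This structure is called an index, and the process of converting text into this structure is called indexing.

As a data structure, an index gives you fast random access to the words stored in it. It resembles the index of a dictionary: each word maps to a page, so a lookup goes straight to that page instead of flipping through the book one page at a time. A Lucene index is a specially designed data structure, usually stored on the file system as a set of index files.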
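The word-to-location mapping described above can be sketched as a toy in-memory inverted index. This is only an illustration under simplifying assumptions, not Lucene's actual on-disk format: the class name `ToyInvertedIndex`, its methods, and the crude tokenizer are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): maps each term to the names of the documents containing it. */
class ToyInvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Crude stand-in for an analyzer: lowercase, then split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** "Indexing": record, for every term in the text, that docName contains it. */
    void add(String docName, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
        }
    }

    /** "Searching": a single map lookup instead of a scan over every document. */
    Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("a.txt", "Lucene is a search library");
        index.add("b.txt", "Java search applications");
        System.out.println(index.search("search")); // both documents contain "search"
        System.out.println(index.search("lucene")); // only a.txt
    }
}
```

The cost of a lookup here depends on the number of distinct terms rather than on the total amount of text, which is the property that makes an index faster than a sequential scan.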

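2. What is searching?

Searching is the process of looking up keywords in the index to find the documents that contain them. Search quality is usually described in terms of precision and recall. Recall is the ratio of the number of relevant results returned by a search to the total number of documents relevant to the user's query; precision is the ratio of the number of relevant results returned to the total number of results returned. Other factors matter too, such as speed and the ability to search large amounts of text quickly, as well as support for single-term and multi-term queries, phrase queries, wildcards, and ranking and sorting of results.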
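The two ratios defined above can be made concrete with a small calculation. The sketch below is not part of Lucene; the class `SearchQuality` and the document identifiers are made up for illustration, and returning 0.0 for an empty denominator is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that computes precision and recall for one search. */
class SearchQuality {

    /** Number of returned documents that are actually relevant. */
    private static int relevantReturned(Set<String> returned, Set<String> relevant) {
        Set<String> common = new HashSet<>(returned);
        common.retainAll(relevant);
        return common.size();
    }

    /** precision = relevant results returned / all results returned */
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) {
            return 0.0; // assumption: treat "nothing returned" as precision 0
        }
        return (double) relevantReturned(returned, relevant) / returned.size();
    }

    /** recall = relevant results returned / all relevant documents */
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumption: treat "nothing relevant exists" as recall 0
        }
        return (double) relevantReturned(returned, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> returned = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d5"));
        // Two of the four returned documents are relevant; two of the three relevant ones were found.
        System.out.println("precision = " + precision(returned, relevant));
        System.out.println("recall    = " + recall(returned, relevant));
    }
}
```

In this example the search returns four documents of which two are relevant, so precision is 2/4; three relevant documents exist in total, so recall is 2/3.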

3. Lucene in Action

Suppose we need to index and search the files stored in a directory.

Before we can search with Lucene, we need to build an index. The Lucene version used here is 3.6.

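3.1 Creating the index

1) Create the Directory that will hold the index

2) Create the IndexWriterConfig, the configuration class for the index writer

3) Create the index writer from the index directory and the configuration

4) Use the index writer to write Documents to the index files

The indexer class: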
　　

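import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Indexer
 * @author Luxh
 */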
public class Indexer {


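    /**
     * Builds the index.
     * @param filePath path of the directory containing the files to index
     * @throws IOException if the files cannot be listed or the index cannot be written
     */
    public static void createIndex(String filePath) throws IOException {

        // Store the index in a directory named indexDir under the current path
        File indexDir = new File("./indexDir");

        // Open the index directory
        Directory directory = FSDirectory.open(indexDir);

        // Create an analyzer
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // Create the index writer configuration
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer);

        LogMergePolicy mergePolicy = new LogByteSizeMergePolicy();

        // Merge factor: how often segments are merged as Documents are added.
        // Smaller values make indexing slower;
        // larger values make indexing faster, and values > 10 suit batch indexing.
        mergePolicy.setMergeFactor(50);

        // Largest segment, counted in Documents, that may still be merged.
        // Smaller values favor the speed of incremental (append) indexing;
        // larger values suit batch indexing and faster searches.
        mergePolicy.setMaxMergeDocs(5000);

        // Use the compound file format, which packs each segment's files into a single file
        mergePolicy.setUseCompoundFile(true);

        indexWriterConfig.setMergePolicy(mergePolicy);

        // Open mode: create a new index, or append to an existing one
        indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);

        // Create the index writer
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        try {
            // listFiles() returns null when the path is not a readable directory
            File[] files = new File(filePath).listFiles();
            if (files == null) {
                throw new IOException("Not a readable directory: " + filePath);
            }
            for (File file : files) {
                if (!file.isFile()) {
                    continue; // skip subdirectories
                }

                // Document is Lucene's record structure; everything to be indexed is converted to a Document
                Document document = new Document();

                // File name: searchable, analyzed, stored in the index
                document.add(new Field("name", getFileName(file), Store.YES, Index.ANALYZED));

                // File path: searchable, not analyzed, stored in the index
                document.add(new Field("path", file.getAbsolutePath(), Store.YES, Index.NOT_ANALYZED));

                // Large text: searchable but not stored; the real text can be read back via the file path
                //document.add(new Field("content", new FileReader(file)));

                // Small text: can be stored in the index
                document.add(new Field("content", getFileContent(file), Store.YES, Index.ANALYZED));

                // Add the document to the index
                indexWriter.addDocument(document);
            }
        } finally {
            // close() commits the pending changes to the index on disk and releases the writer
            indexWriter.close();
        }

    }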

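    /**
     * Returns the file name without its extension.
     */
    public static String getFileName(File file) {
        String fileName = "";
        if (file != null) {
            fileName = file.getName();
            int dot = fileName.lastIndexOf('.');
            // Strip the extension only if there is one; lastIndexOf returns -1 when no dot is present
            if (dot > 0) {
                fileName = fileName.substring(0, dot);
            }
        }
        return fileName;
    }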

    /**
     * Reads the text content of a file.
     * @param file the file to read
     * @return the text read from the file
     */
    public static String getFileContent(File file) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the BufferedReader and the FileReader it wraps
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Keep a line separator so words on adjacent lines are not joined into one token
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();

    }


}

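IndexWriter: the index writer, responsible for creating and maintaining an index.

In Lucene 3.6, only one constructor is recommended, IndexWriter(Directory d, IndexWriterConfig conf); all the other constructors are deprecated. All IndexWriter settings are managed through IndexWriterConfig.

IndexWriterConfig: the configuration class that manages every setting of the index writer. It has a single constructor, IndexWriterConfig(Version matchVersion, Analyzer analyzer), where matchVersion is the Lucene version and analyzer is the analyzer.

Next, we run the indexer to build the index.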
　　

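import java.io.IOException;

import org.junit.Test;

public class TestIndexer {

    /**
     * Builds the index.
     * @throws IOException if the index cannot be written
     */
    @Test
    public void testCreateIndex() throws IOException {
        // Path of the directory containing the files to index
        String filePath = "./fileDir";
        // Call the indexer's createIndex method
        Indexer.createIndex(filePath);
    }
}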

We have now built an index over the files in fileDir under the current path.

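3.2 Running a search

Searching with Lucene is as simple and fast as building the index. We now create a searcher that finds the files containing a given piece of text.

1) Use QueryParser to parse the search keyword into a Lucene Query object. Creating the QueryParser requires an analyzer, and it must be the same analyzer that was used when the index was built.

2) Use FSDirectory to open the directory that holds the index.

3) Use IndexReader to read the index directory and IndexSearcher to run the search.

4) The search returns a TopDocs result object. TopDocs holds the total number of hits and the results themselves as a ScoreDoc array.

5) Iterate over the ScoreDoc array and fetch the Document for each ScoreDoc's document number.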
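Step 1 requires the same analyzer at indexing time and at query time. The toy sketch below shows why, using two made-up analyzers that differ only in whether they lowercase tokens. It does not use Lucene; the class `AnalyzerMismatchDemo` and its methods are hypothetical and far simpler than a real analyzer such as StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy demonstration (hypothetical) of an analyzer mismatch between indexing and querying. */
class AnalyzerMismatchDemo {

    /** Hypothetical analyzer A: splits on whitespace and lowercases each token. */
    static List<String> lowercasingAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Hypothetical analyzer B: splits on whitespace but keeps the original case. */
    static List<String> caseKeepingAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** True if the query produced at least one term and every term is among the indexed terms. */
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // The "index" holds the terms produced by analyzer A: java, lucene, tutorial
        Set<String> indexedTerms = new HashSet<>(lowercasingAnalyzer("Java Lucene tutorial"));

        // Same analyzer at query time: "Java" becomes "java" and is found
        System.out.println(matches(indexedTerms, lowercasingAnalyzer("Java"))); // true

        // Different analyzer at query time: "Java" stays "Java" and is not found
        System.out.println(matches(indexedTerms, caseKeepingAnalyzer("Java"))); // false
    }
}
```

The index stores whatever terms the indexing-time analyzer produced, so a query can only match if its own analysis yields exactly those terms.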

Here is the code for the searcher:
　　

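import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Searcher
 * @author Luxh
 */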
public class Searcher {

    /**
     * Searches the index.
     * @param keyWord the keyword to search for
     * @param indexDirPath path of the index directory
     * @throws ParseException if the keyword cannot be parsed into a query
     * @throws IOException if the index cannot be read
     * @return the matching documents
     */
    public static List<Document> search(String keyWord, String indexDirPath) throws ParseException, IOException {

        String[] fields = {"name", "content"};

        // Create an analyzer; it must match the one used when the index was built
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // Create the query parser
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_36, fields, analyzer);

        // Parse the keyword into a Lucene Query object
        Query query = queryParser.parse(keyWord);

        // Open the index directory
        File indexDir = new File(indexDirPath);
        Directory directory = FSDirectory.open(indexDir);

        // Obtain the interfaces for reading the index and searching it
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        List<Document> docs = new ArrayList<Document>();
        try {
            // TopDocs is the result returned by the search
            TopDocs topDocs = indexSearcher.search(query, 100); // return only the top 100 hits

            int totalCount = topDocs.totalHits; // total number of hits
            System.out.println("Total number of hits: " + totalCount);

            ScoreDoc[] scoreDocs = topDocs.scoreDocs; // the hits themselves

            for (ScoreDoc scoreDoc : scoreDocs) {
                // Document number
                int docID = scoreDoc.doc;
                // Fetch the document by its number
                Document doc = indexSearcher.doc(docID);
                docs.add(doc);
            }
        } finally {
            // Close the searcher before the reader it wraps
            indexSearcher.close();
            indexReader.close();
        }
        return docs;
    }
}

Next, we run the searcher:

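import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.junit.Test;

public class TestSearcher {

    /**
     * Runs a search.
     */
    @Test
    public void testSearch() throws IOException, ParseException {
        // Search keyword
        String keyWord = "Java";
        // Path of the index directory
        String indexDirPath = "./indexDir";
        // Call the searcher
        List<Document> docs = Searcher.search(keyWord, indexDirPath);
        for (Document doc : docs) {
            System.out.println("File name: " + doc.get("name"));
            System.out.println("Path: " + doc.get("path"));
            System.out.println("Content: " + doc.get("content"));
        }
    }

}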

Any files that contain the keyword will be returned by the search.
