Apache Lucene Full-Text Search Explained, with Development Examples

阿新 • Published: 2019-01-10

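Before we begin, some resources to share

First, when learning any open-source technology, new or old, the simplest way to start is to search Baidu for an overview of its general shape and the ideas behind it. Here I am contributing a PPT that explains things very well. I have converted it to PDF so that it is easy to keep.

Second, for your first programming attempt, I recommend consulting the official documentation. From what I found on Baidu, Lucene has been updated to version 4.9, which requires JDK 1.7 or later, so anyone still on 1.6 or even 1.5 should refer to an older version. Since I am using 1.6, I am using Lucene 4.0.

This is the official documentation for Lucene 4.0: http://lucene.apache.org/core/4_0_0/core/overview-summary.html

I greatly admire Lucene's open-source contributors. Lucene in Action is worth reading; the author originally wanted to make money by writing software and ended up contributing it to Apache. But I digress.

Finally, a reminder to those learning Lucene: this open-source project releases new versions fairly quickly, and the programming style differs between versions, so code from a post you find on Baidu may not work with 4.0 or 3.6.

For example, in earlier versions an IndexWriter was created like this:

    IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);

In 4.0, however, we need to create a configuration object and put the settings into it:

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);

So be sure to follow the programming style of the official documentation for your version when writing code.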

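These are the five most commonly used JAR files:

The first, and the most important, is lucene-core-4.0.0.jar, which contains the core code for documents, indexing, searching, storage, and so on.

The second, lucene-analyzers-common-4.0.0.jar, contains analyzers for various languages, used to split file content into keywords and extract them.

The third, lucene-highlighter-4.0.0.jar, is mainly used to highlight matches in search results.

The fourth and fifth, including lucene-queryparser-4.0.0.jar, provide search-related code for various kinds of queries, such as fuzzy search and range search.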

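Enough preamble. Next, a brief explanation of what full-text search is.

Suppose a folder or a disk holds many files (Notepad text files, Word, Excel, PDF), and we want to find the files that contain a given keyword. For example, if we enter Lucene, every file whose content contains Lucene is found. This is what is called full-text search.

It follows naturally that we should build a mapping between keywords and files. A diagram borrowed from the PPT explains clearly how this mapping is implemented.

Lucene uses this "inverted index" technique to implement the mapping.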
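To make the keyword-to-file mapping concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation: the class name InvertedIndexSketch, its methods, and the sample file names are all made up for illustration, and it assumes that splitting on non-letter, non-digit characters is an acceptable way to find terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): term -> sorted set of file names. */
final class InvertedIndexSketch {
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    /** Lowercases the text, splits it into terms, and records the file under each term. */
    void add(String fileName, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            Set<String> files = postings.get(term);
            if (files == null) {
                files = new TreeSet<String>();
                postings.put(term, files);
            }
            files.add(fileName);
        }
    }

    /** Returns the files containing the term (case-insensitive), or an empty list. */
    List<String> search(String term) {
        Set<String> files = postings.get(term.toLowerCase(Locale.ROOT));
        return files == null ? new ArrayList<String>() : new ArrayList<String>(files);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("a.txt", "Lucene is a search library");
        index.add("b.txt", "Full-text search with an inverted index");
        index.add("c.txt", "Notes about Lucene");
        System.out.println(index.search("lucene")); // prints [a.txt, c.txt]
        System.out.println(index.search("search")); // prints [a.txt, b.txt]
    }
}
```

The point of the structure is that a lookup touches only the entry for the search term instead of scanning every file, which is why the mapping is built ahead of time.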

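With this mapping in mind, let's look at Lucene's architecture.

Below is a diagram that appears in nearly every Lucene resource, and it does summarize the essentials.

As we can see, using Lucene comes down to two steps:

1 Create the index: IndexWriter builds an index over the various files and saves it in the location where the index files are stored.

2 Look up the documents related to a keyword through the index.

The following analyzes an example given on the official site: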

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();

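Creating the index

First, we need to define an analyzer.

Take the sentence “我愛我們的中國！” (I love our China!): how do we split it, drop the stop word “的”, and extract keywords such as “我”, “我們”, and “中國”? This is the job of the Analyzer. The example uses the standard analyzer; for Chinese specifically, it can be combined with paoding.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

The argument Version.LUCENE_CURRENT means the current Lucene version; in this article's environment it could also be written as Version.LUCENE_40.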
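As a rough illustration of what an analyzer does (split the text, normalize case, drop stop words), here is a small plain-Java sketch. It is not StandardAnalyzer: the class SimpleAnalyzerSketch and its stop-word list are assumptions made up for this example, and it only handles text whose words are separated by spaces or punctuation. Segmenting a Chinese sentence such as the one above needs dictionary-based splitting, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer (hypothetical, not Lucene's): lowercase, split, remove stop words. */
final class SimpleAnalyzerSketch {
    // Made-up stop-word list for this sketch; Lucene ships its own lists.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "is", "to", "be", "this"));

    /** Returns the terms that would be indexed for the given text, in order. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [text, indexed]
        System.out.println(analyze("This is the text to be indexed."));
    }
}
```

The same analysis has to be applied to the query string at search time, which is why the official example passes the same analyzer to both the IndexWriterConfig and the QueryParser.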

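The second step is to decide where the index files are stored. Lucene offers two options:

1 Local file storage

Directory directory = FSDirectory.open(new File("/tmp/testindex"));

2 In-memory storage

Directory directory = new RAMDirectory();

Choose whichever fits your needs.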

The third step is to create the IndexWriter, which writes the index files.

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);

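According to the official documentation, IndexWriterConfig holds the configuration for the IndexWriter. Its constructor takes two arguments: the first is the current version, and the second is the Analyzer.

The fourth step is to extract the content and store it in the index.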

Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

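The first line creates a Document object, which is similar to a row in a database table.

The second line is the string we are about to index.

The third line stores the string (because TextField.TYPE_STORED is set; if you do not want it stored, use a different field type, see the official documentation for details) under the "table name" "fieldname".

The fourth line adds the doc object to the index being built.

The fifth line closes the IndexWriter and commits what was written.

That is the index creation process.

Keyword search:

Step one: open the storage location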

DirectoryReader ireader = DirectoryReader.open(directory);

Step two: create the searcher

IndexSearcher isearcher = new IndexSearcher(ireader);

Step three: run the keyword query, much like SQL

QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.",hitDoc.get("fieldname"));
}

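Here we create a query parser and set its analyzer, along with "fieldname" as the "table name" to query. The query returns a collection, similar to an SQL ResultSet, from which we can extract the stored content.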
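The two-stage pattern in the snippet above (search returns document ids, then each id is used to fetch the stored document) can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: TinySearchSketch and its methods are hypothetical names, there is no relevance scoring (hits come back in insertion order rather than by score), and a single-term lookup stands in for a parsed query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy index (hypothetical): stored fields per document plus term -> document ids. */
final class TinySearchSketch {
    // Stored fields; the position in the list is the document id.
    private final List<Map<String, String>> storedDocs = new ArrayList<Map<String, String>>();
    // Term -> ids of documents whose indexed field contains the term.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    /** Stores all fields and indexes the terms of one field; returns the new document id. */
    int addDocument(Map<String, String> fields, String indexedField) {
        int docId = storedDocs.size();
        storedDocs.add(new HashMap<String, String>(fields));
        String text = fields.get(indexedField);
        if (text == null) {
            return docId;
        }
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(term, ids);
            }
            // Avoid listing the same document twice when a term repeats.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        return docId;
    }

    /** Returns at most maxHits ids of documents containing the term (maxHits >= 0). */
    List<Integer> search(String term, int maxHits) {
        List<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return new ArrayList<Integer>();
        }
        return new ArrayList<Integer>(ids.subList(0, Math.min(maxHits, ids.size())));
    }

    /** Fetches the stored fields of a document by id. */
    Map<String, String> doc(int docId) {
        return storedDocs.get(docId);
    }

    public static void main(String[] args) {
        TinySearchSketch index = new TinySearchSketch();
        Map<String, String> fields = new HashMap<String, String>();
        fields.put("fieldname", "This is the text to be indexed.");
        index.addDocument(fields, "fieldname");
        for (int docId : index.search("text", 1000)) {
            System.out.println(index.doc(docId).get("fieldname"));
        }
    }
}
```

Keeping the stored fields separate from the term lookup mirrors why a hit has to be resolved with a second call before its content can be read.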

For the various other query types, refer to the official manual or the recommended PPT.

Step four: close the reader and the directory.

ireader.close();
directory.close();

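Finally, I wrote a simple example of my own that builds an index over the contents of a folder, filters the files by keyword, and reads their contents.

Creating the index: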

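/**
     * Creates an index for the files in the given directory.
     * @param path the directory to index
     * @return true (errors for individual files are printed, not reported)
     */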
    public static boolean createIndex(String path){
        Date date1 = new Date();
        List<File> fileList = getFileList(path);
        for (File file : fileList) {
            content = "";
            // Get the file extension
            String type = file.getName().substring(file.getName().lastIndexOf(".")+1);
            if("txt".equalsIgnoreCase(type)){
                
                content += txt2String(file);
            
            }else if("doc".equalsIgnoreCase(type)){
            
                content += doc2String(file);
            
            }else if("xls".equalsIgnoreCase(type)){
                
                content += xls2String(file);
                
            }
            
            System.out.println("name :"+file.getName());
            System.out.println("path :"+file.getPath());
//            System.out.println("content :"+content);
            System.out.println();
            
            
            try{
                analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
                directory = FSDirectory.open(new File(INDEX_DIR));
    
                File indexFile = new File(INDEX_DIR);
                if (!indexFile.exists()) {
                    indexFile.mkdirs();
                }
                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
                indexWriter = new IndexWriter(directory, config);
                
                Document document = new Document();
                document.add(new TextField("filename", file.getName(), Store.YES));
                document.add(new TextField("content", content, Store.YES));
                document.add(new TextField("path", file.getPath(), Store.YES));
                indexWriter.addDocument(document);
                indexWriter.commit();
                closeWriter();
    
                
            }catch(Exception e){
                e.printStackTrace();
            }
            content = "";
        }
        Date date2 = new Date();
        System.out.println("Index creation took: " + (date2.getTime() - date1.getTime()) + "ms\n");
        return true;
    }
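The createIndex method above calls getFileList(path), whose body is not shown in this article. A minimal sketch of what such a helper might look like is given below; it is an assumption, not the original code. The class name FileListSketch, the recursive traversal, and the extensionOf helper are all hypothetical. Unlike the substring call in createIndex, extensionOf returns an empty string for names without a dot instead of the whole file name.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helpers for walking a folder; not the article's original code. */
final class FileListSketch {
    /** Recursively collects the regular files under the given directory path. */
    static List<File> getFileList(String path) {
        List<File> result = new ArrayList<File>();
        collect(new File(path), result);
        return result;
    }

    private static void collect(File dir, List<File> result) {
        File[] children = dir.listFiles(); // null if dir is not a readable directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child, result);
            } else if (child.isFile()) {
                result.add(child);
            }
        }
    }

    /** Returns the lowercase extension without the dot, or "" if the name has no dot. */
    static String extensionOf(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Lists the files under the current directory with their extensions.
        for (File file : getFileList(".")) {
            System.out.println(file.getPath() + " -> " + extensionOf(file));
        }
    }
}
```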

Running the query:

/**
     * Searches the index and prints the matching files.
     * @param text the query string
     */
    public static void searchIndex(String text){
        Date date1 = new Date();
        try{
            directory = FSDirectory.open(new File(INDEX_DIR));
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            DirectoryReader ireader = DirectoryReader.open(directory);
            IndexSearcher isearcher = new IndexSearcher(ireader);
    
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content", analyzer);
            Query query = parser.parse(text);
            
            ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        
            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = isearcher.doc(hits[i].doc);
                System.out.println("____________________________");
                System.out.println(hitDoc.get("filename"));
                System.out.println(hitDoc.get("content"));
                System.out.println(hitDoc.get("path"));
                System.out.println("____________________________");
            }
            ireader.close();
            directory.close();
        }catch(Exception e){
            e.printStackTrace();
        }
        Date date2 = new Date();
        System.out.println("Index search took: " + (date2.getTime() - date1.getTime()) + "ms\n");
    }

Run result:

All files containing the keyword man were filtered out.

References

Complete guide to reading text in Java: http://blog.csdn.net/csh624366188/article/details/6785817

Official Lucene documentation: http://lucene.apache.org/core/4_0_0/core/overview-summary.html
