Lucene & Full-Text Search

阿新 • Published: 2019-01-10

Contents:
1. Full-text search
2. Getting started with Lucene
3. Advanced Lucene

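Full-Text Search

I. Search in everyday life:
1. Windows has a built-in search feature: open "My Computer" and press F3 to find a given file or folder. The search scope is every file on the computer.

2. The Eclipse help subsystem: click Help > Help Contents to look up help topics. The search scope is all of Eclipse's help files.
3. Search engines such as Baidu or Google can find web pages, PDF, DOC and PPT files, images, music, video and more across the internet.
4. Spotlight search on the Mac.
5. Looking up a keyword in a database, for example:
select * from topic where content like '%java%'
A LIKE pattern with a leading wildcard cannot use an ordinary index, so the database falls back to scanning every row.

Problems with these approaches:
1. Searching is slow.
2. Search quality is poor.
3. Results are not ranked by relevance.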

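II. What is full-text search?

Full-text search is a retrieval approach in which an indexing program scans every word in a document and builds an index entry for each word, recording how often and where the word occurs. When a user submits a query, the search program looks it up in the prebuilt index and returns the results. The process resembles finding a character through the lookup table at the front of a dictionary.

Before going further, let's look at how data is classified.

Structured data: data with a fixed format or bounded length, such as database records and metadata.
Semi-structured data: data with some structure but no rigid schema, such as XML or HTML.
Unstructured data: data of variable length or without a fixed format, such as emails and Word documents.
Unstructured data is also called full-text data, and searching full-text data is called full-text search.
Characteristic: it looks only at the text and does not consider semantics.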

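III. Why use full-text search?
Search speed: all data from the data source is indexed in advance, so a query consults the index instead of scanning the raw data.

Match quality: matching is done on terms, and with a suitable language analyzer implementation it supports Chinese and other non-English languages.

Relevance: a scoring algorithm ranks the results with the highest degree of match (similarity) first.

When to use it: a fuzzy query in a relational database cannot use the database's own indexes, so full-text search is needed to make it fast. For example:
Fuzzy queries over content in a website, such as
select * from article where content like '%上海平安%'
Fuzzy queries over product data in ERP systems, and article search in BBS and blog systems.
Search engines of all kinds, which depend on full-text search.
Indexing and searching only the sites of one domain (vertical search, such as "818 job search" or "Youdao shopping search").
Searching content held in Word, PDF and other document formats.
Other uses, such as the Sohu Pinyin input method and the Google input method.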

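IV. How it works

1. How can full-text data be searched?

Serial scanning: to find the files that contain a given string, look at the documents one by one and read each from beginning to end. If a document contains the string it is one of the files we want; then move on to the next file, until every file has been scanned. The built-in Windows search works this way.
How can full-text search be made faster?

Serial scanning of unstructured data is slow, whereas searching structured data is comparatively fast, because its structure lets us apply search algorithms that speed things up. So why not give our unstructured data some structure? Relational databases store structured data, which is why lookups in them are fast.
The information extracted from unstructured data and then reorganized is what we call an index.
Dictionaries and book catalogues work on the same principle.

2. The full-text search process

Indexing: extracting information from all the structured and unstructured data of the real world and building an index from it.
Searching: receiving the user's query, searching the index that was built, and returning the results.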

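3. A worked example

What should the index file contain?
The index file only needs to store each word together with the IDs of the documents that contain it.
To find the documents that contain is, shanghai and pingan, first fetch the list of documents containing is, then the lists of documents containing shanghai and pingan, and finally intersect the lists. The result is document 1 and document 3.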

Document 0
What is your name?
Document 1
My name is shanghai pingan!
Document 2
What is that?
Document 3
It is shanghai pingan, ShangHai Pingan

First, store the unstructured data in the document area:

Document ID	Content
0	What is your name?
1	My name is shanghai pingan!
2	What is that?
3	It is shanghai pingan, ShangHai Pingan

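How is the index built?
Step 1: the tokenizer component (Tokenizer) processes the documents. This step is called tokenizing.
1. Split the document into individual words (on whitespace).
2. Remove punctuation.
3. Remove stop words: very frequent function words such as is and it in English, or 的, 了 and 呢 in Chinese.
The output of the tokenizer is a list of tokens:
What, your, name, My, name, shanghai, pingan, What, that, shanghai, pingan, ShangHai, Pingan

Step 2: pass the tokens to the linguistic processor. For English it roughly does the following:
1. Convert to lowercase.
2. Reduce words to a root form by rule, such as "cars" to "car". This is called stemming.
3. Map words to their dictionary form, such as "drove" to "drive". This is called lemmatization.
The output of the linguistic processor is a list of terms. The distinct terms are:
what, your, name, my, shanghai, pingan, that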
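The two analysis steps can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: the class name AnalysisSketch and the two-word stop list are made up for the example, stemming and lemmatization are left out, and the text is lowercased before the stop-word check so that "It" and "it" are treated alike.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisSketch {
    // Hypothetical stop-word list, taken from the example in the text.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "it"));

    // Split on anything that is not a letter or digit, lowercase, then drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a delimiter
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("What is your name?"));
        System.out.println(analyze("It is shanghai pingan, ShangHai Pingan"));
    }
}
```

A real analyzer chain does more (stemming, configurable stop lists), but the shape is the same: text in, normalized terms out.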

Step 3: pass the terms to the indexer, which processes them as follows:
1. Build a dictionary table from the terms.

Term	Document
what	0
your	0
name	0
my	1
name	1
shanghai	1
pingan	1
what	2
that	2
shanghai	3
pingan	3
shanghai	3
pingan	3

2. Sort the dictionary alphabetically.

Term	Document
my	1
name	0
name	1
pingan	1
pingan	3
pingan	3
shanghai	1
shanghai	3
shanghai	3
that	2
what	0
what	2
your	0

3. Merge identical terms into posting lists (the inverted lists of documents).

Term	Occurrences	Document	Frequency	Document	Frequency
my	1	1	1	~	~
name	2	0	1	1	1
pingan	3	1	1	3	2
shanghai	3	1	1	3	2
that	1	2	1	~	~
what	2	0	1	2	1
your	1	0	1	~	~

In the end two parts are stored: a document area and an index area.

Term	Document IDs
what	0,2
your	0
name	0,1
my	1
shanghai	1,3,3
pingan	1,3,3
that	2
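To make the structure concrete, here is a minimal in-memory inverted index over the four sample documents, written in plain Java rather than with Lucene. The class and method names are invented for the sketch, posting lists hold only document IDs (no frequencies or positions), and stop words are deliberately kept so that the earlier query for is, shanghai and pingan can be answered by intersecting three lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted set of document IDs (a simplified posting list)
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String term = token.toLowerCase(Locale.ROOT);
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // AND query: intersect the posting lists of all query terms.
    static List<Integer> searchAll(Map<String, TreeSet<Integer>> index, String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> postings = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> index = build(Arrays.asList(
                "What is your name?",
                "My name is shanghai pingan!",
                "What is that?",
                "It is shanghai pingan, ShangHai Pingan"));
        System.out.println(index);
        System.out.println(searchAll(index, "is", "shanghai", "pingan"));
    }
}
```

Using a sorted map and sorted sets mirrors the "sort, then merge" steps: terms come out in alphabetical order and each posting list is already deduplicated.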

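The search side roughly works as follows:
1. Accept the search words and keywords the user entered and do some simple preprocessing.
2. Run lexical analysis, syntax analysis and language processing on the query.
3. Look up the lists of documents that contain the query terms and apply the required boolean logic.
4. Rank the documents by relevance and return the most relevant ones.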

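4. Document relevance

Computing term weights:
1. Working out how important a term is to a document is called computing the term weight. Two factors matter most:
A. Term frequency (tf): how many times the term occurs in this document. The larger tf is, the more important the term.
B. Document frequency (df): how many documents contain the term. The larger df is, the less important the term.

2. Working out the relationship between terms to obtain document relevance is done with the vector space model (VSM).
How it works: a document is treated as a series of terms, each with its own term weight, and each term influences the document's relevance score according to its weight in that document.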
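The interplay of tf and df can be shown with the textbook tf-idf weight, tf × ln(N / df), where N is the number of documents. This formula is an assumption chosen for illustration; it is not Lucene's actual scoring formula, which adds normalization and other factors. The class name is hypothetical.

```java
class TermWeightSketch {
    // Textbook tf-idf: grows with tf, shrinks as df approaches the total number of documents.
    static double weight(int tf, int df, int totalDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) totalDocs / df);
    }

    public static void main(String[] args) {
        int totalDocs = 4; // the four sample documents
        // "shanghai" occurs twice in document 3 and appears in 2 of the 4 documents
        System.out.println("shanghai in doc 3: " + weight(2, 2, totalDocs));
        // "name" occurs once in document 1 and appears in 2 of the 4 documents
        System.out.println("name in doc 1: " + weight(1, 2, totalDocs));
        // a term present in every document carries no weight under this formula
        System.out.println("term in all docs: " + weight(1, 4, totalDocs));
    }
}
```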

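5. Full-text search application architecture

6. How the full-text search pipeline maps onto Lucene's packages

Lucene's analysis module handles lexical analysis and language processing to produce terms.
Lucene's index module handles index creation and contains IndexWriter.
Lucene's store module handles reading and writing the index.
Lucene's QueryParser handles syntax analysis of queries.
Lucene's search module handles searching the index.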

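Getting Started with Lucene

What is Lucene?

Lucene is a high-performance, scalable full-text search engine toolkit written in Java. It can easily be embedded in all kinds of applications to give them full-text indexing and search. Lucene's goal is to add full-text search to small and medium-sized applications.

Development steps

Creating the index files

1. Create a test class LuceneTest.
2. Add the jars:
lucene-core-4.10.4.jar (core)
lucene-analyzers-common-4.10.4.jar (analyzers)
3. Create an IndexWriter, passing in where the index should be stored and an IndexWriterConfig (which carries the version and the analyzer).
4. The index is written as binary files that are awkward to read directly, so use a tool (lukeall-4.10.0.jar) to inspect it.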

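import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class LuceneTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "hello";
    Analyzer analyzer = new StandardAnalyzer();//analyzer

    @Test
    public void testCreateIndex() throws Exception {
        //1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath));//where the index is stored
        //Create the index writer configuration
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, conf);
        //2. Write the documents
        //Define how the fields are indexed and stored
        FieldType type = new FieldType();
        type.setIndexed(true);//index the field
        type.setStored(true);//store the field value
        Document document1 = new Document();//comparable to one row in a database
        //new Field(field name, field value, field type)
        document1.add(new Field("title", "doc1", type));//field title: doc1
        document1.add(new Field("content", content1, type));//content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type));
        writer.addDocument(document3);

        //Save the added records
        writer.commit();
        writer.close();
    }
}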

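Running the test class creates a hello folder under the project directory.

Opening the _0.cfs file in a text editor tells us nothing useful.

Use the tool (lukeall-4.10.0.jar) to inspect the index.
Just run java -jar lukeall-4.10.0.jar from a terminal.

In the Path field, enter the absolute path of the hello index directory.

Click OK to view the index.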

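Searching the index

0. Add the jar lucene-queryparser-4.10.4.jar (it turns a string into a Query object).
1. Create a test method searchIndex().
2. Create the search object, IndexSearcher.
3. Parse the query text into a Query object, specifying the default field and the analyzer.
4. Use the searcher to find the matching document IDs.
5. Fetch the document content by document ID.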

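//Searching the index
//Additional imports: org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.IndexReader,
//org.apache.lucene.search.IndexSearcher, Query, ScoreDoc, TopDocs, and
//org.apache.lucene.queryparser.classic.QueryParser
@Test
 public void searchIndex() throws Exception {
    //1. Open the directory that holds the index
    Directory d = FSDirectory.open(new File(indexPath));
    //Create the analyzer (it should match the one used for indexing)
    Analyzer analyzer = new StandardAnalyzer();
    //Open a reader on the index directory
    IndexReader r = DirectoryReader.open(d);
    //Create the search object
    IndexSearcher searcher = new IndexSearcher(r);
    //Query text is parsed against the "content" field by default
    QueryParser parser = new QueryParser("content", analyzer);

    Query query = parser.parse("hello");//search for hello
    //search(query, n): return the top n matching records
    TopDocs search = searcher.search(query, 10000);
    System.out.println("Number of matching records: " + search.totalHits);

    ScoreDoc[] scoreDocs = search.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        System.out.println("*******************************");
        System.out.println("Score: " + scoreDocs[i].score);//relevance score used for ranking
        int docId = scoreDocs[i].doc;//document ID
        Document document = searcher.doc(docId);
        System.out.println("Document ID docId--->" + docId);
        System.out.println("Content content:--->" + document.get("content"));
    }
    r.close();
}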


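Common APIs

Directory: the index directory, which holds Lucene's index files.
Directory is an abstraction over the index location. An index can live in ordinary files, in a database, or in some other remote service. Normally files are used, in which case a Directory corresponds to a folder.
SimpleFSDirectory: operates on the index files directly through java.io.RandomAccessFile. It is the simplest option and is fine for ordinary Lucene applications.

Constructors:
SimpleFSDirectory(File path): creates an index directory from a folder path.
MMapDirectory(File path): has the OS map the whole index into virtual address space, so that Lucene can treat the index as if it were in memory.

Document: when content is added to the index, each item is represented by a Document. A Document can be thought of as a record, similar to a row in a relational table.
Once a Document has been created, call its field methods to work with the Field objects it contains.
The main methods of Document are:
Adding a field: add(Field field)
Removing fields: removeField, removeFields
Getting fields or values: get, getBinaryValue, getField, getFields and so on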

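Field: a Field represents one field of a Document, equivalent to one column of a Lucene record.
In Lucene 3.x, fields were defined by the Fieldable interface, which most of the API programmed against and which was implemented by Field, NumericField and others. In Lucene 4.x that role is played by IndexableField. In practice the Field class is what you mostly use.
Common constructors of Field (the Field.Store/Field.Index forms come from the older pre-4.0 API and are deprecated in 4.x in favour of FieldType):
1. Field(String name, String value, Field.Store store, Field.Index index): creates a field from a name, a value, a storage mode and an indexing mode.
2. Field(String name, byte[] value, Field.Store store): creates a field from a name, a binary value and a storage mode.
3. Field(String name, Reader reader): creates a field from a name and a Reader.
4. Other constructors: see the API documentation.
new Field("title", "中國太平", Store.NO, Index.ANALYZED);
new Field("content", "比較好的保險公司", Store.YES, Index.ANALYZED);

FieldType: when creating a Field in Lucene, you can specify its stored, indexed and tokenized properties.
stored: whether the field value is stored. true stores it, false does not.
type.setStored(true);//store the value in the document area
indexed: whether the field is indexed, that is, whether it can be searched.
tokenized: whether the field value is split into tokens by the Analyzer.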

Create a FieldTest test class (copy the class above and rename it).
Define the storage rules for the field:

  FieldType type2 = new FieldType();
  type2.setIndexed(true);//index this field
  type2.setStored(true);//store the value in the document area
  type2.setTokenized(false);//do not tokenize this field

Use type2 for the content field of every document:

document1.add(new Field("content", content1, type2));
document2.add(new Field("content", content2, type2));
document3.add(new Field("content", content3, type2));

public class FieldTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "fieldType";
    Analyzer analyzer = new StandardAnalyzer();//analyzer

    //Create the index
    @Test
    public void testCreateIndex() throws Exception {
        //1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath));//where the index is stored
        //Create the index writer configuration
        IndexWriterConfig confg = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        confg.setOpenMode(IndexWriterConfig.OpenMode.CREATE);//recreate the index on every run
        IndexWriter writer = new IndexWriter(d, confg);
        //2. Write the documents
        //Field type for the title: indexed, stored and tokenized
        FieldType type = new FieldType();
        type.setIndexed(true);//index this field
        type.setStored(true);//store the field value
        type.setTokenized(true);

        //Field type for the content: indexed and stored, but not tokenized
        FieldType type2 = new FieldType();
        type2.setIndexed(true);//index this field
        type2.setStored(true);//store the field value
        type2.setTokenized(false);//do not tokenize this field

        Document document1 = new Document();//comparable to one row in a database
        //new Field(field name, field value, field type)
        document1.add(new Field("title", "doc1", type));//field title: doc1
        document1.add(new Field("content", content1, type2));//content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type2));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type2));
        writer.addDocument(document3);

        //Save the added records
        writer.commit();
        writer.close();
    }
}

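Run the test class.

Inspect the index.

When searching for user names or place names we want the value to stay a single term rather than be split up. In that case set the field's tokenized property to false so that it is not analyzed.
In the example index:
1. The title and the content are both indexed through the analyzer.
2. The title is stored in full in the document area, while only the first 30 characters of the content are stored.
3. The article ID is stored in the document area but is not tokenized.
4. The time, author, view count, comment count and source are neither indexed nor stored.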

Analyzer (lexical analyzer)

Create a test class AnalyzerTest.
Add a helper method analyzerMethod(Analyzer analyzer, String content) for trying out each analyzer:

public class AnalyzerTest {
    String en = "good morning boy";
    String ch = "你好 恭喜發財 東方明珠三生三世十里桃花";

    //Helper, not a test: JUnit test methods cannot take parameters
    public void analyzerMethod(Analyzer analyzer, String content) throws Exception {

        TokenStream tokenStream = analyzer.tokenStream("content", content);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println(tokenStream);//prints the attributes of the current token
        }
        tokenStream.end();
        tokenStream.close();
    }

    //SimpleAnalyzer on English text
    @Test
    public void testSimpleAnalyzer() throws Exception {
        analyzerMethod(new SimpleAnalyzer(), en);
    }
}

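English analysis:
SimpleAnalyzer: the simplest analyzer. It splits text at every non-letter character and lowercases the tokens.

  //SimpleAnalyzer on English text
    @Test
    public void testSimpleAnalyzer() throws Exception {
        analyzerMethod(new SimpleAnalyzer(), en);
    }

StandardAnalyzer: analyzes by English words and by individual Chinese characters.

 //StandardAnalyzer on English text
    @Test
    public void testStandardAnalyzer() throws Exception {
        analyzerMethod(new StandardAnalyzer(), en);
    }

For this English sample StandardAnalyzer's output looks like a split on spaces; it also lowercases the tokens and removes English stop words.
Now try it on Chinese text (for Chinese it emits one token per character).

 //StandardAnalyzer on Chinese text
    @Test
    public void testStandardAnalyzer() throws Exception {
        analyzerMethod(new StandardAnalyzer(), ch);
    }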

PerFieldAnalyzerWrapper:

@Test
public void testPerFieldAnalyzerWrapper() throws Exception {
  Map<String, Analyzer> analyzerMap = new HashMap<>();
  analyzerMap.put("en", new SimpleAnalyzer());//field "en" uses SimpleAnalyzer
  analyzerMap.put("ch", new StandardAnalyzer());//field "ch" uses StandardAnalyzer
  //The first argument is the default analyzer
  PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(), analyzerMap);
   //The wrapper looks up the analyzer registered for the field name that is passed in
   //If no analyzer is registered for that field, the default analyzer is used
   //tokenStream("content", text): "content" is not in the map, so the default SimpleAnalyzer is chosen
   TokenStream tokenStream = wrapper.tokenStream("content", ch);
   tokenStream.reset();
   while (tokenStream.incrementToken()) {
     System.out.println(tokenStream);
   }
   tokenStream.end();
   tokenStream.close();
 }

Chinese analysis:
StandardAnalyzer: single-character segmentation; every character becomes a token.

//StandardAnalyzer on Chinese text
@Test
public void testStandardAnalyzer() throws Exception {
   analyzerMethod(new StandardAnalyzer(), ch);
 }

CJKAnalyzer: bigram segmentation; every two adjacent characters become a token. For example 我們是中國人 becomes 我們, 們是, 是中, 中國, 國人.

//CJKAnalyzer on Chinese text
@Test
public void testCJKAnalyzer() throws Exception {
    analyzerMethod(new CJKAnalyzer(), ch);
}
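The bigram idea itself fits in a few lines of plain Java. The sketch below is not CJKAnalyzer's implementation; it simply emits every pair of adjacent characters, with a hypothetical class name. The sample string is written with \u escapes so that the file compiles under any source encoding; it spells 我們是中國人.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Emit every pair of adjacent characters; a single character is returned as-is.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Six characters produce five overlapping two-character tokens.
        System.out.println(bigrams("\u6211\u5011\u662F\u4E2D\u570B\u4EBA"));
    }
}
```

Bigrams need no dictionary, at the cost of indexing many pairs that are not real words.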

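SmartChineseAnalyzer: dictionary-based segmentation. Chinese words are kept in a dictionary that is maintained by some algorithm, and any dictionary word found in the text is cut out as a token. Dictionary segmentation is generally regarded as the best approach to Chinese word segmentation. For example 我們是中國人 becomes 我們 and 中國人. (You can use SmartChineseAnalyzer, the "極易分詞" MMAnalyzer, the "庖丁分詞" analyzer, or IKAnalyzer. IKAnalyzer is recommended.)

//SmartChineseAnalyzer on Chinese text
//Requires lucene-analyzers-smartcn-4.10.4.jar
@Test
public void testSmartChineseAnalyzer() throws Exception {
   analyzerMethod(new SmartChineseAnalyzer(), ch);
  }
}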
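One simple way to segment with a dictionary is forward maximum matching: at each position take the longest dictionary word that fits. The sketch below illustrates that technique only; it is not how SmartChineseAnalyzer or IKAnalyzer work internally, and the three-word dictionary and class name are made up. The strings use \u escapes for portability: the text spells 我們是中國人 and the dictionary holds 我們, 中國 and 中國人.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {
    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--; // shrink the candidate until it is a dictionary word or one character
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u6211\u5011", "\u4E2D\u570B", "\u4E2D\u570B\u4EBA"));
        System.out.println(segment("\u6211\u5011\u662F\u4E2D\u570B\u4EBA", dict, 3));
    }
}
```

Because the longest match wins, 中國人 is kept whole rather than split into 中國 and 人. The character 是 is not in the toy dictionary, so it comes out as a single-character token.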

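IKAnalyzer: a third-party analyzer
1. Add the jar IKAnalyzer2012FF_u1.jar (it is not available from Maven Central). It supports stop words and custom extension words.
2. Add a stop-word dictionary, stopword.dic.
3. Add an extension dictionary, ext.dic.

//IKAnalyzer on Chinese text
//Requires IKAnalyzer2012FF_u1.jar
 @Test
 public void testIKAnalyzer() throws Exception {
   analyzerMethod(new IKAnalyzer(), ch);
 }

To drop particles such as "的", "了" and "嗎", add the configuration files
IKAnalyzer.cfg.xml and stopword.dic.

List the words you do not want in stopword.dic and the analyzer will no longer emit them as tokens.

To add terms of your own, register an extension dictionary ext.dic in the configuration file IKAnalyzer.cfg.xml,
then list your custom terms in ext.dic.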

Updating the index

public class CRUDTest {

    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "luncecrud";
    Analyzer analyzer = new StandardAnalyzer();//analyzer

    //Create the index
    @Test
    public void testCreateIndex() throws Exception {
        //1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath));//where the index is stored
        //Create the index writer configuration
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, conf);
        //2. Write the documents
        //Define how the fields are indexed and stored
        FieldType type = new FieldType();
        type.setIndexed(true);//index the field
        type.setStored(true);//store the field value
        Document document1 = new Document();//comparable to one row in a database
        //new Field(field name, field value, field type)
        document1.add(new Field("title", "doc1", type));//field title: doc1
        document1.add(new Field("content", content1, type));//content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type));
        writer.addDocument(document3);

        //Save the added records
        writer.commit();
        writer.close();
        testSearch();
    }

    @Test
    public void testUpdate() throws Exception {
        //Create the index writer
        Directory d = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, config);
        //Update a document
        Term term = new Term("title", "doc2");//which documents to update
        Document updateDoc = new Document();//the document after the update
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true);
        updateDoc.add(new Field("title", "doc2", type));
        updateDoc.add(new Field("content", "hello黃河之水天上來吧我要更新內容啦", type));
        writer.updateDocument(term, updateDoc);
        //Commit the update and release resources
        writer.commit();
        writer.close();
        testSearch();
    }

    //Searching the index
    @Test
    public void testSearch() throws Exception {
        //1. Open the directory that holds the index
        Directory d = FSDirectory.open(new File(indexPath));

        //Open a reader on the index directory
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);
        QueryParser parser = new QueryParser("content", analyzer);

        Query query = parser.parse("hello");//search for hello
        //search(query, n): return the top n matching records
        TopDocs search = searcher.search(query, 10000);
        System.out.println("Number of matching records: " + search.totalHits);
        ScoreDoc[] scoreDocs = search.scoreDocs;
        for (int i = 0; i < scoreDocs.length; i++) {
            System.out.println("*******************************");
            System.out.println("Score: " + scoreDocs[i].score);//relevance score used for ranking
            int docId = scoreDocs[i].doc;//document ID
            Document document = searcher.doc(docId);
            System.out.println("Document ID docId--->" + docId);
            System.out.println("Title title:--->" + document.get("title"));
            System.out.println("Content content:--->" + document.get("content"));
        }
        r.close();
    }
}

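First write the index-creation method testCreateIndex() and the search method testSearch(), then add an update method testUpdate().
Run testCreateIndex() first.

Then run testUpdate().

The content of the document titled doc2 is replaced with the new content, and its document ID changes: the document with ID 1 is deleted and a new document with ID 3 is added. This shows that an update is a delete followed by an add.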
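The delete-then-add behaviour can be mimicked with a toy append-only store in plain Java. Everything here is hypothetical (the class, the two-element array used as a document, the deleted-ID set); the only point is to show why the updated document ends up with a new ID while the old ID disappears.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UpdateSketch {
    final List<String[]> docs = new ArrayList<>();   // each doc is {title, content}; list position = doc ID
    final Set<Integer> deleted = new HashSet<>();    // IDs marked as deleted

    int add(String title, String content) {
        docs.add(new String[] {title, content});
        return docs.size() - 1;
    }

    // Update = mark every live doc with this title as deleted, then append the new version.
    int update(String title, String newContent) {
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.contains(id) && docs.get(id)[0].equals(title)) {
                deleted.add(id);
            }
        }
        return add(title, newContent);
    }

    List<Integer> liveIds() {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.contains(id)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        UpdateSketch store = new UpdateSketch();
        store.add("doc1", "hello world");
        store.add("doc2", "hello java world");
        store.add("doc3", "hello lucene world");
        int newId = store.update("doc2", "hello updated content");
        System.out.println("new ID: " + newId + ", live IDs: " + store.liveIds());
    }
}
```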

Deleting documents from the index

 @Test
    public void testDelete()throws Exception{
        //Create the index writer
        Directory d = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, config);
        //Delete records
        /**
         * Option 1: delete by term
         Term term=new Term("title","doc2");
         writer.deleteDocuments(term);
         */
        //Option 2: delete by query
        QueryParser parser = new QueryParser("title", analyzer);
        Query query = parser.parse("doc3");
        writer.deleteDocuments(query);

        //Commit the deletion
        writer.commit();
        writer.close();
        testSearch();
    }

Advanced Lucene

Matching all documents

//Search helper 1: parse a query string, then run it
public void search(String content) throws Exception {
  //Query text is parsed against the "content" field by default
  QueryParser parser = new QueryParser("content", new StandardAnalyzer());
  search(parser.parse(content));
}


//Search helper 2: run a Query object
public void search(Query query) throws Exception {
  //Open the directory that holds the index
  Directory d = FSDirectory.open(new File(indexPath));
  //Open a reader on the index directory
  IndexReader r = DirectoryReader.open(d);
  IndexSearcher searcher = new IndexSearcher(r);
  //search(query, n): return the top n matching records
  TopDocs search = searcher.search(query, 10000);
  System.out.println("Number of matching records: " + search.totalHits);

  ScoreDoc[] scoreDocs = search.scoreDocs;
  for (int i = 0; i < scoreDocs.length; i++) {
    System.out.println("*******************************");
    System.out.println("Score: " + scoreDocs[i].score);//relevance score used for ranking
    int docId = scoreDocs[i].doc;//document ID
    Document document = searcher.doc(docId);
    System.out.println("Document ID docId--->" + docId);
    System.out.println("Title title:--->" + document.get("title"));
    System.out.println("Content content:--->" + document.get("content"));
  }
  r.close();
}

@Test
public void test1() throws Exception {
  search("*:*");//match all documents, in all fields
  search(new MatchAllDocsQuery());
}

Term search

 /**
     * Term search
     *
     * @throws Exception
     */
    @Test
    public void test2() throws Exception {
        //search("title:doc1"); //query-string form, handled by search(String content)
        search(new TermQuery(new Term("title", "doc1")));//Query form, handled by search(Query query)
    }

Phrase search

 /**
     * Phrase search
     * @throws Exception
     */
    @Test
    public void test3() throws Exception {
     // search("content:\"hello world\"");
      PhraseQuery query =new PhraseQuery();
      query.add(new Term("content","hello"));
      query.add(new Term("content","world"));
      search(query);
    }

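Wildcard search

/**
 * Wildcard search
 * @throws Exception
 */
@Test
public void test4() throws Exception {
//Option 1: * in a query string
 search("l*ne");
//Option 2: ? in a query string
 search("lucen?");
//Option 3: build the query object directly
 WildcardQuery query = new WildcardQuery(new Term("content","l*ne"));
   search(query);
}

In search("l*ne"), * stands for any number of characters (including none).
In search("lucen?"), ? stands for exactly one character.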
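The meaning of the two wildcards can be checked without an index by translating a wildcard pattern into a regular expression. This is a sketch of the semantics only, with a hypothetical class name; Lucene's WildcardQuery does not work this way internally.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Translate * and ? into their regex equivalents and quote every other character literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");   // zero or more characters
            } else if (c == '?') {
                sb.append('.');    // exactly one character
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True when the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("l*ne", "lucene"));
        System.out.println(matches("lucen?", "lucene"));
        System.out.println(matches("lucen?", "lucen"));
    }
}
```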

Fuzzy term search

Lucene tolerates misspelled terms: content:lucenx~1 allows an edit distance of one character. In content:lucenx~N the maximum value of N is 2.

@Test
public void test5() throws Exception{
search("content:lxcenX~2");
FuzzyQuery query = new FuzzyQuery(new Term("content","lucenx"),1);//maxEdits = 1
search(query);
}
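The "number of letters off" in a fuzzy query is an edit distance. A plain Levenshtein distance (insertions, deletions and substitutions) is enough to see why lucenx is one edit from lucene and lxcenx is two. This sketch is for intuition only: the class name is made up, and Lucene's FuzzyQuery uses its own automaton-based implementation that by default also counts a transposition as a single edit.

```java
class EditDistanceSketch {
    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucenx", "lucene"));
        System.out.println(distance("lxcenx", "lucene"));
    }
}
```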

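In Lucene 3.x and earlier, the number after the tilde (~) was a similarity between 0 and 1, such as ~0.85 or ~0.3, where values closer to 1 demand a closer match and the default was 0.5. In Lucene 4.x the number is the maximum edit distance (0 to 2, default 2); fractional values are still accepted but deprecated.

@Test
public void test6() throws Exception{
search("lqcenX~1");
FuzzyQuery query = new FuzzyQuery(new Term("content","lqcenX"));//default maxEdits = 2
search(query);
}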

Phrase search with slop (proximity search)

content:"hello world"~1 means one other word may appear inside the phrase.
content:"hello world"~N means up to N other words may appear inside the phrase.

 /**
  * Phrase search with slop (proximity search)
  * @throws Exception
  */
 @Test
 public void test7() throws Exception{
  //~1 means one other word may appear inside the phrase
  //content:\"hello world\"~N means up to N other words may appear inside the phrase
  //search("content:\"hello world\"~1");
   PhraseQuery query = new PhraseQuery();
   query.add(new Term("content","hello"));
   query.add(new Term("content","world"));
   query.setSlop(1);//allow one intervening word
   search(query);
 }
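A simplified version of the slop check can be written over a list of tokens. The sketch assumes a two-word phrase matched in order and counts the words sitting between the two terms. Lucene's real slop is more general (it is a number of position moves and can also match reordered terms), so treat this as an approximation with a hypothetical class name.

```java
import java.util.Arrays;
import java.util.List;

class SlopSketch {
    // In-order check: "first ... second" matches when at most `slop` other words sit between them.
    static boolean matches(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of words between position i and position j
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("hello", "java", "world");
        System.out.println(matches(doc, "hello", "world", 0));
        System.out.println(matches(doc, "hello", "world", 1));
    }
}
```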

Range search

 /**
  * Range search
  */
@Test
public void test8() throws Exception {
//  {: exclusive lower bound
//  }: exclusive upper bound
//  [: inclusive lower bound
//  ]: inclusive upper bound
//search("inputtime:{20101010 TO 20101012}");
//TermRangeQuery(field, lower term, upper term, include lower bound, include upper bound);
  TermRangeQuery query = new TermRangeQuery("inputtime", new BytesRef("20101010"), new BytesRef("20101012"), false, false);
   search(query);
}
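A term range compares terms as strings, not as numbers, which is why dates written as fixed-width yyyyMMdd strings work well. The sketch below shows the comparison with inclusive and exclusive bounds. It uses Java's String.compareTo, which agrees with byte-wise term ordering for ASCII text such as these dates; the class name is made up.

```java
class TermRangeSketch {
    // Lexicographic range test with configurable inclusive/exclusive bounds.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // {20101010 TO 20101012}: both bounds exclusive, so only 20101011 is inside
        System.out.println(inRange("20101011", "20101010", "20101012", false, false));
        System.out.println(inRange("20101010", "20101010", "20101012", false, false));
        // Strings are not numbers: "9" sorts after "20", so it falls outside [10 TO 20]
        System.out.println(inRange("9", "10", "20", true, true));
    }
}
```

The last call is the usual pitfall: numeric values need zero padding (or a numeric field type) before a term range behaves like a numeric range.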

Combined (boolean) queries

AND and &&: find documents whose title contains one and whose content contains java.
Either form works:
title:one && content:java
title:one AND content:java

/**
 * Combined query with AND / &&
 * @throws Exception
 */
 @Test
 public  void test9() throws Exception {
   //search("content:hello AND inputtime:{20101010 TO 20101012}");
    search("content:hello && inputtime:{20101010 TO 20101012}");
   /*
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content","hello")), BooleanClause.Occur.MUST);
    query.add(new TermRangeQuery("inputtime",new BytesRef("20101010"),new BytesRef("20101012"),false,false), BooleanClause.Occur.MUST);
    search(query);
   */
 }

OR and ||: find documents whose title contains one or whose content contains java.
By default, terms are combined with OR.
All three forms work:
title:one || content:java
title:one OR content:java
title:one content:java

/**
 * Combined query with OR / ||
 * @throws Exception
 */
@Test
public  void test10() throws Exception {
//search("content:lucene OR inputtime:{20101010 TO 20101012}");
//search("content:lucene || inputtime:{20101010 TO 20101012}");
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("content","lucene")), BooleanClause.Occur.SHOULD);
  query.add(new TermRangeQuery("inputtime",new BytesRef("20101010"),new BytesRef("20101012"),false,false), BooleanClause.Occur.SHOULD);
  search(query);
}

NOT and !: find documents whose title contains one but whose content does not contain java.
Either form works:
title:one ! content:java
title:one NOT content:java

/**
 * Combined query with NOT / !
 * @throws Exception
 */
@Test
public  void test11() throws Exception {
  //search("content:lucene NOT inputtime:{20101010 TO 20101012}");
  //search("content:lucene ! inputtime:{20101010 TO 20101012}");
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content","lucene")), BooleanClause.Occur.MUST);
    query.add(new TermRangeQuery("inputtime",new BytesRef("20101010"),new BytesRef("20101012"),false,false), BooleanClause.Occur.MUST_NOT);
    search(query);
}

Required (+) and prohibited (-): find documents whose title contains one but whose content does not contain java.
+title:one -content:java

Boosting

Lucene lets us set the relevance weight of an individual term in a combined query, so that results matching that term rank higher.
To boost a term, append ^n to it.
For example, when searching for jakarta apache, to make jakarta count more, write jakarta^4 apache.
Boosts also apply to phrase queries: "jakarta apache"^4 "Apache Lucene" ranks documents matching the phrase jakarta apache first.
The default boost is 1. Boosts are usually set to an integer greater than 1; a fraction between 0 and 1 is also allowed, but a negative value is not.

/**
 * Boosting
 * @throws Exception
 */
@Test
public  void test12() throws Exception {
 //search("content:lucene^10 java");
   BooleanQuery query = new BooleanQuery();
   TermQuery termQuery = new TermQuery(new Term("content", "lucene"));
   termQuery.setBoost(10);//boost this clause
   query.add(termQuery, BooleanClause.Occur.SHOULD);
   query.add(new TermQuery(new Term("content","java")), BooleanClause.Occur.SHOULD);
  search(query);
}

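Special characters

Because symbols such as | & ! + - ( ) are used as operators in query expressions, they must be escaped with \ when you want to search for them literally.
The special characters in Lucene query syntax are: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
For example, to search for documents containing (1+1):2, use the expression:
\(1\+1\)\:2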
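Escaping by hand is error-prone, so it is usually done in code. Lucene's classic QueryParser offers a static escape(String) helper for this; the sketch below is a stand-alone illustration of the same idea, not that method's source. Its character list is taken from the list above, and the class name is hypothetical.

```java
class EscapeSketch {
    // Characters from the list above; && and || are covered by escaping & and | individually.
    static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    // Put a backslash in front of every special character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2"));
    }
}
```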

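Grouping
Use parentheses () to group query expressions.
Lucene's query syntax supports grouping with (), which lets you build complex queries.
1. Find documents whose title contains one or two but whose content does not contain java:
Query query=parser.parse("title:(one OR two) NOT content:java");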

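Highlighting

1. Overview: extract a snippet from each search result and highlight the parts that match the query.
Highlighting requires the jar lucene-highlighter-4.10.4.jar.
2. Highlighting involves two things: A. extracting the snippet, and B. marking up the matches.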

Formatter formatter = new SimpleHTMLFormatter("<font color=\"red\">","</font>");
Scorer scorer = new QueryScorer(query);
Highlighter hl = new Highlighter(formatter,scorer);
hl.setMaxDocCharsToAnalyze(20);
String str=hl.getBestFragment(new StandardAnalyzer(), "content",doc.get("content"));
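The two parts of highlighting, cutting a snippet and marking the matches, can be imitated in plain Java. This sketch is an assumption-laden stand-in for the Highlighter classes: it takes the snippet from the start of the text, matches a single term case-insensitively, and assumes that lowercasing does not change the text length (true for ASCII). The class name and parameters are made up.

```java
import java.util.Locale;

class HighlightSketch {
    // Cut a window of at most maxChars from the start of the text,
    // then wrap each case-insensitive occurrence of the term in the given tags.
    static String highlight(String text, String term, int maxChars, String pre, String post) {
        String fragment = text.length() > maxChars ? text.substring(0, maxChars) : text;
        String lowerFragment = fragment.toLowerCase(Locale.ROOT);
        String lowerTerm = term.toLowerCase(Locale.ROOT);
        if (lowerTerm.isEmpty()) {
            return fragment;
        }
        StringBuilder sb = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = lowerFragment.indexOf(lowerTerm, from)) >= 0) {
            sb.append(fragment, from, hit)
              .append(pre)
              .append(fragment, hit, hit + lowerTerm.length())
              .append(post);
            from = hit + lowerTerm.length();
        }
        return sb.append(fragment.substring(from)).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("hello lucene world", "lucene", 20,
                "<font color=\"red\">", "</font>"));
    }
}
```

As in the snippet above, limiting the analyzed characters means a match beyond the window is never highlighted.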
