Learning Lucene: Spell Checking

阿新 • Published: 2019-01-02

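When using a search engine, we often run into the following situation: we mistype a query, and the engine answers with a "Did you mean ...?" suggestion.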

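This is spell checking at work, and Lucene's suggest module exists for exactly this purpose.
The first thing it needs is a valid source dictionary for spell checking. The PlainTextDictionary used below reads a plain text file with one word per line.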

private static String dicpath = "G:\\downloads\\LJParser_release\\dictionary.dic";
// Index the dictionary file.
// The last argument, fullMerge, says whether the spell-check index should be fully merged.
// In short: indexDictionary breaks each word of the dictionary file into n-grams and writes them to the index.
spellchecker.indexDictionary(new PlainTextDictionary(Paths.get(dicpath)), config, true);
// Compute the suggestions that are finally returned.
String[] suggests = spellchecker.suggestSimilar(word, numSug);

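Similar words are identified with n-grams. Put simply, the n-grams of a word are all of its runs of adjacent letters of a given length. For lucene with ngram=3, for example, the letter combinations are luc|uce|cen|ene. The choice of ngram therefore directly affects how efficiently candidates are matched.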
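To make the definition concrete, here is a minimal sketch of n-gram extraction. The class and method names (NGramSketch, ngrams) are made up for illustration; this is not Lucene's internal implementation, only the idea described above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NGramSketch and ngrams are hypothetical names, not Lucene API.
public class NGramSketch {

    // Returns every run of n adjacent characters in word, from left to right.
    public static List<String> ngrams(String word, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints: luc|uce|cen|ene
        System.out.println(String.join("|", ngrams("lucene", 3)));
    }
}
```

A word shorter than n yields no n-grams at all, which is one reason the chosen length matters for short words.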
Now let's go straight to the code (adapted from 益達):

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Another feature of the suggest module: spelling correction
public class SpellCheckTest {
    private static String dicpath = "G:\\downloads\\LJParser_release\\dictionary.dic";
    private Document document;
    private Directory directory = new RAMDirectory();
    private IndexWriter indexWriter;
    // Spell checker
    private SpellChecker spellchecker;
    private IndexSearcher indexSearcher;
    private IndexReader indexReader;

    // Build the test index: one document with a single "content" field
    public void CreateIndex(String content) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        indexWriter = new IndexWriter(directory, config);
        document = new Document();
        document.add(new TextField("content", content, Store.YES));
        try {
            indexWriter.addDocument(document);
            indexWriter.commit();
            indexWriter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void search(String word, int numSug) {
        // Use a fresh in-memory directory to hold the spell-check index
        directory = new RAMDirectory();
        try {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            spellchecker = new SpellChecker(directory);
            // Index the dictionary file.
            // The last argument, fullMerge, says whether the spell-check index should be fully merged.
            // In short: indexDictionary breaks each word of the dictionary file into n-grams
            // and writes them to the index.
            spellchecker.indexDictionary(new PlainTextDictionary(Paths.get(dicpath)), config, true);
            // numSug is the number of suggestions to return
            String[] suggests = spellchecker.suggestSimilar(word, numSug);
            // Similarity between two words is computed by LevensteinDistance by default;
            // see the LevensteinDistance source for the details of the algorithm.
            if (suggests != null && suggests.length > 0) {
                for (String suggest : suggests) {
                    System.out.println("Did you mean: " + suggest);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws IOException {  
        SpellCheckTest spellCheckTest = new SpellCheckTest();  
        spellCheckTest.CreateIndex("《屌絲男士》不是傳統意義上的情景喜劇，有固定時長和單一場景，" +
                "以及簡單的生活細節。而是一部具有鮮明網路特點，舞臺感十足，" +
                "整體沒有劇情銜接，固定的演員演繹著並不固定角色的笑話集。");  
        spellCheckTest.CreateIndex("屌絲男士的拍攝構想，首先源於“屌絲文化”在中國的颳起的現象級春風，" +
                "紅透了整片天空，全中國上下可謂無人不屌絲，無人不愛屌絲。");  
        spellCheckTest.CreateIndex("德國的一部由女演員瑪蒂娜-希爾主演的系列短劇，憑藉其瘋癲荒誕、自high" +
                "耍賤、三俗無下限的表演風格，在中國取得了巨大成功，紅火程度遠遠超過了德國。不僅位居國內各" +
                "個視訊網站的下載榜和點播榜高位，且在微博和媒體間，引發了坊間熱議和話題傳播。網友們更是" +
                "形象地將其翻譯為《屌絲女士》，對其無比熱衷。於是我們決定著手拍一部屬於中國人，" +
                "帶強烈國人屌絲色彩的《屌絲男士》。");  

        String word = "吊絲男士";  
        spellCheckTest.search(word, 4);  
    }  


}

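1. The SpellChecker constructor takes the Directory in which the spell-check index is stored.
2. Index the dictionary (the n-gram tokenization is encapsulated inside indexDictionary).

3. Compute and return the similar words: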

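The similarity between two words is computed with LevensteinDistance, which is the edit distance between two strings. Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to turn one string into the other. The permitted operations are replacing one character with another, inserting a character, and deleting a character.
For example, turning kitten into sitting takes three edits:

    sitten (k→s)

    sittin (e→i)

    sitting (→g)

The concept was proposed by the Russian scientist Vladimir Levenshtein in 1965.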
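The definition above can be sketched with the classic dynamic-programming formulation. The names EditDistanceSketch, distance and similarity are hypothetical, and this is not a copy of Lucene's LevensteinDistance class. The similarity method shows one common way to normalize the distance into a 0..1 score; treat it as an assumption for illustration, not as a statement of what Lucene computes.

```java
// Illustrative only: EditDistanceSketch is a hypothetical class, not Lucene code.
public class EditDistanceSketch {

    // Minimum number of single-character substitutions, insertions and
    // deletions needed to turn a into b, using two rolling rows.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    public static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        System.out.println(distance("吊絲男士", "屌絲男士")); // prints 1
    }
}
```

With this scoring, the misspelled query 吊絲男士 is one substitution away from 屌絲男士, so a dictionary entry for the latter would rank very close to the query.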
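That is the basic process of simple spell checking.
Next, let's look at the results obtained with different ngram settings.
[Figure: the dictionary file]
[Figure: results with ngram = 4 (edit distance treated as 4)]

[Figure: results with ngram = 1 (edit distance treated as 1)]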
