Using single-character tokenization and phrase queries in Lucene to approximate full fuzzy (substring) search
阿新 • Published: 2019-01-24
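Full fuzzy (substring-style) queries in Lucene are slow. Tokenizing the text into single characters and searching with a phrase query gives a similar result and is far more efficient.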
Expected tokenization:
中華人員共和國Chinese,Come On → 中/華/人/員/共/和/國/C/h/i/n/e/s/e/,/C/o/m/e/ /O/n
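To make the expected result concrete, here is a minimal plain-Java sketch of single-character splitting. It does not use Lucene and is not how NGramTokenizer is implemented; the class name SingleCharTokens and the method split are hypothetical names chosen for this illustration. It keeps punctuation and spaces as tokens, as in the expected output above, and the sample string is ASCII only to keep the listing portable (CJK characters are split the same way, one token per code point).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical helper, not part of Lucene.
class SingleCharTokens {
    // Split text into one token per Unicode code point.
    // Punctuation and whitespace are kept as tokens.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            tokens.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the tokens separated by slashes.
        System.out.println(String.join("/", split("Chinese,Come On")));
    }
}
```

Iterating by code point rather than by char keeps characters outside the Basic Multilingual Plane in one piece.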
Drawback: when the index contains a large volume of digits and English text, queries made up of digits or English letters are slow.
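One plausible reason for this drawback (an inference, not something measured in this post) is that digits and Latin letters form a very small alphabet. Each single-character term then appears in a large share of the documents, and a phrase query has to walk correspondingly long posting lists. The plain-Java sketch below does not use Lucene; the class name CharDocFrequency and the sample strings are made up for illustration. It counts, for each single-character term, how many documents contain it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustration only: a hypothetical helper, not part of Lucene.
class CharDocFrequency {
    // For each single-character term, count the documents that contain it.
    static Map<String, Integer> docFrequency(List<String> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            // Collect the distinct characters of this document first,
            // so each document is counted once per term.
            Set<String> seen = new HashSet<>();
            for (int i = 0; i < doc.length(); ) {
                int cp = doc.codePointAt(i);
                seen.add(new String(Character.toChars(cp)));
                i += Character.charCount(cp);
            }
            for (String term : seen) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("order 1001", "order 1002", "invoice 2001");
        // Common digits and letters show up in most or all documents.
        System.out.println(docFrequency(docs));
    }
}
```

In a corpus dominated by digits and English text, nearly every term would have a document frequency close to the corpus size, which is the situation the drawback describes.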
1. Create the MyNGramAnalyzer class to implement a single-character tokenizer
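import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.util.Version;

public final class MyNGramAnalyzer extends Analyzer {
    private final Version version;

    public MyNGramAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        /* new NGramTokenizer(version, reader, minGram, maxGram) does the single-character tokenization.
         * minGram is the minimum gram length and maxGram the maximum gram length.
         * Both are 1 here, so every character becomes its own token.
         */
        return new TokenStreamComponents(new NGramTokenizer(version, reader, 1, 1));
    }
}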
2. Build the index with the MyNGramAnalyzer analyzer
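// Imports needed by this method (place the method inside any class):
import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public static void main(String[] args) {
    try {
        // Directory where the index is stored
        Directory dir = FSDirectory.open(new File("d:/tool/index"));
        // Lucene version; 4.5 is used here
        Version version = Version.LUCENE_45;
        IndexWriterConfig iwc = new IndexWriterConfig(version, new MyNGramAnalyzer(version));
        /* Index open mode:
         * CREATE: delete the existing index and create a new one;
         * CREATE_OR_APPEND: add to the existing index if there is one, otherwise create a new one;
         * APPEND: add to the existing index
         */
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(dir, iwc);
        // Each Document represents one record
        Document document1 = new Document();
        document1.add(new TextField("aaa", "中華人員共和國Chinese,Come On", Field.Store.YES));
        Document document2 = new Document();
        document2.add(new TextField("bbb", "中國Chinese", Field.Store.YES));
        writer.addDocument(document1);
        writer.addDocument(document2);
        writer.commit();
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}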
3. Search the index with a PhraseQuery
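// Imports needed by this method (place the method inside any class):
import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public static void main(String[] args) {
    // Phrase query built from single-character terms
    PhraseQuery pq = new PhraseQuery();
    // Search text
    String queryParam = "和國Chinese,";
    // Add each character of the search text to the PhraseQuery, in order
    for (int i = 0; i < queryParam.length(); i++) {
        pq.add(new Term("aaa", queryParam.substring(i, i + 1)));
    }
    // Run the search
    try {
        DirectoryReader directoryReader = DirectoryReader.open(FSDirectory.open(new File("d:/tool/index")));
        IndexSearcher indexSearcher = new IndexSearcher(directoryReader);
        // The first argument is the query, the second is the maximum number of hits to return
        TopDocs hits = indexSearcher.search(pq, 10);
        // Loop over scoreDocs, not totalHits: totalHits can exceed the 10 results actually returned
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println(doc.get("aaa"));
        }
        directoryReader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}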
Output: 中華人員共和國Chinese,Come On
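Why a phrase query over single-character terms behaves like a substring search can be sketched in plain Java. The sketch does not use Lucene and is not how Lucene stores or scores anything; the class CharPhraseIndex and its methods are hypothetical names for this illustration. It records the position of every character per document, then reports a document as a hit only when the characters of the search text occur at consecutive positions. For simplicity it works on Java char values, and the sample strings are ASCII only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a toy positional index, not part of Lucene.
class CharPhraseIndex {
    // character -> (document id -> positions of that character in the document)
    private final Map<Character, Map<Integer, List<Integer>>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Index one document, recording the position of every character.
    void add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (int pos = 0; pos < text.length(); pos++) {
            postings.computeIfAbsent(text.charAt(pos), c -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Return the documents in which the characters of the phrase
    // occur at consecutive positions, in document-id order.
    List<String> search(String phrase) {
        List<String> hits = new ArrayList<>();
        if (phrase.isEmpty()) {
            return hits;
        }
        Map<Integer, List<Integer>> first = postings.get(phrase.charAt(0));
        if (first == null) {
            return hits;
        }
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            int docId = entry.getKey();
            for (int start : entry.getValue()) {
                if (matchesAt(phrase, docId, start)) {
                    hits.add(docs.get(docId));
                    break; // one match per document is enough
                }
            }
        }
        return hits;
    }

    // Check that character k of the phrase sits at position start + k in the document.
    private boolean matchesAt(String phrase, int docId, int start) {
        for (int k = 1; k < phrase.length(); k++) {
            Map<Integer, List<Integer>> byDoc = postings.get(phrase.charAt(k));
            if (byDoc == null) {
                return false;
            }
            List<Integer> positions = byDoc.get(docId);
            if (positions == null || !positions.contains(start + k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CharPhraseIndex index = new CharPhraseIndex();
        index.add("Chinese,Come On");
        index.add("Come here");
        System.out.println(index.search("e,Co"));
    }
}
```

A search for "e,Co" matches only the first sample document, because the comma and the following characters must line up position by position, which is the same condition a substring match imposes.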