Elasticsearch Custom Analyzer: Integrating the Jieba Analysis Plugin
About the jieba Elasticsearch plugin:
https://github.com/huaban/elasticsearch-analysis-jieba
The plugin is developed by huaban and supports Elasticsearch versions <= 2.3.5.
Jieba Analyzers
The jieba plugin provides three analyzers: jieba_index, jieba_search, and jieba_other.
- jieba_index: used at index time; finer-grained segmentation.
- jieba_search: used at query time; coarser-grained segmentation.
- jieba_other: converts full-width characters to half-width and uppercase to lowercase, and splits text character by character.
Using either the jieba_index or jieba_search analyzer gives you basic segmentation out of the box. Here is a minimal mapping example:

```json
{
  "mappings": {
    "test": {
      "_all": { "enabled": false },
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "jieba_index",
          "search_analyzer": "jieba_index"
        }
      }
    }
  }
}
```
In a production environment, business requirements typically call for two additional features:
- synonym support;
- character filter support.
The jieba_index and jieba_search analyzers shipped with the plugin provide neither.
Custom Analyzers
When jieba_index and jieba_search do not meet production needs, we can solve the problems above with a custom analyzer.
An analyzer is composed of character filters, a tokenizer, and token filters: one analyzer may contain multiple character filters, exactly one tokenizer, and multiple token filters.
For our use case, we need a mapping character filter to replace certain strings before segmentation, e.g. rewriting the user input c# to csharp and c++ to cplus.
The following sections walk through each component of the analyzer.
1. Mapping character filter (Mapping Char Filter)
This is Elasticsearch's built-in mapping character filter, defined under settings -> analysis -> char_filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "mapping_filter": {
          "type": "mapping",
          "mappings": [
            "c# => csharp",
            "c++ => cplus"
          ]
        }
      }
    }
  }
}
```
The character mapping table can also be loaded from a file:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "mapping_filter": {
          "type": "mapping",
          "mappings_path": "mappings.txt"
        }
      }
    }
  }
}
```

By default the file is read from the config directory, i.e. config/mappings.txt.
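Each line of mappings.txt uses the same `key => value` syntax as the inline `mappings` array. A minimal sketch matching the replacements used in this article (the file contents are illustrative, not mandated by the filter):

```
c# => csharp
c++ => cplus
```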
2. Jieba token filter (JiebaTokenFilter)
JiebaTokenFilter accepts a SegMode parameter with two possible values: Index and Search.
We define two token filters up front, jieba_index_filter and jieba_search_filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jieba_index_filter": {
          "type": "jieba",
          "seg_mode": "index"
        },
        "jieba_search_filter": {
          "type": "jieba",
          "seg_mode": "search"
        }
      }
    }
  }
}
```

These two token filters will be used by the index analyzer and the search analyzer, respectively.
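To inspect what a token filter emits before wiring it into an analyzer, the _analyze API can combine a tokenizer with filters ad hoc. A sketch, assuming the Elasticsearch 2.x request-body parameter names (`filters`/`char_filters`; later versions renamed them `filter`/`char_filter`) and arbitrary sample text:

```json
GET /my_index/_analyze
{
  "tokenizer": "whitespace",
  "filters": ["jieba_index_filter"],
  "text": "Elasticsearch 是一個分散式搜尋引擎"
}
```

The response lists each token with its position and offsets, which makes it easy to compare seg_mode index against search.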
3. Stop token filter
The jieba token filter (JiebaTokenFilter) does not remove stopwords itself, so our custom analyzer needs a dedicated stop token filter.
Elasticsearch provides a built-in stop token filter, which can be defined like this:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
```
The stopword list can also be loaded from a file:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords_path": "stopwords.txt"
        }
      }
    }
  }
}
```

By default the file is read from the config directory, i.e. config/stopwords.txt.
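The stopwords file is plain text with one stopword per line. A sketch equivalent to the inline list above:

```
and
is
the
```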
4. Synonym token filter
We use Elasticsearch's built-in synonym token filter to provide synonym support. Note that its parameter is named synonyms:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "中文,漢語,漢字"
          ]
        }
      }
    }
  }
}
```
When the synonym list is large, loading it from a file is recommended:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        }
      }
    }
  }
}
```
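Like the other files, synonyms.txt lives under config/ by default. It follows the Solr synonym format: a comma-separated line defines an equivalence group, while `=>` rewrites the terms on the left to the terms on the right. A small sketch (entries are illustrative):

```
中文,漢語,漢字
js => javascript
```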
5. Redefining the jieba_index and jieba_search analyzers
Elasticsearch lets us chain these components into a custom analyzer: we use whitespace as the tokenizer, and add the jieba token filters defined earlier (jieba_index_filter and jieba_search_filter) to the token-filter chain:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jieba_index": {
          "char_filter": ["mapping_filter"],
          "tokenizer": "whitespace",
          "filter": ["jieba_index_filter", "stop_filter", "synonym_filter"]
        },
        "jieba_search": {
          "char_filter": ["mapping_filter"],
          "tokenizer": "whitespace",
          "filter": ["jieba_search_filter", "stop_filter", "synonym_filter"]
        }
      }
    }
  }
}
```
Note that the analyzers above are still named jieba_index and jieba_search, deliberately shadowing the analyzers provided by the jieba plugin: when several analyzers share a name, Elasticsearch prefers the one defined in the index settings. Calling code therefore needs no changes.
Here is the complete configuration:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "mapping_filter": {
          "type": "mapping",
          "mappings_path": "mappings.txt"
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        },
        "stop_filter": {
          "type": "stop",
          "stopwords_path": "stopwords.txt"
        },
        "jieba_index_filter": {
          "type": "jieba",
          "seg_mode": "index"
        },
        "jieba_search_filter": {
          "type": "jieba",
          "seg_mode": "search"
        }
      },
      "analyzer": {
        "jieba_index": {
          "char_filter": ["mapping_filter"],
          "tokenizer": "whitespace",
          "filter": ["jieba_index_filter", "stop_filter", "synonym_filter"]
        },
        "jieba_search": {
          "char_filter": ["mapping_filter"],
          "tokenizer": "whitespace",
          "filter": ["jieba_search_filter", "stop_filter", "synonym_filter"]
        }
      }
    }
  }
}
```
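After creating the index with this configuration, the overridden analyzer can be verified end to end via the _analyze API (2.x syntax; the actual tokens depend on the jieba dictionary and your word files, so no expected output is shown):

```json
GET /my_index/_analyze
{
  "analyzer": "jieba_index",
  "text": "c# 和 Elasticsearch"
}
```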
References:
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index.html
http://www.tuicool.com/articles/eUJJ3qF