ElasticSearch - Analyzers
analyzer
Analyzers are used in two situations:
1. Index time analysis: when a document is created or updated, its text fields are analyzed (tokenized).
2. Search time analysis: when a query runs, the query string is analyzed.
There are several ways to specify which analyzer is used at search time:
- Specify the analyzer directly in the query via the `analyzer` parameter:
```
GET test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}
```
- Specify `search_analyzer` when creating the index mapping:
```
PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
```
Index-time analysis is configured per field through the `analyzer` parameter in the index mapping:
```
# When no analyzer is specified, the default standard analyzer is used
PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace"   # specify the analyzer; ES has many built-in analyzers
        }
      }
    }
  }
}
```
Note:
- Decide explicitly whether each field needs analysis. For fields that do not, set the type to `keyword`: this saves space and improves write performance (a sketch follows).
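As a minimal sketch of that advice (the index and field names here are made up), an exact-match field is mapped as `keyword`, so its value is indexed verbatim instead of going through an analyzer:

```
PUT test_index2
{
  "mappings": {
    "doc": {
      "properties": {
        "status": {
          "type": "keyword"   # not analyzed; matched only as the exact value
        }
      }
    }
  }
}
```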
_analyze API
```
# Inspect the result of analyzing the text with the standard analyzer
GET _analyze
{
  "analyzer": "standard",
  "text": "this is a test"
}
```
{ "tokens": [ { "token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 }, { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 } ] }View Code
Setting the analyzer
```
PUT test
{
  "settings": {
    "analysis": {                      # custom analysis settings
      "analyzer": {                    # the analyzer section
        "my_analyzer": {               # a custom analyzer
          "type": "standard",          # based on the standard analyzer type
          "stopwords": "_english_"     # standard's stopwords parameter; the default is _none_
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard",      # the my_text field uses the standard analyzer
          "fields": {
            "english": {               # the my_text.english sub-field uses the custom my_analyzer defined above
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}

POST test/_analyze
{
  "field": "my_text",                  # my_text uses the standard analyzer
  "text": ["The test message."]
}
# --> [the, test, message]

POST test/_analyze
{
  "field": "my_text.english",          # my_text.english uses my_analyzer
  "text": ["The test message."]
}
# --> [test, message]
```
ES has many built-in analyzers. For example:
- standard, composed of:
  - tokenizer: Standard Tokenizer
  - token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter
Testing with the `_analyze` API:

```
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```
The result is:

```
{
  "tokens": [
    { "token": "the",    "start_offset": 0,  "end_offset": 3,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "2",      "start_offset": 4,  "end_offset": 5,  "type": "<NUM>",      "position": 1 },
    { "token": "quick",  "start_offset": 6,  "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "brown",  "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "foxes",  "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 4 },
    { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "<ALPHANUM>", "position": 5 },
    { "token": "over",   "start_offset": 31, "end_offset": 35, "type": "<ALPHANUM>", "position": 6 },
    { "token": "the",    "start_offset": 36, "end_offset": 39, "type": "<ALPHANUM>", "position": 7 },
    { "token": "lazy",   "start_offset": 40, "end_offset": 44, "type": "<ALPHANUM>", "position": 8 },
    { "token": "dog's",  "start_offset": 45, "end_offset": 50, "type": "<ALPHANUM>", "position": 9 },
    { "token": "bone",   "start_offset": 51, "end_offset": 55, "type": "<ALPHANUM>", "position": 10 }
  ]
}
```
- whitespace: uses whitespace as the delimiter
POST _analyze { "analyzer": "whitespace", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." } --> [ The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone. ]
- simple: splits on any non-letter character and lowercases the tokens
```
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# --> [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
```
- stop: by default uses the _english_ stopwords set
```
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# --> [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
```

Optional parameters: `stopwords`, `stopwords_path`.
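As a sketch of those parameters (the index name and the stopword list are invented for illustration), a stop analyzer with a custom stopword list can be defined in the index settings:

```
PUT stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]   # replaces the default _english_ list
        }
      }
    }
  }
}
```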
- keyword: performs no tokenization
```
POST _analyze
{
  "analyzer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
```

This yields a single token, `"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."`: the complete, unmodified sentence.
==================================================================================
Third-party analyzer plugins: Chinese word segmentation (the ik analyzer)
ES has many built-in analyzers, but none of them handle Chinese well: the standard analyzer, for example, splits a Chinese sentence into individual characters. In that case a third-party analyzer plugin such as ik or pinyin can be used. The following uses ik as an example.
1. First, install the plugin and restart ES:
```
# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
# /etc/init.d/elasticsearch restart
```
2. Example usage:
```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好嗎?我有一句話要對你說呀。"
}
```
{ "tokens": [ { "token": "你好", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "好嗎", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1 }, { "token": "我", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 2 }, { "token": "有", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 3 }, { "token": "一句話", "start_offset": 6, "end_offset": 9, "type": "CN_WORD", "position": 4 }, { "token": "一句", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 5 }, { "token": "一", "start_offset": 6, "end_offset": 7, "type": "TYPE_CNUM", "position": 6 }, { "token": "句話", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7 }, { "token": "句", "start_offset": 7, "end_offset": 8, "type": "COUNT", "position": 8 }, { "token": "話", "start_offset": 8, "end_offset": 9, "type": "CN_CHAR", "position": 9 }, { "token": "要對", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 10 }, { "token": "你", "start_offset": 11, "end_offset": 12, "type": "CN_CHAR", "position": 11 }, { "token": "說呀", "start_offset": 12, "end_offset": 14, "type": "CN_WORD", "position": 12 } ] } 分詞結果View Code
Reference: https://github.com/medcl/elasticsearch-analysis-ik
You can also assemble an analyzer of your own (a custom analyzer) from the built-in character filters, tokenizers, and token filters, as sketched after the list below.
- custom: a user-defined analyzer, composed of:
  - 0 or more character filters
  - exactly 1 tokenizer
  - 0 or more token filters
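A minimal sketch of such a custom analyzer (the index name and the particular filter choices are invented for illustration): an html_strip character filter, the standard tokenizer, and lowercase plus stop token filters.

```
PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],    # 0+ character filters, applied first
          "tokenizer": "standard",          # exactly 1 tokenizer
          "filter": ["lowercase", "stop"]   # 0+ token filters, applied to the tokens
        }
      }
    }
  }
}

POST my_custom_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>The QUICK Brown-Foxes!</p>"
}
# --> [quick, brown, foxes]
```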