Elasticsearch analyzers
阿新 · Published 2020-12-20
analyzer
Analyzers are used in two situations:
1. Index time analysis: when a document is created or updated, its text fields are analyzed.
2. Search time analysis: when a query runs, the query string is analyzed.
- Specify the analyzer at query time via the analyzer parameter:
GET test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}
- Specify search_analyzer when creating the index mapping:
PUT test2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "standard"
      }
    }
  }
}
# If no analyzer is specified, the default standard analyzer is used
Note:
- Decide explicitly whether each field needs to be analyzed. For fields that do not, set the type to keyword, which saves space and improves write performance.
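As a minimal sketch (the index name test_keyword and the status field are illustrative), a field that is only filtered on exact values can be mapped as keyword so it is never analyzed:
PUT test_keyword
{
  "mappings": {
    "properties": {
      "status": { "type": "keyword" },
      "content": { "type": "text", "analyzer": "standard" }
    }
  }
}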
The _analyze API
GET _analyze
{
  "analyzer": "standard",
  "text": "this is a test"
}
# Shows how the text is tokenized by the standard analyzer
{
  "tokens" : [
    { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
Setting an analyzer
PUT test3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}
Execution results:
POST test3/_analyze
{
  "field": "my_text",
  "text": ["The test message."]
}
{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "test", "start_offset" : 4, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "message", "start_offset" : 9, "end_offset" : 16, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
POST test3/_analyze
{
  "field": "my_text.english",
  "text": ["The test message."]
}
{
  "tokens" : [
    { "token" : "test", "start_offset" : 4, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "message", "start_offset" : 9, "end_offset" : 16, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
ES ships with many built-in analyzers, for example:
- standard, which consists of:
  - tokenizer: Standard Tokenizer
  - token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter
Testing with the _analyze API:
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}
- whitespace: splits on whitespace
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
--> [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
- simple
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
--> [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
- stop: stopwords defaults to _english_
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
--> [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Optional parameters:
# stopwords
# stopwords_path
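A custom stopword list can be configured when declaring a stop analyzer in the index settings; a minimal sketch (the index name and word list below are made up for illustration):
PUT stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over", "a"]
        }
      }
    }
  }
}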
- keyword: does not tokenize at all
POST _analyze
{
  "analyzer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
gives "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", i.e. the whole sentence as a single token
Third-party analyzer plugins --- Chinese analysis (the ik analyzer)
ES has many built-in analyzers, but none of them handle Chinese well; for example, the standard analyzer splits a Chinese sentence into single characters. In that case you can install a third-party analyzer plugin such as ik or pinyin. The example below uses ik.
1. Install the plugin, then restart ES:
# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
# /etc/init.d/elasticsearch restart
2. Usage examples:
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好嗎?我有一句話要對你說呀。"
}
GET _analyze
{
"analyzer": "ik_smart",
"text": "你好嗎?我有一句話要對你說呀。"
}
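In a mapping, a common combination is to index with ik_max_word (finer-grained terms) and search with ik_smart (coarser terms). A minimal sketch, with an illustrative index name:
PUT chinese_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}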
Reference: https://github.com/medcl/elasticsearch-analysis-ik
You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters.
- custom: a user-defined analyzer made up of:
  - zero or more character filters
  - exactly one tokenizer
  - zero or more token filters
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's <b> bone.</b>"]
}
gives [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
Custom analyzers
A custom analyzer is defined in the index settings, as shown below:
PUT test_index
{
  "settings": {
    "analysis": {          # analysis settings, can be customized
      "char_filter": {},   # char_filter section
      "tokenizer": {},     # tokenizer section
      "filter": {},        # filter section
      "analyzer": {}       # analyzer section
    }
  }
}
character filter: processes the raw text before the tokenizer, e.g. adding, removing, or replacing characters.
It affects the position and offset information seen by the subsequent tokenizer.
- html_strip: strips HTML tags and decodes HTML entities
  - parameter: escaped_tags, the tags that should not be stripped
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I'm so <b>happy</b>!</p>"]
}
gives: "token": """ I'm so happy! """
# Configuration example
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {            # analyzer section
        "my_analyzer": {       # custom analyzer
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {         # char_filter section
        "my_char_filter": {    # custom char_filter
          "type": "html_strip",
          "escaped_tags": ["b"]   # array of HTML tags that should not be stripped from the text
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["<p>I'm so <b>happy</b>!</p>"]
}
gives: "token": """ I'm so <b>happy</b>! """
- mapping: the mapping char filter; exactly one of the following parameters must be set
  - mappings: an array of mappings, each in the form key => value
  - mappings_path: an absolute path, or a path relative to the config directory, pointing to a file of key => value mappings
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {            # analyzer section
        "my_analyzer": {       # custom analyzer
          "tokenizer": "standard",
          "char_filter": "my_char_filter"
        }
      },
      "char_filter": {         # char_filter section
        "my_char_filter": {    # custom char_filter
          "type": "mapping",
          "mappings": [        # the mapping rules
            ":)=>happy",
            ":(=>sad"
          ]
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["i am so :)"]
}
gives [ i, am, so, happy ]
- pattern_replace: replaces characters that match a regular expression (see the sketch after the parameter list)
  - pattern: the regular expression
  - replacement: the replacement string; may reference capture groups with $1..$9
  - flags: regex flags
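The original gives no example for pattern_replace, so here is a sketch under illustrative assumptions (index name and pattern made up): digits joined by hyphens are rewritten with underscores before tokenization.
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["My card number is 123-456-789"]
}
--> [ My, card, number, is, 123_456_789 ]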
tokenizer: splits the raw text into terms according to a set of rules
- standard
  - parameter: max_token_length, the maximum token length, default 255
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
gives [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
# "jumped" has length 6, so it is split at position 5
- letter: splits into terms at any non-letter character
POST _analyze
{
  "tokenizer": "letter",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
gives [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
- lowercase: same as the letter tokenizer, but also lower-cases the letters
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
gives [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
- whitespace: splits into terms at whitespace characters
  - parameter: max_token_length
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
gives [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
- keyword: a no-op tokenizer that outputs the text unchanged
  - parameter: buffer_size, the number of characters read into the term buffer per pass, default 256
POST _analyze
{
  "tokenizer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
gives "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", the complete text as one token
token filter: adds, removes, or modifies the terms output by the tokenizer
- lowercase: converts the output terms to lower case
POST _analyze
{
  "filter": ["lowercase"],
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"]
}
---> "token": "the 2 quick brown-foxes jumped over the lazy dog's bone"
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": "lowercase"
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"]
}
- stop: removes stop words from the token stream.
Parameters:
# stopwords: the stop words to use, default _english_
# stopwords_path
# ignore_case: if true, matching is case-insensitive (tokens are lower-cased before comparison), default false
# remove_trailing
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": "my_filter"
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["and", "or", "not"]
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["lucky and happy not sad"]
}
--> [ lucky, happy, sad ]