Elasticsearch 50: Index Management - Hands-On Practice: Modifying Analyzers and Customizing Your Own Analyzer
1. The default standard analyzer
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it, and so on
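To see how these pieces fit together, the standard analyzer can be approximated by recombining them yourself as a custom analyzer built from the standard tokenizer and the lowercase token filter. The sketch below is only illustrative; the index name index_std_demo and the analyzer name rebuilt_standard are made up for this example:
PUT /index_std_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}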
2. Modifying analyzer settings
Example: enable the stop token filter, with the English stop-word list, on a standard-type analyzer.
Here, es_std is the name we give this analyzer.
PUT /index0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Test:
Analyze the text a little dog with the standard analyzer:
GET /index0/_analyze
{
  "analyzer": "standard",
  "text": "a little dog"
}
Result:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
Analyze a little dog with the es_std analyzer configured above; in the result, the stop word a has been filtered out:
GET /index0/_analyze
{
"analyzer":"es_std",
"text":"a little dog"
}
Result:
{ "tokens": [ { "token": "little", "start_offset": 2, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 }, { "token": "dog", "start_offset": 9, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 } ] }
3. Customizing your own analyzer
Example
char_filter: type mapping, a character-replacement filter of our own. Here we map & to and and name the filter &_to_and.
my_stopwords: type stop, our own stop-word list. Here we configure two stop words, a and the.
my_analyzer: type custom, our own analyzer. Before tokenization, the html_strip char filter strips HTML tags and &_to_and (defined above) replaces & with and; tokenization uses the standard tokenizer; the token filters then lowercase all terms and remove the stop words in my_stopwords.
PUT /index0
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["a", "the"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
Run it. This returns an error because the index already exists:
{
"error": {
"root_cause": [
{
"type": "index_already_exists_exception",
"reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
"index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
"index": "index0"
}
],
"type": "index_already_exists_exception",
"reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
"index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
"index": "index0"
},
"status": 400
}
Delete the index first with DELETE /index0, then run the PUT again.
This time it succeeds:
{
"acknowledged": true,
"shards_acknowledged": true
}
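Optionally, before testing, you can fetch the index settings to confirm that the char filter, token filter, and analyzer were stored; the response should echo back the analysis block defined above:
GET /index0/_settings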
Test our my_analyzer analyzer:
Sample text: tom and jery in the a house <a> & me HAHA
The result shows that a and the were filtered out, HAHA was lowercased, & was replaced with and, and the <a> tag was stripped.
GET /index0/_analyze
{
"analyzer": "my_analyzer",
"text":"tom and jery in the a house <a> & me HAHA"
}
Result:
{
"tokens": [
{
"token": "tom",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "and",
"start_offset": 4,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jery",
"start_offset": 8,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "in",
"start_offset": 13,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "house",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "and",
"start_offset": 32,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "me",
"start_offset": 34,
"end_offset": 36,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "haha",
"start_offset": 37,
"end_offset": 41,
"type": "<ALPHANUM>",
"position": 9
}
]
}
4. Using our custom analyzer in the index
Set the content field of the my_type type to use our custom analyzer my_analyzer. Note that creating or updating a mapping is a PUT request:
PUT /index0/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
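With this mapping in place, the _analyze API can also resolve the analyzer from a field instead of taking an explicit analyzer name, which is a quick way to confirm that content is analyzed with my_analyzer. A sketch (the sample text here is made up):
GET /index0/_analyze
{
  "field": "content",
  "text": "tom & jerry in the <a>house</a>"
}
The tokens should come back roughly as tom, and, jerry, in, house: & rewritten to and, the <a> tags stripped, and the removed by my_stopwords.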