es - Elasticsearch custom analyzers - character filters
阿新 • Published: 2021-01-24
Tags: stack - es
No program in the world is perfect, but that is no cause for discouragement: writing programs is a continual pursuit of perfection.
Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
A custom analyzer is made up of three parts:

- Character filters:
  1. Purpose: add, remove, or transform characters
  2. Count: zero or more
  3. Built-in character filters:
     1. HTML Strip Character Filter: strips HTML tags
     2. Mapping Character Filter: replaces characters according to a mapping
     3. Pattern Replace Character Filter: regex-based replacement
- Tokenizer:
  1. Purpose: splits the text into tokens
  2. Count: exactly one
- Token filters:
  1. Purpose: add, remove, or transform tokens
  2. Count: zero or more
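Putting the three parts together, a custom analyzer declaration follows this shape. This is a minimal sketch: the index name `my_index`, the `standard` tokenizer, and the `lowercase` token filter are illustrative choices, not part of the examples below.

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```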
This post focuses on demonstrating character filters. Usage examples follow:
# HTML Strip Character Filter
# Purpose: strips HTML tags
# Options:
# 1. escaped_tags : HTML tags to leave in place
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [{
    "type": "html_strip",
    "escaped_tags": ["p"]
  }],
  "text": ["<html><p>this is my name - 這是我的名字</p></html>",
           "<p>I'm so <b>happy</b>!</p>"]
}
# Result
# Note: the jump to position 101 comes from the default
# position_increment_gap of 100 inserted between array elements.
{
  "tokens" : [
    {
      "token" : """
<p>this is my name - 這是我的名字</p>
""",
      "start_offset" : 0,
      "end_offset" : 44,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "<p>I'm so happy!</p>",
      "start_offset" : 45,
      "end_offset" : 77,
      "type" : "word",
      "position" : 101
    }
  ]
}
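For contrast, here is the same second input without `escaped_tags`: the `<p>` tags are stripped along with everything else (block-level tags are replaced with newlines).

```
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I'm so <b>happy</b>!</p>"]
}
# the token becomes "\nI'm so happy!\n"
```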
# Mapping Character Filter
# Purpose: replaces mapped characters
# Options:
# 1. mappings : a list of "key => value" replacements
# 2. mappings_path : path to a file containing the mappings
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [{
    "type": "mapping",
    "mappings": [
      "٠ => 0",
      "١ => 1",
      "٢ => 2",
      "٣ => 3",
      "٤ => 4",
      "٥ => 5",
      "٦ => 6",
      "٧ => 7",
      "٨ => 8",
      "٩ => 9",
      "a => 97",
      "y => 是"
    ]
  }],
  "text": "My license plate is ٢٥٠١٥"
}
# Result
{
  "tokens" : [
    {
      "token" : "M是 license pl97te is 25015",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "word",
      "position" : 0
    }
  ]
}
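The same mappings can be loaded from a file via `mappings_path` instead of being listed inline; the path is resolved relative to the Elasticsearch config directory. The index name, filter name, and file path below are illustrative:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "arabic_digit_mapping": {
          "type": "mapping",
          "mappings_path": "analysis/mappings.txt"
        }
      }
    }
  }
}
```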
# Pattern Replace Character Filter
# Purpose: regex-based replacement
# Options:
# 1. pattern : a Java regular expression
# 2. replacement : the replacement string
# 3. flags : Java regex flags
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [{
    "type": "pattern_replace",
    "pattern": "\\d",
    "replacement": "2"
  }, {
    "type": "pattern_replace",
    "pattern": "(\\d+)-(?=\\d)",
    "replacement": "$1_"
  }],
  "text": ["12dsfds", "My credit card is 123-456-789"]
}
# Result
{
  "tokens" : [
    {
      "token" : "22dsfds",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "My credit card is 222_222_222",
      "start_offset" : 8,
      "end_offset" : 37,
      "type" : "word",
      "position" : 101
    }
  ]
}
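The `flags` option listed above is not exercised in the example; it takes pipe-separated Java `Pattern` flag names. A sketch, with an illustrative pattern and input of my own:

```
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [{
    "type": "pattern_replace",
    "pattern": "foo",
    "replacement": "bar",
    "flags": "CASE_INSENSITIVE|COMMENTS"
  }],
  "text": "FOO Foo foo"
}
# with CASE_INSENSITIVE set, all three variants are replaced with "bar"
```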
# Combining character filters in a custom analyzer
PUT /test_char_filter
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter_html",
            "my_char_filter_mapping",
            "my_char_filter_pattern"
          ]
        }
      },
      "char_filter": {
        "my_char_filter_html": {
          "type": "html_strip",
          "escaped_tags": ["p"]
        },
        "my_char_filter_mapping": {
          "type": "mapping",
          "mappings": [
            "a => 97",
            "b => 98"
          ]
        },
        "my_char_filter_pattern": {
          "type": "pattern_replace",
          "pattern": "-",
          "replacement": "_"
        }
      }
    }
  }
}
GET /test_char_filter/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>"]
}
# Result
{
  "tokens" : [
    {
      "token" : "this is 97 98 c d 2_3_5_7 <p>hello</p> good",
      "start_offset" : 0,
      "end_offset" : 54,
      "type" : "word",
      "position" : 0
    }
  ]
}
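To put `my_analyzer` to work at index time, assign it to a `text` field in the index mapping. The field name `title` here is illustrative:

```
PUT /test_char_filter/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
```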