es - Elasticsearch custom analyzers - built-in tokenizers
阿新 • Published: 2021-01-27
No program in the world is perfect, but that is no reason to be discouraged, because writing programs is a continuous pursuit of perfection.
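Custom analyzer:
- Character filters:
  1. Purpose: add, remove, or change characters
  2. Count: zero or more
  3. Built-in character filters:
     1. HTML Strip Character filter: strips HTML tags
     2. Mapping Character filter: replaces text according to a mapping
     3. Pattern Replace Character filter: replaces text that matches a regular expression
- Tokenizer:
  1. Purpose:
     1. Split the text into tokens
     2. Record the order and position of each token (used by phrase queries)
     3. Record the type of each token (classification)
  2. Count: exactly one
  3. Categories:
     1. Word-oriented tokenizers:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
     2. Partial-word tokenizers:
        1. N-Gram
        2. Edge N-Gram
     3. Structured-text tokenizers:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path Hierarchy
- Token filters:
  1. Purpose: add, remove, or change tokens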
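The three stages above run in a fixed order: character filters first, then the single tokenizer, then the token filters. The following is a minimal Python sketch of that pipeline, not Elasticsearch code. The helper names (`html_strip`, `whitespace_tokenizer`, `lowercase_filter`, `analyze`) are hypothetical stand-ins invented for illustration, and the tag-stripping regex is far cruder than the real HTML Strip character filter.

```python
import re


def html_strip(text):
    # Crude stand-in for a character filter: drop anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    # Stand-in for the single tokenizer: split on runs of whitespace.
    return text.split()


def lowercase_filter(tokens):
    # Stand-in for a token filter: change every token.
    return [token.lower() for token in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    # 1. zero or more character filters rewrite the raw text
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. exactly one tokenizer turns the text into tokens
    tokens = tokenizer(text)
    # 3. zero or more token filters rewrite the token stream
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>Hello</b> World", [html_strip], whitespace_tokenizer, [lowercase_filter])
print(tokens)  # ['hello', 'world']
```

Passing the tokenizer as a single argument while the filters are lists mirrors the "exactly one" versus "zero or more" rule in the outline.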
Today's post demonstrates the structured-text tokenizers:
# keyword tokenizer
# Emits the whole input unchanged as a single token
GET /_analyze
{
  "tokenizer": "keyword",
  "text": ["hello world", "我是中國人"]
}
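# Result
# Each array value becomes one token. The second value starts at offset 12 and position 101,
# which is consistent with _analyze leaving an offset gap of 1 and a position gap of 100 between values.
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "我是中國人",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 101
    }
  ]
}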
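The offsets and positions in the result above can be reproduced with a few lines of Python. This is only a sketch: the function name `keyword_analyze` is hypothetical, and the gap values (1 for offsets, 100 for positions) are assumptions chosen because they match the output shown, not values read from the Elasticsearch source.

```python
def keyword_analyze(values, position_gap=100, offset_gap=1):
    # Assumed gaps: picked to match the _analyze output shown in this post.
    tokens = []
    position = -1
    offset_base = 0
    for value in values:
        position += 1
        tokens.append({
            "token": value,  # the keyword tokenizer keeps the value as one token
            "start_offset": offset_base,
            "end_offset": offset_base + len(value),
            "position": position,
        })
        # Leave a gap before the next array value.
        offset_base += len(value) + offset_gap
        position += position_gap
    return tokens


result = keyword_analyze(["hello world", "我是中國人"])
for token in result:
    print(token)
```

Note that `len()` counts Unicode code points; for the characters used here that agrees with the offsets in the result.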
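# pattern tokenizer
# Regex-based tokenizer: by default it splits the text on matches of the pattern;
# with group >= 0 it emits the text captured by that group as the token instead
# Options:
# 1. pattern : a Java regular expression, defaults to \W+
# 2. flags : Java regular expression flags, for example "CASE_INSENSITIVE|COMMENTS"
# 3. group : which capture group to emit as the token, defaults to -1 (split on matches)
GET /_analyze
{
  "tokenizer": {
    "type" : "pattern",
    "pattern" : "((?:[a-z0-9])+)",
    "group" : 1
  },
  "text": ["hello 23456"]
}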
# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "23456",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    }
  ]
}
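The two modes of the pattern tokenizer (split on matches versus emit a capture group) are easy to mix up, so here is a rough Python imitation. The function `pattern_tokenize` is a hypothetical helper, and it uses Python's `re` module rather than Java regular expressions, so the two can disagree on details such as how `\W` treats non-ASCII text.

```python
import re


def pattern_tokenize(text, pattern=r"\W+", group=-1):
    regex = re.compile(pattern)
    tokens = []
    if group < 0:
        # Split mode: matches are separators, the text between them becomes tokens.
        last = 0
        for match in regex.finditer(text):
            if match.start() > last:
                tokens.append(text[last:match.start()])
            last = match.end()
        if last < len(text):
            tokens.append(text[last:])
        return tokens
    # Group mode: the text captured by the chosen group becomes the token.
    for match in regex.finditer(text):
        if match.group(group):
            tokens.append(match.group(group))
    return tokens


print(pattern_tokenize("hello 23456", r"((?:[a-z0-9])+)", group=1))  # ['hello', '23456']
print(pattern_tokenize("hello 23456"))  # ['hello', '23456']
```

Both calls give the same tokens for this input, but for different reasons: the first keeps what the pattern matched, the second keeps what lies between matches of the default pattern.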
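# simple pattern tokenizer
# Uses a Lucene regular expression and emits the text that matches it as tokens
# The default pattern is the empty string, which produces no tokens, so always set pattern
# Options : pattern
GET /_analyze
{
  "tokenizer": {
    "type" : "simple_pattern",
    "pattern" : "[0-9]{3}"
  },
  "text": ["3456786544433 fsdfsd"]
}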
# Result (the trailing "3" is dropped because it cannot complete a three-digit match)
{
  "tokens" : [
    {
      "token" : "345",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "678",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "654",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "443",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    }
  ]
}
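For a pattern as simple as `[0-9]{3}`, the same non-overlapping, left-to-right matching can be imitated with Python's `re.finditer`. This is a sketch only: `simple_pattern_tokenize` is a hypothetical name, and Lucene regular expressions have a different syntax and feature set from Python's, so it should not be trusted for more elaborate patterns.

```python
import re


def simple_pattern_tokenize(text, pattern):
    # Every non-empty, non-overlapping match becomes a (token, start, end) triple.
    return [
        (match.group(), match.start(), match.end())
        for match in re.finditer(pattern, text)
        if match.group()
    ]


matches = simple_pattern_tokenize("3456786544433 fsdfsd", r"[0-9]{3}")
print(matches)  # [('345', 0, 3), ('678', 3, 6), ('654', 6, 9), ('443', 9, 12)]
```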
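# char group tokenizer
# Splits the text whenever it meets one of the configured characters
# Options:
# 1. tokenize_on_chars : the characters to split on, either single characters such as "-"
#    or character classes such as whitespace, letter, digit, punctuation, symbol
# 2. max_token_length : maximum token length, defaults to 255; longer tokens are split at that length
GET /_analyze
{
  "tokenizer": {
    "type" : "char_group",
    "tokenize_on_chars" : [
      "-", "whitespace", "_"
    ]
  },
  "text": ["sdjflds sdfsd-sdf-7879 fsd_us9098"]
}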
# Result
{
  "tokens" : [
    {
      "token" : "sdjflds",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "sdfsd",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "sdf",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "7879",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "fsd",
      "start_offset" : 23,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "us9098",
      "start_offset" : 27,
      "end_offset" : 33,
      "type" : "word",
      "position" : 5
    }
  ]
}
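Splitting on a set of characters amounts to one pass over the text, which the following Python sketch tries to show. `char_group_tokenize` is a hypothetical helper; it only understands literal characters plus the class names `whitespace`, `letter` and `digit`, and it leaves out `max_token_length` entirely.

```python
def char_group_tokenize(text, tokenize_on_chars):
    # Only three class names are handled in this sketch; anything else is a literal character.
    classes = {"whitespace": str.isspace, "letter": str.isalpha, "digit": str.isdigit}

    def is_separator(ch):
        return any(
            classes[spec](ch) if spec in classes else ch == spec
            for spec in tokenize_on_chars
        )

    tokens = []
    start = None  # index where the current token began, or None between tokens
    for index, ch in enumerate(text):
        if is_separator(ch):
            if start is not None:
                tokens.append((text[start:index], start, index))
                start = None
        elif start is None:
            start = index
    if start is not None:
        tokens.append((text[start:], start, len(text)))
    return tokens


groups = char_group_tokenize("sdjflds sdfsd-sdf-7879 fsd_us9098", ["-", "whitespace", "_"])
for token in groups:
    print(token)
```

Each printed triple is (token, start_offset, end_offset), which makes it easy to compare with the result above.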
# simple pattern split tokenizer
# Splits the text on matches of a Lucene regular expression; the matched text itself is discarded
# Options : pattern
GET /_analyze
{
  "tokenizer": {
    "type" : "simple_pattern_split",
    "pattern" : "[0-9]{3}"
  },
  "text": ["sdfsd23243sdfsd890sdfs"]
}
# Result (the matches "232" and "890" act as separators, so "43" stays attached to the second token)
{
  "tokens" : [
    {
      "token" : "sdfsd",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "43sdfsd",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "sdfs",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 2
    }
  ]
}
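The split variant is the mirror image of simple_pattern: the matches are thrown away and the text between them is kept. Below is a small Python sketch of that idea, using a hypothetical `simple_pattern_split_tokenize` helper built on `re.finditer`. As before, Python's regex dialect is not Lucene's, so this only holds for simple patterns like the one used here.

```python
import re


def simple_pattern_split_tokenize(text, pattern):
    tokens = []
    last = 0  # end of the previous separator
    for match in re.finditer(pattern, text):
        if match.start() == match.end():
            continue  # ignore empty matches
        if match.start() > last:
            tokens.append((text[last:match.start()], last, match.start()))
        last = match.end()
    if last < len(text):
        tokens.append((text[last:], last, len(text)))
    return tokens


pieces = simple_pattern_split_tokenize("sdfsd23243sdfsd890sdfs", r"[0-9]{3}")
print(pieces)  # [('sdfsd', 0, 5), ('43sdfsd', 8, 15), ('sdfs', 18, 22)]
```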
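# path hierarchy tokenizer
# Splits a path on the delimiter and emits one token per level of the hierarchy
# Options:
# 1. delimiter : the character to split on, defaults to /
# 2. replacement : the character that replaces the delimiter in the emitted tokens, defaults to the delimiter
# 3. buffer_size : changing it is not recommended
# 4. reverse : defaults to false
#    1. Reverses the order in which the paths are emitted (paths are anchored at the end of the input)
#    2. Reverses the side that skip counts from
# 5. skip : number of initial tokens to skip, defaults to 0
GET /_analyze
{
  "tokenizer": {
    "type" : "path_hierarchy",
    "delimiter" : "-",
    "replacement" : "/",
    "reverse" : true,
    "skip" : 1
  },
  "text": ["hello-good-this-is-me"]
}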
# Result (with reverse set to true and skip set to 1, the last segment "me" is skipped)
{
  "tokens" : [
    {
      "token" : "hello/good/this/is/",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "good/this/is/",
      "start_offset" : 6,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "this/is/",
      "start_offset" : 11,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is/",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
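The way reverse and skip interact is the least obvious part, so here is a Python sketch that reproduces the token text of the result above (it does not compute offsets or positions). `path_hierarchy_tokenize` is a hypothetical helper, not the Lucene implementation; it assumes the delimiter sticks to the front of each segment in forward mode and to the back in reverse mode, and it has only been checked against the simple inputs shown here.

```python
def path_hierarchy_tokenize(text, delimiter="/", replacement=None, reverse=False, skip=0):
    if replacement is None:
        replacement = delimiter
    pieces = text.split(delimiter)
    if reverse:
        # Reverse mode: the delimiter stays at the END of each segment,
        # and skip drops segments from the end of the path.
        parts = [piece + replacement for piece in pieces[:-1]] + [pieces[-1]]
        parts = [part for part in parts if part]
        parts = parts[:max(0, len(parts) - skip)]
        return ["".join(parts[i:]) for i in range(len(parts))]
    # Forward mode: the delimiter stays at the START of each segment,
    # and skip drops segments from the start of the path.
    parts = [pieces[0]] + [replacement + piece for piece in pieces[1:]]
    parts = [part for part in parts if part]
    parts = parts[skip:]
    return ["".join(parts[: i + 1]) for i in range(len(parts))]


print(path_hierarchy_tokenize("hello-good-this-is-me", "-", "/", reverse=True, skip=1))
# ['hello/good/this/is/', 'good/this/is/', 'this/is/', 'is/']
print(path_hierarchy_tokenize("one-two-three-four-five", "-", "/", skip=2))
# ['/three', '/three/four', '/three/four/five']
```

In forward mode each token extends the previous one by one level; in reverse mode each token drops one level from the front, which is the behaviour you would want for domain-like names.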