es - Elasticsearch custom analyzers - built-in token filters - 2
阿新 • Published: 2021-01-29
No program in the world is perfect, but that does not discourage us, because programming is a continuous pursuit of perfection.
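Custom analyzer :
- Character filters :
  1. Purpose : add, remove, or change characters
  2. Count : zero or more
  3. Built-in character filters :
     1. HTML Strip Character filter : strips HTML tags
     2. Mapping Character filter : replaces characters according to a mapping
     3. Pattern Replace Character filter : regex-based replacement
- Tokenizer :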
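  1. Purpose :
     1. Splits the text into tokens
     2. Records the order and position of each token (used by phrase queries)
     3. Records the type of each token (classification)
  2. Count : exactly one
  3. Categories :
     1. Word-oriented :
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
     2. Partial-word :
        1. N-Gram
        2. Edge N-Gram
     3. Structured text :
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path Hierarchy
- Token filters :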
  1. Purpose : add, remove, or change tokens
  2. Count : zero or more
  3. Built-in token filters :
     1. apostrophe
     2. asciifolding
     3. cjk bigram
     4. cjk width
     5. classic
     6. common grams
     7. conditional
     8. decimal digit
In today's demonstration, pay particular attention to :
- common grams token filter
- conditional token filter
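To make the three-stage structure in the outline concrete, here is a minimal Python sketch that assembles the settings body for a custom analyzer and prints it as JSON. The analyzer name `my_custom_analyzer` and the particular combination of filters are assumptions chosen for illustration only; the body would be sent when creating an index.

```python
import json

# Hypothetical analyzer name and filter selection, for illustration only.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    # Character filters: zero or more, applied first.
                    "char_filter": ["html_strip"],
                    # Tokenizer: exactly one.
                    "tokenizer": "classic",
                    # Token filters: zero or more, applied in order.
                    "filter": ["classic", "decimal_digit"],
                }
            }
        }
    }
}


def build_request_body(body: dict) -> str:
    """Serialize the settings as the JSON body of an index-creation request."""
    return json.dumps(body, ensure_ascii=False, indent=2)


print(build_request_body(index_body))
```

The order of the entries in `filter` matters, because each token filter receives the output of the one before it.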
# classic token filter
# Purpose :
# 1. Removes the English possessive ('s) from the end of words
# 2. Removes the dots from acronyms
# Intended for : the classic tokenizer
GET /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": ["hello this is hi's good H.J.K.M. Q.U.I.C.K. "]
}
# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hi",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<APOSTROPHE>",
      "position" : 3
    },
    {
      "token" : "good",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "HJKM",
      "start_offset" : 24,
      "end_offset" : 32,
      "type" : "<ACRONYM>",
      "position" : 5
    },
    {
      "token" : "QUICK",
      "start_offset" : 33,
      "end_offset" : 43,
      "type" : "<ACRONYM>",
      "position" : 6
    }
  ]
}
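The two rules of the classic filter can be mimicked for a single token with a few lines of Python. This is a simplified model, not Lucene's implementation: the function name `classic_filter` is made up for this example, and it assumes the classic tokenizer has already tagged each token with a type such as `<APOSTROPHE>` or `<ACRONYM>`, as seen in the result.

```python
def classic_filter(term: str, token_type: str) -> str:
    """Simplified model of the classic token filter for one token."""
    # Rule 1: strip a trailing 's (or 'S) from tokens typed <APOSTROPHE>.
    if token_type == "<APOSTROPHE>" and term[-2:] in ("'s", "'S"):
        return term[:-2]
    # Rule 2: drop the dots from tokens typed <ACRONYM>.
    if token_type == "<ACRONYM>":
        return term.replace(".", "")
    # Every other token passes through unchanged.
    return term


# (term, type) pairs as the classic tokenizer would produce them.
tokens = [
    ("hello", "<ALPHANUM>"),
    ("hi's", "<APOSTROPHE>"),
    ("H.J.K.M.", "<ACRONYM>"),
]
print([classic_filter(term, token_type) for term, token_type in tokens])
```

Because the model keys off the token type, it also suggests why the filter is paired with the classic tokenizer: other tokenizers do not assign these types, so the filter would have nothing to act on.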
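# common grams token filter
# Purpose :
# 1. Combines each specified common word with the tokens before and after it into bigrams
# 2. Avoids the loss of information that removing stop words would cause
# Options :
# 1. common_words : the list of words to combine
# 2. common_words_path : path to a file containing the words to combine
# 3. ignore_case : match the common words case-insensitively, defaults to false
# 4. query_mode : when true, common words are not emitted as standalone tokens; defaults to false (they are emitted)
# Note : the elements of "text" are analyzed one after another, so positions in the result jump by 100 between elements (the default position increment gap)
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type" : "common_grams",
    "common_words" : ["是", "的", "Is"],
    "ignore_case" : true,
    "query_mode" : true
  }],
  "text": ["我是中國人", "這是我的飯", "this is my food"]
}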
# Result
{
  "tokens" : [
    {
      "token" : "我_是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "gram",
      "position" : 0
    },
    {
      "token" : "是_中",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "gram",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "國",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "這_是",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 105
    },
    {
      "token" : "是_我",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 106
    },
    {
      "token" : "我_的",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "gram",
      "position" : 107
    },
    {
      "token" : "的_飯",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "gram",
      "position" : 108
    },
    {
      "token" : "this_is",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "gram",
      "position" : 209
    },
    {
      "token" : "is_my",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 210
    },
    {
      "token" : "my",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 211
    },
    {
      "token" : "food",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 212
    }
  ]
}
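To see why some unigrams survive in the result while others disappear, the behaviour can be approximated in Python. The sketch below is a simplified model written to reproduce the output shown above, not the Lucene implementation: the function name `common_grams` is invented, tokens are passed in pre-split, and offsets, positions and token types are ignored.

```python
from typing import Iterable, List, Tuple


def common_grams(tokens: List[str], common_words: Iterable[str],
                 ignore_case: bool = False, query_mode: bool = False) -> List[str]:
    """Simplified model of the common_grams token filter (terms only)."""
    def norm(word: str) -> str:
        return word.lower() if ignore_case else word

    common = {norm(w) for w in common_words}

    # Index mode: every unigram, plus a bigram wherever either neighbour is common.
    stream: List[Tuple[str, bool]] = []  # (term, is_gram)
    for i, tok in enumerate(tokens):
        stream.append((tok, False))
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if norm(tok) in common or norm(nxt) in common:
                stream.append((tok + "_" + nxt, True))
    if not query_mode:
        return [term for term, _ in stream]

    # Query mode: a unigram directly followed by a bigram is replaced by that bigram.
    out: List[str] = []
    previous = None
    last_emitted_was_gram = False
    for item in stream:
        if previous is not None and not item[1]:
            out.append(previous[0])
            last_emitted_was_gram = previous[1]
        previous = item
    # The final unigram is dropped when the token emitted just before it was a bigram.
    if previous is not None and not last_emitted_was_gram:
        out.append(previous[0])
    return out


words = ["是", "的", "Is"]
for text in (["我", "是", "中", "國", "人"],
             ["這", "是", "我", "的", "飯"],
             ["this", "is", "my", "food"]):
    print(common_grams(text, words, ignore_case=True, query_mode=True))
```

In this model, 中 and my stay in the output because they do not start a bigram of their own, whereas the trailing 飯 is dropped because the token emitted just before it is the bigram 的_飯.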
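# conditional token filter
# Purpose : applies the listed filters only to tokens that satisfy the script's condition
# Options :
# 1. filter : the filters to apply to matching tokens
# 2. script : the predicate script that decides whether a token matches
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type" : "condition",
    "filter" : ["lowercase"],
    "script" : {
      "source": "token.getTerm().length() < 5"
    }
  }],
  "text": ["THE QUICK BROWN FOX"]
}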
# Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "QUICK",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "BROWN",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
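The same idea, a predicate that gates a chain of filters, can be expressed in plain Python. This is only an analogy for how the conditional filter behaves on the example above; `conditional_filter` is a hypothetical helper, and the lambda stands in for the Painless script.

```python
from typing import Callable, Iterable, List


def conditional_filter(tokens: Iterable[str],
                       predicate: Callable[[str], bool],
                       filters: List[Callable[[str], str]]) -> List[str]:
    """Apply `filters` in order, but only to tokens for which `predicate` is true."""
    out = []
    for token in tokens:
        if predicate(token):
            for apply_filter in filters:
                token = apply_filter(token)
        # Tokens that fail the predicate pass through unchanged.
        out.append(token)
    return out


result = conditional_filter(
    ["THE", "QUICK", "BROWN", "FOX"],
    predicate=lambda term: len(term) < 5,  # mirrors token.getTerm().length() < 5
    filters=[str.lower],                   # mirrors "filter": ["lowercase"]
)
print(result)  # ['the', 'QUICK', 'BROWN', 'fox']
```

Tokens of five or more characters skip the filter chain entirely, which is why QUICK and BROWN keep their case.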
# decimal digit token filter
# Purpose : converts decimal digits from other scripts into the ASCII digits 0-9
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["decimal_digit"],
  "text": ["6.7 १ १-one two-२ ३ "]
}
# Result
{
  "tokens" : [
    {
      "token" : "6.7 1 1-one two-2 3 ",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    }
  ]
}
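The digit folding can be reproduced outside Elasticsearch with Python's standard `unicodedata` module, which helps when checking what a given input should turn into. This is a sketch of the same idea rather than the filter's own code; `fold_decimal_digits` is a name chosen for this example.

```python
import unicodedata


def fold_decimal_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent (0-9)."""
    return "".join(
        # str.isdecimal() is true only for characters that have a decimal value,
        # so unicodedata.decimal() is safe to call on them.
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


print(fold_decimal_digits("6.7 १ १-one two-२ ३ "))
```

Letters, punctuation and whitespace are left untouched, so the string keeps its length, consistent with the `end_offset` of 20 in the result.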