How an Elasticsearch match query matches analyzed (tokenized) text
阿新 • Published: 2019-02-17
1. Simulating how a string is stored
localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=全能片(前)---TRW-GDB7891AT剎車片自帶報警線,無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片
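- The index is `yigo-redist.1`.
- The request uses the analyzer `default` defined in the index `yigo-redist.1`; this is the analyzer applied when data is stored.
- The text being analyzed is "全能片(前)---TRW-GDB7891AT剎車片自帶報警線,無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片".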
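The request described above can also be prepared from a script. The following is a minimal sketch that only builds the URL and JSON body for the `_analyze` endpoint, without sending anything. The helper name `build_analyze_request` and the `host` default are illustrative assumptions, not part of the original setup. It uses the JSON-body form of `_analyze` (fields `analyzer` and `text`), which, to my knowledge, newer Elasticsearch releases expect in place of the query-string parameters shown in the URL above.

```python
import json
from urllib.parse import quote


def build_analyze_request(index, analyzer, text, host="http://localhost:9200"):
    """Return (url, body) for a POST to the index-level _analyze endpoint.

    `host` is an assumed default; adjust it to the actual cluster address.
    """
    url = f"{host}/{quote(index)}/_analyze"
    # ensure_ascii=False keeps Chinese text readable in the request body.
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return url, body


url, body = build_analyze_request("yigo-redist.1", "default", "TRW-GDB7891AT")
print(url)
print(body)
```

The returned pair can be handed to any HTTP client with a `Content-Type: application/json` header.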
Suppose the result is:
{ "tokens" : [ { "token" : "全能", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 1 }, { "token" : "片", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 }, { "token" : "前", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 3 }, { "token" : "trw-gdb7891at", "start_offset" : 9, "end_offset" : 22, "type" : "LETTER", "position" : 4 }, { "token" : "剎車片", "start_offset" : 22, "end_offset" : 25, "type" : "CN_WORD", "position" : 5 }, { "token" : "自帶", "start_offset" : 25, "end_offset" : 27, "type" : "CN_WORD", "position" : 6 }, { "token" : "報警", "start_offset" : 27, "end_offset" : 29, "type" : "CN_WORD", "position" : 7 }, { "token" : "線", "start_offset" : 29, "end_offset" : 30, "type" : "CN_CHAR", "position" : 8 }, { "token" : "無", "start_offset" : 31, "end_offset" : 32, "type" : "CN_WORD", "position" : 9 }, { "token" : "單獨", "start_offset" : 32, "end_offset" : 34, "type" : "CN_WORD", "position" : 10 }, { "token" : "報警", "start_offset" : 34, "end_offset" : 36, "type" : "CN_WORD", "position" : 11 }, { "token" : "線", "start_offset" : 36, "end_offset" : 37, "type" : "CN_CHAR", "position" : 12 }, { "token" : "號碼", "start_offset" : 37, "end_offset" : 39, "type" : "CN_WORD", "position" : 13 }, { "token" : "卡", "start_offset" : 40, "end_offset" : 41, "type" : "CN_CHAR", "position" : 14 }, { "token" : "仕", "start_offset" : 41, "end_offset" : 42, "type" : "CN_WORD", "position" : 15 }, { "token" : "歐", "start_offset" : 42, "end_offset" : 43, "type" : "CN_WORD", "position" : 16 }, { "token" : "卡", "start_offset" : 44, "end_offset" : 45, "type" : "CN_CHAR", "position" : 17 }, { "token" : "仕", "start_offset" : 45, "end_offset" : 46, "type" : "CN_WORD", "position" : 18 }, { "token" : "歐", "start_offset" : 46, "end_offset" : 47, "type" : "CN_WORD", "position" : 19 }, { "token" : "乘用車", "start_offset" : 48, "end_offset" : 51, "type" : "CN_WORD", "position" : 20 }, { "token" : "剎車片", "start_offset" : 52, "end_offset" : 55, 
"type" : "CN_WORD", "position" : 21 } ] }
2. Analyzing the search keyword
localhost:9200/yigo-redist.1/_analyze?analyzer=default_search&text=gdb7891
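- The index is `yigo-redist.1`.
- The request uses the analyzer `default_search` defined in the index `yigo-redist.1`; this is the analyzer applied to search keywords.
- The text being analyzed is "gdb7891".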
{ "tokens" : [ { "token" : "gdb7891", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 } ] }
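To compare analyzer outputs, it helps to reduce an `_analyze` response to its list of distinct terms. The sketch below does this for the response shown above. The function name `distinct_tokens` is hypothetical, and it skips empty strings and duplicates, which an analyzer's output may contain.

```python
import json

# The _analyze response from this step, copied as a string.
response = (
    '{ "tokens" : [ { "token" : "gdb7891", "start_offset" : 0, '
    '"end_offset" : 7, "type" : "LETTER", "position" : 1 } ] }'
)


def distinct_tokens(analyze_response):
    """Return the non-empty tokens of an _analyze response, in first-seen order."""
    seen = []
    for entry in json.loads(analyze_response)["tokens"]:
        token = entry["token"]
        if token and token not in seen:
            seen.append(token)
    return seen


print(distinct_tokens(response))  # ['gdb7891']
```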
3. Analyzing the keyword with the index-time analyzer
localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=gdb7891
- The index is `yigo-redist.1`.
- The request uses the analyzer `default` defined in the index `yigo-redist.1`, the same analyzer used when the data was stored.
- The text being analyzed is "gdb7891".
{ "tokens" : [ { "token" : "gdb7891", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 }, { "token" : "", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 }, { "token" : "gdb7891", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 }, { "token" : "", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 2 }, { "token" : "gdb", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 2 }, { "token" : "gdb", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 2 }, { "token" : "7891", "start_offset" : 3, "end_offset" : 7, "type" : "ARABIC", "position" : 3 }, { "token" : "7891", "start_offset" : 3, "end_offset" : 7, "type" : "ARABIC", "position" : 3 }, { "token" : "", "start_offset" : 3, "end_offset" : 7, "type" : "ARABIC", "position" : 3 } ] }
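This step only inspects tokens. To actually search with the index-time analyzer, a match query can name the analyzer explicitly through its `analyzer` option. The following is a minimal sketch of building such a request body. The field name `description` is taken from the wildcard example in the summary, and the helper `match_with_analyzer` is a hypothetical name used for illustration.

```python
import json


def match_with_analyzer(field, keyword, analyzer):
    """Build a match query body that overrides the search-time analyzer."""
    return {
        "query": {
            "match": {
                field: {
                    "query": keyword,      # the raw keyword typed by the user
                    "analyzer": analyzer,  # analyzer applied to the keyword
                }
            }
        }
    }


body = match_with_analyzer("description", "gdb7891", "default")
print(json.dumps(body, indent=2))
```

Without the `analyzer` option, the keyword would be analyzed with the search analyzer instead, which is the situation examined in step 2.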
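Summary
- Step 1 shows that the stored text "全能片(前)---TRW-GDB7891AT剎車片自帶報警線,無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片" is split into many tokens, which are then stored in the index.
- Step 2 shows that the search analyzer (`default_search`) does not split the keyword "gdb7891": the only term available for lookup is "gdb7891". None of the tokens produced in step 1 is "gdb7891"; the closest is "trw-gdb7891at". A match query compares whole terms, so an ordinary match query cannot match the data indexed in step 1.
- Step 3 shows that with the same analyzer as at index time, "gdb7891" is split into "gdb", "7891", and so on, and either of these two fragments can find the data indexed in step 1 (provided the index-time analysis also produced them for "TRW-GDB7891AT"; the token list shown in step 1 contains only "trw-gdb7891at"). Because the keyword is split, the query also returns more matches: anything that matches "gdb", "7891", or "gdb7891".
- To retrieve the step 1 data while keeping the `default_search` analyzer, use a wildcard query. The pattern is compared directly against the indexed terms, so "*gdb7891*" matches the term "trw-gdb7891at":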
{
"query": {
"wildcard" : { "description" : "*gdb7891*" }
}
}
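The difference between the two query types can be reproduced offline with a small simulation. This is a simplified model under stated assumptions, not Elasticsearch's implementation: a match query is treated as exact equality between query tokens and indexed terms, and a wildcard query as a glob pattern tested against each whole indexed term. Python's `fnmatchcase` stands in for the wildcard matcher (it also accepts `[...]` sets, which Elasticsearch wildcards do not). The term set is the distinct tokens from step 1, and the function names are illustrative.

```python
from fnmatch import fnmatchcase

# Distinct terms from the step 1 _analyze output.
INDEXED_TERMS = {
    "全能", "片", "前", "trw-gdb7891at", "剎車片", "自帶", "報警", "線",
    "無", "單獨", "號碼", "卡", "仕", "歐", "乘用車",
}


def match_hits(query_tokens, indexed_terms):
    """Model of a match query: any query token equal to an indexed term."""
    return any(token in indexed_terms for token in query_tokens)


def wildcard_hits(pattern, indexed_terms):
    """Model of a wildcard query: the pattern must cover a whole indexed term."""
    return any(fnmatchcase(term, pattern) for term in indexed_terms)


# Step 2: the search analyzer yields the single token "gdb7891".
print(match_hits(["gdb7891"], INDEXED_TERMS))     # False
# Wildcard pattern from the query above.
print(wildcard_hits("*gdb7891*", INDEXED_TERMS))  # True
```

The pattern is written in lowercase because it is compared against the lowercased indexed term "trw-gdb7891at".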