Elasticsearch (11): Implementing search suggestions with the ngram tokenization mechanism
阿新 · Published 2018-11-16
Reposted from Jianshu. Original article: Elasticsearch通過ngram分詞機制實現搜尋推薦
1. What is an ngram?
Take the English word quick; its ngrams at the 5 possible lengths are:
ngram length = 1: q u i c k
ngram length = 2: qu ui ic ck
ngram length = 3: qui uic ick
ngram length = 4: quic uick
ngram length = 5: quick
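If you want to see these ngrams generated for yourself, the _analyze API is handy. The sketch below assumes Elasticsearch 5.x or later, where _analyze accepts an inline tokenizer definition; here it asks for length-2 ngrams only:
GET /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2
  },
  "text": "quick"
}
This should return the tokens qu, ui, ic and ck.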
2. What is an edge ngram?
For the word quick, edge ngrams are anchored at the first letter and grow from there:
q
qu
qui
quic
quick
Edge ngrams split every word into all of its prefixes at index time, and those prefix terms are what make prefix search suggestions (search-as-you-type) possible.
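As a quick check (again assuming a 5.x+ cluster where _analyze accepts an inline tokenizer definition), the edge ngrams of quick can be reproduced like this:
GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5
  },
  "text": "quick"
}
which should emit q, qu, qui, quic and quick.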
For example, suppose two documents are indexed:

doc1: hello world
doc2: hello we

The edge-ngram terms in the inverted index then look roughly like this:

h     -> doc1, doc2
he    -> doc1, doc2
hel   -> doc1, doc2
hell  -> doc1, doc2
hello -> doc1, doc2
w     -> doc1, doc2
we    -> doc2
wo    -> doc1
wor   -> doc1
worl  -> doc1
world -> doc1
Now search for hello w: both doc1 and doc2 contain the terms hello and w, and the positions line up, so doc1 and doc2 are returned.
At search time there is no need to take the prefix and scan the whole inverted index; the prefix is simply looked up in the inverted index like any other term, and if it is found, that is all there is to it.
3. The min and max parameters
min_gram = 1
max_gram = 3
These set the minimum and maximum ngram lengths in characters (here, a minimum of 1 and a maximum of 3).
For example, for the word helloworld the generated edge ngrams are:
h
he
hel
It stops at three characters, the configured maximum.
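A minimal sketch to verify this behaviour, assuming _analyze with an inline filter definition (Elasticsearch 5.x+); the keyword tokenizer keeps helloworld as a single token, so the edge_ngram filter is the only thing splitting it:
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 3
    }
  ],
  "text": "helloworld"
}
The output should be exactly h, he and hel.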
4. Trying out ngram
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
Note: why is search_analyzer set to standard here instead of autocomplete?
Because at search time there is no need to break the query down into every prefix. For a search like hello w, splitting it into just hello and w is enough; there is no point producing:
h
he
hel
hell
hello
w
Doing that would only make the search less efficient.
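You can see the difference directly by running both analyzers against the same query string. This is just a sketch against the my_index created above (the autocomplete analyzer only exists inside that index, so the index-level _analyze endpoint is used, and as in the earlier sketches a 5.x+ style request body is assumed):
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello w"
}

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "hello w"
}
The first request should produce h, he, hel, hell, hello and w; the second just hello and w, which is all we need at query time.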
Insert 4 documents:
PUT /my_index/my_type/1
{
  "title": "hello world"
}

PUT /my_index/my_type/2
{
  "title": "hello we"
}

PUT /my_index/my_type/3
{
  "title": "hello win"
}

PUT /my_index/my_type/4
{
  "title": "hello dog"
}
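If you want to confirm what actually ended up in the inverted index for these documents, the term vectors API is one option. A sketch, using the old type-in-the-URL style that matches the rest of this post; whether term vectors can be computed on the fly like this is an assumption worth checking on your version:
GET /my_index/my_type/1/_termvectors?fields=title
For doc 1 (hello world) the response should list terms such as h, he, hel, hell, hello, w, wo, wor, worl and world, each with its position and offsets.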
Run a search:
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}
The result:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": {
          "title": "hello we"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": {
          "title": "hello win"
        }
      }
    ]
  }
}
match_phrase normally matches only when the whole phrase appears, so why does it return three documents here?
Because the mapping gave title the autocomplete (edge ngram) analyzer, the index contains every prefix of every word, while at search time the standard analyzer splits the query into just hello and w. Both terms are present, in the right relative positions, in hello we, hello world and hello win, so all three documents match, and the lookup stays fast.