Elasticsearch Tutorial: Proximity Match
For the index of this Elasticsearch series, see: Elasticsearch Tutorial: Series Index
1. What is proximity matching
Consider two sentences:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.
A match query that searches for java spark:
{
"match": {
"content": "java spark"
}
}
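A match query only finds documents that contain java or spark; it has no notion of whether the two terms are close to each other.
Documents that contain java, or spark, or both, are all returned. We cannot tell in which document java and spark are closer together. If we want to search for java spark with nothing else in between, can a match full-text search do it? No, it cannot.
If we want documents in which java and spark are close together to be returned first, with a higher relevance score, we need proximity match.
Suppose we have two requirements:
1. java spark must be adjacent, with nothing else in between; only such docs should be returned
2. java spark may be separated, but the closer the two words are, the higher the doc's score and rank
Neither requirement can be met with a match full-text query; they need phrase match and proximity match.
phrase match, proximity match
This part covers phrase match, which returns only the docs in which java and spark are adjacent. A doc such as java use'd spark does not qualify; a doc such as java spark are very good friends does.
phrase match treats several terms as one phrase and searches for them together; only docs containing the phrase are returned. This differs from match, where searching for java spark also returns docs containing only java or only spark.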
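The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration run against the two sentences above: `tokenize`, `match_any` and `match_phrase` are hypothetical helpers, and the regular expression is an assumption that merely approximates the standard analyzer.

```python
import re

def tokenize(text):
    # Assumption: lowercased word tokens approximate the standard analyzer.
    return re.findall(r"[a-z0-9']+", text.lower())

def match_any(query, text):
    """match-style check: the text contains at least one query term."""
    doc_terms = set(tokenize(text))
    return any(term in doc_terms for term in tokenize(query))

def match_phrase(query, text):
    """phrase-style check: the query terms occur consecutively and in order."""
    q, d = tokenize(query), tokenize(text)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

docs = [
    "java is my favourite programming language, and I also think spark is a very good big data system.",
    "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.",
    "hello world java",
]

for text in docs:
    # Prints: match result, phrase result, start of the text
    print(match_any("java spark", text), match_phrase("java spark", text), text[:30])
```

All three texts pass the match-style check, but only the second one, where java and spark are adjacent, passes the phrase-style check.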
2. match_phrase
GET /forum/article/_search
{
"query": {
"match": {
"content": "java spark"
}
}
}
Docs that contain only java are returned as well, which is not what we want.
POST /forum/article/5/_update
{
"doc": {
"content": "spark is best big data solution based on scala ,an programming language similar to java spark"
}
}
The update above sets the content of one doc so that it contains the phrase java spark.
match_phrase syntax
GET /forum/article/_search
{
"query": {
"match_phrase": {
"content": "java spark"
}
}
}
This works: only the doc that contains the phrase java spark is returned; docs that contain only java are not.
3. Term position
hello world, java spark    doc1
hi, spark java             doc2
hello    doc1(0)
world    doc1(1)
java     doc1(2) doc2(2)
spark    doc1(3) doc2(1)
Each token is assigned a position during analysis, which can be inspected with the _analyze API:
GET _analyze
{
"text": "hello world, java spark",
"analyzer": "standard"
}
{
"tokens": [
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "java",
"start_offset": 13,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "spark",
"start_offset": 18,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 3
}
]
}
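To make the idea of positions concrete, here is a minimal Python sketch that builds a positional inverted index for the two example documents of this section. The `tokenize` and `build_positional_index` helpers are hypothetical, and the `\w+` tokenizer is only an assumption standing in for the standard analyzer.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercased \w+ tokens approximate the standard analyzer.
    return re.findall(r"\w+", text.lower())

def build_positional_index(docs):
    """Map term -> doc id -> list of positions at which the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term][doc_id].append(position)
    return index

docs = {"doc1": "hello world, java spark", "doc2": "hi, spark java"}
index = build_positional_index(docs)

for term in ("hello", "world", "java", "spark"):
    print(term, dict(index[term]))
```

The printed postings carry the same positions as the term table of this section, for example java at position 2 in both docs and spark at position 3 in doc1 and position 1 in doc2.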
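4. How match_phrase works
match_phrase relies on the positions stored in the index:
hello world, java spark    doc1
hi, spark java             doc2
hello    doc1(0)
world    doc1(1)
java     doc1(2) doc2(2)
spark    doc1(3) doc2(1)
java spark --> match phrase
java spark --> split into java and spark
java --> doc1(2) doc2(2)
spark --> doc1(3) doc2(1)
First find the docs that contain every term; a doc must contain all the terms before it is considered any further.
doc1 --> java and spark --> java is at position 2 and spark at position 3, so spark's position is exactly 1 greater than java's, which satisfies the condition
doc1 matches
doc2 --> java and spark --> java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone rule it out, so doc2 does not match
This principle is essential, because proximity match, covered next, works in exactly the same way.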
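The two steps above (intersect the doc lists, then compare positions) can be sketched as follows. This is an illustrative model rather than Elasticsearch's actual implementation; `phrase_match` is a hypothetical helper and the index is typed in by hand from the table above.

```python
def phrase_match(index, phrase_terms):
    """Return ids of docs in which the terms occur at consecutive positions, in order."""
    if not phrase_terms:
        return []
    postings = [index.get(term, {}) for term in phrase_terms]
    # Step 1: keep only docs that contain every term.
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc_id in sorted(candidates):
        # Step 2: term i must sit exactly i positions after the first term.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(phrase_terms))):
                hits.append(doc_id)
                break
    return hits

# Positional index copied from the table above: term -> doc -> positions.
index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
}

print(phrase_match(index, ["java", "spark"]))   # doc1 only
print(phrase_match(index, ["spark", "java"]))   # doc2 only
```

Both docs survive step 1 for java spark, but only doc1 passes the position check, which mirrors the reasoning above.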
slop
Note: proximity match = match_phrase + slop
GET /forum/article/_search
{
"query": {
"match_phrase": {
"title": {
"query": "java spark",
"slop": 1
}
}
}
}
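What slop means
slop is the number of moves the terms of the query string need in order to match a document.
A worked example: count how many moves a query string needs before it matches a document, then set slop accordingly.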
hello world, java is very good, spark is also very good.
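A match_phrase query for java spark does not find it.
If we specify slop, the terms of java spark are allowed to move in an attempt to match the doc:
java    is      very    good    spark   is
java    spark
java    -->     spark
java            -->     spark
java                    -->     spark
Here the slop is 3: after spark has moved 3 times, the phrase java spark lines up with the doc.
More precisely, slop is not the exact number of moves; it is the maximum number of moves the query terms are allowed while trying to match a doc.
So a slop of 3 is enough: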
GET /forum/article/_search
{
"query": {
"match_phrase": {
"title": {
"query": "java spark",
"slop": 3
}
}
}
}
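This query matches the doc above, so the doc is returned in the results.
If slop is set to 2, however, spark may move at most 2 times, which is not enough to match the doc, so the doc is not returned.
Verifying what slop means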
GET /forum/article/_search
{
"query": {
"match_phrase": {
"content": {
"query": "spark data",
"slop": 3
}
}
}
}
spark is best big data solution based on scala ,an programming language similar to java spark
spark   is      best    big     data
spark   data
spark   -->     data
spark           -->     data
spark                   -->     data
GET /forum/article/_search
{
"query": {
"match_phrase": {
"content": {
"query": "data spark",
"slop": 5
}
}
}
}
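spark   is      best    big     data
data    spark
-->     data/spark                      (move 1: data steps right onto spark's position)
spark   <-- data                        (move 2: spark steps left, the two terms have swapped)
spark   -->     data                    (move 3)
spark           -->     data            (move 4)
spark                   -->     data    (move 5)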
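The move counts worked out above (3 for java spark, 3 for spark data, 5 for data spark) can be reproduced with a small calculation. The sketch below is a simplified model and an assumption of this example, not code taken from Elasticsearch: for every query term it computes how far the term sits in the doc relative to its place in the query, and takes the spread of those shifts as the slop required. `tokenize` and `required_slop` are hypothetical names.

```python
import re
from itertools import product

def tokenize(text):
    # Assumption: lowercased \w+ tokens approximate the standard analyzer.
    return re.findall(r"\w+", text.lower())

def required_slop(query, text):
    """Smallest slop with which the query could match, or None if a term is missing."""
    q, d = tokenize(query), tokenize(text)
    positions = [[i for i, tok in enumerate(d) if tok == term] for term in q]
    if not q or any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # each query term needs its own position in the doc
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        width = max(shifts) - min(shifts)
        best = width if best is None else min(best, width)
    return best

doc_a = "hello world, java is very good, spark is also very good."
doc_b = "spark is best big data solution based on scala ,an programming language similar to java spark"

print(required_slop("java spark", doc_a))   # 3
print(required_slop("spark data", doc_b))   # 3
print(required_slop("data spark", doc_b))   # 5
```

A query matches when the configured slop is at least the value returned here, which is why a slop of 2 fails for java spark while a slop of 3 succeeds. Reversing the term order costs two extra moves, as the data spark case shows.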
With slop, the closer the query terms are to each other in a doc, the higher its relevance score:
GET /forum/article/_search
{
"query": {
"match_phrase": {
"content": {
"query": "java best",
"slop": 15
}
}
}
}
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.65380025,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.65380025,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language",
"sub_title": "learned a lot of course",
"author_first_name": "Smith",
"author_last_name": "Williams",
"new_author_last_name": "Williams",
"new_author_first_name": "Smith"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.07111243,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2017-03-01",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java spark",
"sub_title": "haha, hello world",
"author_first_name": "Tonny",
"author_last_name": "Peter Smith",
"new_author_last_name": "Peter Smith",
"new_author_first_name": "Tonny"
}
}
]
}
}
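Why doc 2 scores so much higher than doc 5 can be illustrated by computing how far apart java and best are in each content field. The sketch below assumes the 1 / (distance + 1) weighting that Lucene's classic similarity applies to sloppy phrase matches. The real _score also depends on other factors such as field length and term statistics, so only the ordering is meaningful here. `phrase_distance` and `sloppy_weight` are hypothetical helpers.

```python
import re

def tokenize(text):
    # Assumption: lowercased \w+ tokens approximate the standard analyzer.
    return re.findall(r"\w+", text.lower())

def phrase_distance(query, text):
    """Moves needed for the query terms to line up with their first occurrence in the text.

    Raises ValueError if a query term does not occur in the text.
    """
    doc_tokens = tokenize(text)
    shifts = [doc_tokens.index(term) - i for i, term in enumerate(tokenize(query))]
    return max(shifts) - min(shifts)

def sloppy_weight(distance):
    # Assumed weighting: the further apart the terms, the smaller the contribution.
    return 1.0 / (distance + 1)

doc2 = "i think java is the best programming language"
doc5 = "spark is best big data solution based on scala ,an programming language similar to java spark"

for name, text in (("doc 2", doc2), ("doc 5", doc5)):
    distance = phrase_distance("java best", text)
    print(name, distance, round(sloppy_weight(distance), 4))
```

In doc 2 the two terms need 2 moves, while in doc 5 they need 13, which is still within the slop of 15 but yields a much smaller weight. That is consistent with the order of the two hits in the response above.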
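In fact, a phrase match with slop is exactly what proximity match means.
1. java spark as an exact phrase in the doc: phrase match
2. java spark may be some distance apart, but the closer they are, the earlier the doc is returned: proximity match
Recall
Suppose you search for java spark and there are 100 docs in total; recall is about how many of those docs are returned as results.
Precision
Suppose you search for java spark; precision is about whether the docs that contain the phrase java spark, or contain java and spark close together, are ranked first.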
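As a small worked example, recall and precision can be computed with their standard information-retrieval definitions, which are slightly stricter than the informal descriptions above. All doc ids below are hypothetical and chosen only to contrast a broad match query with a strict match_phrase query.

```python
def recall(returned, relevant):
    """Share of the relevant docs that were actually returned."""
    return len(returned & relevant) / len(relevant)

def precision(returned, relevant):
    """Share of the returned docs that are actually relevant."""
    return len(returned & relevant) / len(returned)

# Hypothetical doc ids, for illustration only.
relevant = {1, 2, 3, 4}                  # docs the user would consider useful
match_hits = {1, 2, 3, 4, 5, 6, 7, 8}    # broad query: any term may match
phrase_hits = {1, 2}                     # strict query: the exact phrase only

print("match ", recall(match_hits, relevant), precision(match_hits, relevant))
print("phrase", recall(phrase_hits, relevant), precision(phrase_hits, relevant))
```

With these numbers the broad query reaches a recall of 1.0 at a precision of 0.5, while the strict query reaches a precision of 1.0 at a recall of 0.5, which is the trade-off this section goes on to address.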
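Using match_phrase on its own requires every term to appear in the doc field, within the distance allowed by slop, before the doc can match.
Both match phrase and proximity match require a doc to contain all the terms; a doc that lacks even one term is not returned.
java spark --> hello world java --> not returned
java spark --> hello world, java spark --> returned
So with proximity match alone, recall is low while precision is high.
Sometimes, though, we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want match_phrase's distance-based scoring, so that docs whose terms are closer together score higher and are returned first.
In other words, satisfy recall first: for java spark, return docs containing java, docs containing spark, and docs containing both. Then take care of precision: docs that contain both java and spark, with the two terms closer together, rank first.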
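This can be achieved by combining a match query and a match_phrase query in a bool query.
The match clause under must matches docs containing java, spark, or both. Docs containing both rank higher, but match cannot take the distance between java and spark into account, so a doc in which the two terms are close together is not guaranteed to rank first.
The match_phrase clause under should contributes its relevance score to every doc that java spark matches within the slop; the closer java and spark are, the higher that score.
GET /forum/article/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": {
"query": "java spark"
}
}
},
"should": {
"match_phrase": {
"title": {
"query": "java spark",
"slop": 50
}
}
}
}
}
}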
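When such requests are assembled in application code, the body can be built as a plain dictionary and serialized to JSON. The helper below is a hypothetical sketch (`recall_plus_proximity_query` is not part of any Elasticsearch client); it reproduces the structure of the request above for an arbitrary field, query text and slop.

```python
import json

def recall_plus_proximity_query(field, text, slop):
    """Build a bool query: match for recall, match_phrase with slop for ranking."""
    return {
        "query": {
            "bool": {
                # must: every returned doc has to match at least one term.
                "must": {"match": {field: {"query": text}}},
                # should: optional clause that only adds score when the terms are close.
                "should": {"match_phrase": {field: {"query": text, "slop": slop}}},
            }
        }
    }

body = recall_plus_proximity_query("title", "java spark", 50)
print(json.dumps(body, indent=2))
```

Because the phrase clause sits under should, a doc that fails it is still returned; it simply receives no extra score from that clause.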
Comparing query results with and without the proximity match clause
GET /forum/article/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "java spark"
}
}
]
}
}
}
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.68640786,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.68640786,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language",
"sub_title": "learned a lot of course",
"author_first_name": "Smith",
"author_last_name": "Williams",
"new_author_last_name": "Williams",
"new_author_first_name": "Smith",
"followers": [
"Tom",
"Jack"
]
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.68324494,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2017-03-01",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java spark",
"sub_title": "haha, hello world",
"author_first_name": "Tonny",
"author_last_name": "Peter Smith",
"new_author_last_name": "Peter Smith",
"new_author_first_name": "Tonny",
"followers": [
"Jack",
"Robbin Li"
]
}
}
]
}
}
GET /forum/article/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "java spark"
}
}
],
"should": [
{
"match_phrase": {
"content": {
"query": "java spark",
"slop": 50
}
}
}
]
}
}
}
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.258609,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 1.258609,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2017-03-01",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java spark",
"sub_title": "haha, hello world",
"author_first_name": "Tonny",
"author_last_name": "Peter Smith",
"new_author_last_name": "Peter Smith",
"new_author_first_name": "Tonny",
"followers": [
"Jack",
"Robbin Li"
]
}
},
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.68640786,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language",
"sub_title": "learned a lot of course",
"author_first_name": "Smith",
"author_last_name": "Williams",
"new_author_last_name": "Williams",
"new_author_first_name": "Smith",
"followers": [
"Tom",
"Jack"
]
}
}
]
}
}
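Differences between match and phrase match (proximity match)
match --> as soon as a term is found, the docs for that term can be returned as results; it only has to scan the inverted index
phrase match --> first fetch the doc list of every term; then find the docs that contain all the terms; then, for each doc, check whether the positions of the terms fall within the required range; with slop, a more complex computation is needed to decide whether the terms can be moved within the slop to match the doc
A match query is much faster than phrase match and proximity match (with slop), because the latter two have to compute position distances.
As a rough guide, a match query is about 10 times faster than phrase match and about 20 times faster than proximity match.
This is rarely a concern, though, because Elasticsearch queries generally run in milliseconds: a match query typically takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds, which is still acceptable.
To optimize proximity match, reduce the number of documents it has to examine. The idea is to select the documents with a match query first, and then use proximity match to raise the score of docs according to term distance. The proximity match is applied only to the top n docs of each shard and adjusts their scores; this process is called rescoring. Since users usually page through results and only look at the first few pages, there is no need to run proximity match against every result.
In the terms used earlier, match + proximity match gives both recall and precision.
By default, if match finds 1000 docs, proximity match has to run against every one of them, checking whether the terms can be moved within the slop to match and then contributing its score.
In many cases, though, users page through those 1000 docs and look at the first few pages at most; with 10 results per page and perhaps 5 pages viewed, that is 50 docs.
proximity match then only needs to run against the top 50 docs and contribute its score to them, rather than computing and scoring all 1000.
rescore
match: 1000 docs, each of which already has a score; proximity match: rescore only the top 50 docs, so that among them the docs whose terms are closer together rank higher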
GET /forum/article/_search
{
"query": {
"match": {
"content": "java spark"
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query": {
"match_phrase": {
"content": {
"query": "java spark",
"slop": 50
}
}
}
}
}
}
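The effect of window_size can be sketched in Python for a single shard. This is an illustrative model, assuming the second-pass score is simply added to the first-pass score, which corresponds to rescoring with its default settings. `rescore_top_n` is a hypothetical helper and all scores are made up.

```python
def rescore_top_n(hits, rescore_scores, window_size):
    """Re-rank only the first window_size hits of one shard.

    hits: (doc_id, score) pairs sorted by score, highest first (first-pass match query).
    rescore_scores: doc_id -> score of the second-pass query, for the docs it matches.
    """
    window, rest = hits[:window_size], hits[window_size:]
    # Assumption: the final score is the sum of both scores; docs the second
    # query does not match keep their first-pass score.
    rescored = [(doc_id, score + rescore_scores.get(doc_id, 0.0)) for doc_id, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest

# Hypothetical first-pass hits and proximity scores.
hits = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
proximity_scores = {"b": 0.5, "d": 0.9}

result = rescore_top_n(hits, proximity_scores, window_size=2)
print([doc_id for doc_id, _ in result])
```

With a window of 2, doc b overtakes doc a, while doc d stays last even though its proximity score is high, because it lies outside the window and is never rescored. That is the trade-off rescoring makes: less work per query in exchange for leaving the lower-ranked hits untouched.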