Elasticsearch Tutorial: Proximity Match

For an index of the whole Elasticsearch series, see: Elasticsearch Tutorial: Summary

1. What is proximity matching

Take two sentences:

java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.

A match query that searches for java spark:

{
	"match": {
		"content": "java spark"
	}
}

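A match query finds documents that contain java or spark, but it has no idea whether the two terms are close together.

Docs that contain java, docs that contain spark, and docs that contain both are all returned, and nothing tells us in which doc java and spark sit nearest each other. If we want to find java spark exactly, with nothing inserted between the two words, can full-text search with match handle it? No, it cannot.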
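The behaviour described above can be imitated with a minimal Python sketch. It is only a toy model of match with the default OR operator, not how Elasticsearch is implemented: `tokenize` is a rough stand-in for the standard analyzer, and doc 3 is a hypothetical document added to show that a doc containing only spark is returned as well.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

docs = {
    1: "java is my favourite programming language, and I also think spark is a very good big data system.",
    2: "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.",
    3: "spark streaming processes real-time data",  # hypothetical doc, contains only spark
}

def match_any(query, docs):
    """Return the ids of docs containing at least one query term, ignoring positions."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items() if terms & set(tokenize(text))]

print(match_any("java spark", docs))  # [1, 2, 3]
```

All three docs come back, and nothing in the result says that java and spark are adjacent only in doc 2.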

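If we want documents in which java and spark are close together to come back first, with a higher relevance score, we need proximity match.

Suppose we have two requirements:

1. java spark must be adjacent, with nothing in between, and only such docs should be found
2. java spark may be separated, but the closer the two words are, the higher the doc's score and the higher it ranks

Full-text search with match cannot meet either requirement; they call for phrase match and proximity match.

phrase match, proximity match

This part covers phrase match, which finds only the docs where java and spark are adjacent. A doc containing java use'd spark does not qualify; a doc such as java spark are very good friends does.

phrase match treats several terms as one phrase and searches for them together; only docs containing that phrase are returned. This is unlike match on java spark, which also returns docs containing only java and docs containing only spark.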

2. match_phrase

GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

Docs that contain only java are returned too, which is not the result we want.

POST /forum/article/5/_update
{
  "doc": {
    "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
  }
}

The update above sets one doc's content so that it happens to contain the phrase java spark.

match_phrase syntax:

GET /forum/article/_search
{
    "query": {
        "match_phrase": {
            "content": "java spark"
        }
    }
}

It works: only the doc containing the phrase java spark is returned, and docs containing only java are not.

3. term position

hello world, java spark        doc1
hi, spark java                 doc2

hello    doc1(0)
world    doc1(1)
hi       doc2(0)
java     doc1(2) doc2(2)
spark    doc1(3) doc2(1)

To see the position each term is given after analysis:
 

GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}
 
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 13,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "spark",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
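To make the idea of positions concrete, here is a minimal Python sketch that builds a positional inverted index for the two example docs. The `tokenize` helper is an assumption standing in for the standard analyzer, and the nested-dict layout is chosen for readability, not taken from Lucene.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map each term to {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {"doc1": "hello world, java spark", "doc2": "hi, spark java"}
index = build_positional_index(docs)
print(index["java"])   # {'doc1': [2], 'doc2': [2]}
print(index["spark"])  # {'doc1': [3], 'doc2': [1]}
```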

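4. How match_phrase works

Positions in the index, and how match_phrase uses them:

hello world, java spark        doc1
hi, spark java                 doc2

hello    doc1(0)
world    doc1(1)
hi       doc2(0)
java     doc1(2) doc2(2)
spark    doc1(3) doc2(1)

java spark --> match_phrase

java spark --> analyzed into the terms java and spark

java --> doc1(2) doc2(2)
spark --> doc1(3) doc2(1)

First find the docs that contain every term: a doc must contain all of the terms before it is considered any further.

doc1 --> java and spark --> java is at position 2 and spark at position 3, so spark's position is exactly 1 greater than java's, which satisfies the condition

doc1 matches.

doc2 --> java and spark --> java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone rule it out, so doc2 does not match

Make sure you understand this mechanism, because proximity match, covered next, works on exactly the same principle.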
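The two steps above (intersect the doc lists, then compare positions) can be sketched in Python as follows. This is a simplified illustration of the principle, not Lucene's implementation; the `index` literal is just the positional index of doc1 and doc2 written out by hand.

```python
def phrase_match(index, query_terms):
    """Return the doc ids in which the terms occur at consecutive positions, in order."""
    if not query_terms:
        return []
    postings = [index.get(term, {}) for term in query_terms]
    # Step 1: keep only the docs that contain every term.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Step 2: look for a start position p such that term i sits at p + i.
    matches = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(postings))):
                matches.append(doc_id)
                break
    return matches

index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "hi": {"doc2": [0]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
}
print(phrase_match(index, ["java", "spark"]))  # ['doc1']
print(phrase_match(index, ["spark", "java"]))  # ['doc2']
```

Both docs pass step 1 for java spark, but only doc1 passes the position check; reversing the phrase flips the result.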

 

slop

Note: proximity match = match_phrase + slop

GET /forum/article/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "java spark",
                "slop":  1
            }
        }
    }
}

 

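What slop means

slop is the number of times the terms in the query string have to be moved before the query matches a document.

A concrete example: work out how many moves a query string needs in order to match a document, then set slop accordingly.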

hello world, java is very good, spark is also very good.

java spark with a plain match_phrase does not find this doc.

If we specify slop, the terms of java spark are allowed to move in an attempt to match the doc:

java        is        very        good        spark        is

java        spark
java        -->        spark
java                -->        spark
java                        -->            spark

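Here the required slop is 3: in the phrase java spark, spark has to move 3 times before the phrase lines up with the doc.

More precisely, slop is not the exact number of moves; it is the maximum number of moves the terms of the query string may make while trying to match a doc.

With slop set to 3, the doc matches: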
 

GET /forum/article/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "java spark",
                "slop":  3
            }
        }
    }
}

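This query matches the doc above, and the doc is returned as a result.

If slop is set to 2 instead, spark may move at most 2 times, the phrase cannot line up with the doc, and the doc is not returned.

Verifying what slop means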

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "spark data",
        "slop": 3
      }
    }
  }
}

spark  is    best     big    data     solution based on scala ,an programming language similar to java spark

spark data
          --> data
                      --> data
spark                  --> data
 

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "data spark",
        "slop": 5
      }
    }
  }
}

spark        is                best        big            data

data        spark
-->            data/spark
spark        <--data
spark        -->                data
spark                        -->        data
spark                                -->            data
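The slop values worked out by hand above (3 for java spark, 3 for spark data, 5 for data spark) can be checked with a small Python model. It rests on a simplifying assumption: each query term at query offset i "wants" the phrase to start at (doc position - i), and the number of moves is the spread between the largest and smallest wanted start. This reproduces the examples here for phrases with distinct terms, but it is a sketch, not Lucene's actual sloppy phrase algorithm; `tokenize` and `min_slop` are illustrative names.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase and split on non-alphanumerics (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_slop(doc_text, query_text):
    """Smallest slop at which the phrase matches the doc, or None if a term is missing."""
    doc_tokens = tokenize(doc_text)
    query_terms = tokenize(query_text)
    if not query_terms:
        return None
    starts_per_term = []
    for offset, term in enumerate(query_terms):
        # Where the phrase would have to start for this occurrence to be in place.
        starts = [pos - offset for pos, tok in enumerate(doc_tokens) if tok == term]
        if not starts:
            return None
        starts_per_term.append(starts)
    # Try every combination of occurrences and keep the tightest one.
    return min(max(combo) - min(combo) for combo in product(*starts_per_term))

doc_a = "hello world, java is very good, spark is also very good."
doc_b = ("spark is best big data solution based on scala ,an programming "
         "language similar to java spark")

print(min_slop(doc_a, "java spark"))  # 3
print(min_slop(doc_b, "spark data"))  # 3
print(min_slop(doc_b, "data spark"))  # 5
```

Reversing the term order costs two extra moves, because the terms first have to swap places before they can move apart.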

When searching with slop, the closer the query terms are to each other in a doc, the higher its relevance score:

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java best",
        "slop": 15
      }
    }
  }
}
 
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.65380025,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.65380025,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.07111243,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
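A rough way to see why doc 2 scores so much higher than doc 5: Lucene's similarity classes have weighted a sloppy phrase occurrence by 1 / (distance + 1). The Python sketch below takes that formula as an assumption and applies it to the two hits. The real _score also depends on idf, field length and other factors, so the numbers printed here are not expected to equal the scores in the response.

```python
def sloppy_weight(distance):
    """Assumed weighting: an exact phrase (distance 0) counts fully, farther matches count less."""
    return 1.0 / (1.0 + distance)

# Moves needed for the query "java best" in the two hits above:
#   doc 2: "i think java is the best ..."  -> best is 2 positions too far right
#   doc 5: best at position 2, java at position 14 -> 13 moves
for doc_id, distance in [("2", 2), ("5", 13)]:
    print(doc_id, round(sloppy_weight(distance), 4))
# 2 0.3333
# 5 0.0714
```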

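In short, a phrase match with slop added is a proximity match.

1. java spark as an exact phrase in the doc: phrase match
2. java spark some distance apart, with closer matches returned first: proximity match

Recall

Say you search for java spark and the index holds 100 docs: recall is about how many of those docs come back as results.

Precision

When you search for java spark, precision is about whether the docs that contain java spark, or contain java and spark very close together, are ranked at the top.

Using match_phrase on its own means every term must appear in the doc's field, and within the distance slop allows, for the doc to match.

Both phrase match and proximity match require a doc to contain all of the terms before it can be returned; a doc that is missing even one term is not returned.

java spark --> hello world java --> not returned
java spark --> hello world, java spark --> returned

So proximity matching gives low recall and very high precision.

Sometimes, though, we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want match_phrase's distance-based scoring, so that the closer the terms are, the higher the score and the earlier the doc is returned.

That is, recall comes first: for java spark, docs containing java, docs containing spark, and docs containing both are all returned. Precision is then taken care of by ranking first the docs that contain both java and spark with the two terms closest together.

This can be done by combining a match query and a match_phrase query inside bool: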
 

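GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": {
            "query": "java spark"
          }
        }
      },
      "should": {
        "match_phrase": {
          "title": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}

must / match --> matches docs containing java, spark, or both; docs with both terms rank higher, but match cannot tell how far apart java and spark are, so a doc in which they are adjacent is not guaranteed to come first
should / match_phrase --> if java spark matches a doc within slop, this clause adds its own relevance score to that doc; the closer java and spark are, the higher that score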
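The division of labour between the two clauses can be modelled in Python. Everything here is a toy under stated assumptions: the must score is simply the number of query terms present, the should score is 1 / (1 + distance) when the phrase fits within slop, and the three titles are hypothetical. Real Elasticsearch scores are computed differently; only the ranking behaviour is the point.

```python
import re
from itertools import product

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def match_score(doc_tokens, query_terms):
    """Toy 'must' score: number of distinct query terms present (0 means no match)."""
    return float(sum(1 for term in set(query_terms) if term in doc_tokens))

def phrase_score(doc_tokens, query_terms, slop):
    """Toy 'should' score: 1 / (1 + distance) if the phrase fits within slop, else 0."""
    starts = []
    for offset, term in enumerate(query_terms):
        found = [pos - offset for pos, tok in enumerate(doc_tokens) if tok == term]
        if not found:
            return 0.0
        starts.append(found)
    distance = min(max(c) - min(c) for c in product(*starts))
    return 1.0 / (1.0 + distance) if distance <= slop else 0.0

def bool_search(docs, query, slop):
    query_terms = tokenize(query)
    results = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        base = match_score(tokens, query_terms)
        if base == 0:  # the must clause decides whether the doc is returned at all
            continue
        bonus = phrase_score(tokens, query_terms, slop)  # should only adds score
        results.append((doc_id, base + bonus))
    return sorted(results, key=lambda item: item[1], reverse=True)

docs = {  # hypothetical titles
    "a": "java blog",
    "b": "java and big data with spark",
    "c": "java spark tutorial",
}
for doc_id, score in bool_search(docs, "java spark", slop=50):
    print(doc_id, round(score, 3))
# c 3.0
# b 2.2
# a 1.0
```

Doc a is still returned even though it lacks spark (recall), while the doc with the two terms adjacent ranks first (precision).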

Comparing query results: match alone versus match combined with match_phrase (proximity match)

GET /forum/article/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}
 
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith",
          "followers": [
            "Tom",
            "Jack"
          ]
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.68324494,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny",
          "followers": [
            "Jack",
            "Robbin Li"
          ]
        }
      }
    ]
  }
}

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop": 50
            }
          }
        }
      ]
    }
  }
}
 
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny",
          "followers": [
            "Jack",
            "Robbin Li"
          ]
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith",
          "followers": [
            "Tom",
            "Jack"
          ]
        }
      }
    ]
  }
}

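Differences between match and phrase match (proximity match)

match --> as soon as one term is found in the inverted index, the docs listed for that term can be returned as results; a lookup in the inverted index is all it takes

phrase match --> first fetch the doc list of every term; then find the docs that contain all of the terms; then, for each of those docs, check whether the positions of the terms fall within the required range; with slop, further computation is needed to decide whether the terms can be moved into a match within the allowed number of moves

A match query is therefore much faster than phrase match and proximity match (with slop), because the latter two have to compute position distances.
A match query is roughly 10 times faster than phrase match and roughly 20 times faster than proximity match.

Don't worry too much, though: Elasticsearch queries generally take milliseconds. A match query usually takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds, which is still acceptable.

The usual way to speed up proximity match is to reduce the number of documents it has to examine. The idea is to let a match query select the docs first and then use proximity match to raise the score of docs according to term distance, applying proximity match only to the top n docs by score on each shard and adjusting their scores. This process is called rescoring. Users normally page through results and only look at the first few pages, so there is no need to run proximity match over every result.

In the terms used earlier, match + proximity match together deliver both recall and precision.

By default, if match matches 1000 docs, proximity match has to process every one of them, checking whether the terms can be moved into a match within slop and then contributing its score.
In many cases, though, users page through the results and look at only the first few pages: at 10 results per page and at most 5 pages, that is 50 docs.
proximity match then only needs to run on those top 50 docs and contribute its score to them, not on all 1000.

rescore (rescoring)

match: 1000 docs, each already with a score; proximity match: rescore only the top 50, so that among them the docs whose terms are closest together rank first.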

GET /forum/article/_search 
{
  "query": {
    "match": {
      "content": "java spark"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}
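What the rescore request does can be pictured with a short Python sketch. It assumes the scores are combined by addition (first-pass score times query_weight plus second-pass score times rescore_query_weight) and that hits outside the window keep their first-pass score and order; the hit list and the phrase bonuses are hypothetical, and the real per-shard execution is more involved.

```python
def rescore(hits, rescore_fn, window_size, query_weight=1.0, rescore_query_weight=1.0):
    """Re-rank only the top `window_size` hits.

    hits: list of (doc_id, first_pass_score) in any order.
    rescore_fn: returns the second-pass score for a doc id (the expensive part).
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * rescore_fn(doc_id))
        for doc_id, score in window
    ]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Hits below the window are never passed to rescore_fn.
    return rescored + rest

# Hypothetical first-pass match scores and proximity bonuses.
hits = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.1)]
phrase_bonus = {"d2": 0.5, "d3": 0.05, "d4": 1.0}

result = rescore(hits, lambda doc_id: phrase_bonus.get(doc_id, 0.0), window_size=3)
print([doc_id for doc_id, _ in result])  # ['d2', 'd1', 'd3', 'd4']
```

d4 would have earned the largest bonus, but it sits outside the window, so the expensive phrase check is never run for it. That is the trade-off rescoring makes.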