Learning the Elasticsearch Scoring Mechanism
Create an index
curl -s -XPUT 'http://localhost:9200/gino_test/' -d '{ "mappings": { "tweet": { "properties": { "text": { "type": "string", "term_vector": "with_positions_offsets_payloads", "store" : true, "analyzer" : "fulltext_analyzer" }, "fullname": { "type": "string", "term_vector": "with_positions_offsets_payloads", "analyzer" : "fulltext_analyzer" } } } }, "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }, "analysis": { "analyzer": { "fulltext_analyzer": { "type": "custom", "tokenizer": "whitespace", "filter": [ "lowercase", "type_as_payload" ] } } } } }'
Insert the test data:
_index | _type | _id | text | fullname |
---|---|---|---|---|
gino_test | tweet | 1 | hello world | gino zhang |
gino_test | tweet | 2 | gino like world cup | gino li |
gino_test | tweet | 3 | my cup | jsper li |
Simple case: scoring a single-field match
POST http://192.168.102.216:9200/gino_test/_search { "explain": true, "query": { "match": { "text": "my cup" } } }
Query result: score_example1.json https://drivenotepad.github.io/app/?state={%22action%22:%22open%22,%22ids%22:[%220B4dv03yigoV2VjFIbEI3ZFQwRlk%22]}
score(q,d) = queryNorm(q) · coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d))
- score(q,d) is the relevance score of document d for query q.
- queryNorm(q) is the query normalization factor (new).
- coord(q,d) is the coordination factor (new).
- The sum of the weights for each term t in the query q for document d.
- tf(t,d) is the term frequency for term t in document d.
- idf(t) is the inverse document frequency for term t.
- t.getBoost() is the boost that has been applied to the query (new).
- norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any. (new).
You should recognize score, tf, and idf. The queryNorm, coord, t.getBoost, and norm are new.
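Note: the statistics used in this calculation (document counts and document frequencies, for example) come from the shard that holds the document, not from the whole index.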
score(q,d) = _score(q,d.f) --------- ①
= queryNorm(q) · coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d))
= coord(q,d) · ∑ (tf(t,d) · idf(t)² · t.getBoost() · norm(t,d) · queryNorm(q))
= coord(q,d.f) · ∑ _score(q.ti, d.f) [ti in q] --------- ②
= coord(q,d.f) · (_score(q.t1, d.f) + _score(q.t2, d.f))
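- ① The relevance score is really computed between the query and one field of a document, not between the query and the document as a whole;
- ② Rearranging the formula turns the score into a sum, over every term in the query, of that term's relevance to the field; when some terms do not match, the coord factor accounts for them.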
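The formula can be traced by hand on the three `text` values from the test data. The following is a minimal sketch, assuming the Lucene classic TF-IDF definitions tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms) and queryNorm = 1 / sqrt(sum of squared idf), with every boost equal to 1. The function names are illustrative only, and the numbers will not match a real explain output exactly, because Lucene stores the field norm in a lossy one-byte encoding.

```python
import math

# Toy copy of the three `text` values from the test data.
DOCS = {
    1: "hello world",
    2: "gino like world cup",
    3: "my cup",
}

def tf(term, tokens):
    # Assumed classic similarity: square root of the raw term frequency.
    return math.sqrt(tokens.count(term))

def idf(term, docs):
    # Assumed classic similarity: 1 + ln(numDocs / (docFreq + 1)), shard-level counts.
    doc_freq = sum(1 for tokens in docs.values() if term in tokens)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def field_norm(tokens):
    # Field-length norm, ignoring Lucene's lossy one-byte encoding.
    return 1.0 / math.sqrt(len(tokens))

def score(query, doc_id, docs):
    q_terms = query.lower().split()
    tokens = docs[doc_id]
    # queryNorm = 1 / sqrt(sum of squared idf); all boosts are 1 here.
    query_norm = 1.0 / math.sqrt(sum(idf(t, docs) ** 2 for t in q_terms))
    matched = [t for t in q_terms if t in tokens]
    # coord = fraction of query terms found in this field.
    coord = len(matched) / len(q_terms)
    total = sum(tf(t, tokens) * idf(t, docs) ** 2 * field_norm(tokens) for t in matched)
    return query_norm * coord * total

# Whitespace + lowercase, mirroring the analyzer configured for the index.
TOKENIZED = {doc_id: text.lower().split() for doc_id, text in DOCS.items()}
for doc_id in TOKENIZED:
    print(doc_id, round(score("my cup", doc_id, TOKENIZED), 4))
```

Document 3 matches both terms, so its coord is 1; document 2 matches only `cup`, so its sum is halved by coord; document 1 matches nothing and scores 0.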
multi_match scoring across multiple fields (best_fields mode)
POST http://192.168.102.216:9200/gino_test/_search
{
"explain": true,
"query": {
"multi_match": {
"query": "gino cup",
"fields": [
"text^8",
"fullname^5"
]
}
}
}
Query result: score_example2.json
https://drivenotepad.github.io/app/?state={%22action%22:%22open%22,%22ids%22:[%220B4dv03yigoV2MTdsWHFqRGRsZUU%22]}
Scoring analysis:
score(q,d) = max(_score(q, d.fi)) = max(_score(q, d.f1), _score(q, d.f2))
= max(coord(q,d.f1) · (_score(q.t1, d.f1) + _score(q.t2, d.f1)), coord(q,d.f2) · (_score(q.t1, d.f2) + _score(q.t2, d.f2)))
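- In best_fields mode, multi_match scores the query against each field separately and then takes the max, so the highest-scoring field determines the document's score.
- The field's boost is multiplied in when computing the query weight, and again when computing fieldNorm.
- The default operator is or. With and, the scoring works the same way, but the set of matching documents differs.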
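The max step described above can be sketched in a few lines. This is only an illustration: the per-field scores are hypothetical numbers, not values from a real explain output, and the `tie_breaker` argument reflects the multi_match parameter of the same name, which defaults to 0 and therefore reduces to a plain max.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores: the best field wins, the rest count via tie_breaker."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others

# Hypothetical per-field scores for one document.
scores = {"text": 0.8, "fullname": 0.5}
print(best_fields_score(scores))        # plain max over the fields
print(best_fields_score(scores, 0.5))   # losing field contributes half its score
```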
multi_match scoring across multiple fields (cross_fields mode)
POST http://192.168.102.216:9200/gino_test/_search
{
"explain": true,
"query": {
"multi_match": {
"query": "gino cup",
"type": "cross_fields",
"fields": [
"text^8",
"fullname^5"
]
}
}
}
Query result: score_example3.json
https://drivenotepad.github.io/app/?state={%22action%22:%22open%22,%22ids%22:[%220B4dv03yigoV2OU40bWp1ZnlsT00%22]}
Scoring analysis:
score(q, d) = ∑ (_score(q.ti, d.f)) = ∑ (_score(q.t1, d.f), _score(q.t2, d.f))
= ∑ (max(coord(q.t1,d.f) · _score(q.t1, d.f1), coord(q.t1,d.f) · _score(q.t1, d.f2)), max(coord(q.t2,d.f) · _score(q.t2, d.f1), coord(q.t2,d.f) · _score(q.t2, d.f2)))
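- coord(q.t1,d.f) is the fraction of the fields in the multi-field query that match a given search term (such as gino); in best_fields mode, by contrast, coord(q,d.f1) is the fraction of all search terms (such as gino and cup) that appear in one particular field (such as text);
- In cross_fields mode, multi_match scores each query term separately (each term gets a best_fields-style score, that is, whichever field matches it best) and then sums the per-term scores.
- The default operator is or. With and, the scoring works the same way, but the set of matching documents differs. See score_example4.json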
https://drivenotepad.github.io/app/?state={%22action%22:%22open%22,%22ids%22:[%220B4dv03yigoV2SDFGSEFJNWVBZU0%22]}
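The difference from best_fields is the order of the two operations: cross_fields takes the max per term first and sums afterwards. A minimal sketch, using hypothetical per-term, per-field scores and leaving out the coord factors from the formula above:

```python
def cross_fields_score(term_field_scores):
    """Sum, over the query terms, the best score that any field gives each term."""
    return sum(
        max(per_field.values(), default=0.0)
        for per_field in term_field_scores.values()
    )

# Hypothetical scores for the query "gino cup" against one document.
scores = {
    "gino": {"text": 0.4, "fullname": 0.6},  # fullname is the better field for gino
    "cup": {"text": 0.7, "fullname": 0.0},   # only text contains cup
}
print(cross_fields_score(scores))
```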
Boosting scores with should clauses
To set up a test with filters, add a tags field to gino_test/tweet.
PUT /gino_test/_mapping/tweet
{
"properties": {
"tags": {
"type": "string",
"analyzer": "fulltext_analyzer"
}
}
}
Add the tag values:
_index | _type | _id | text | fullname | tags |
---|---|---|---|---|---|
gino_test | tweet | 1 | hello world | gino zhang | new, gino |
gino_test | tweet | 2 | gino like world cup | gino li | hobby, gino |
gino_test | tweet | 3 | my cup | jsper li | goods, jasper |
POST http://192.168.102.216:9200/gino_test/_search
{
"explain": true,
"query": {
"bool": {
"must": {
"bool": {
"must": {
"multi_match": {
"query": "gino cup",
"fields": [
"text^8",
"fullname^5"
],
"type": "best_fields",
"operator": "or"
}
},
"should": [
{
"term": {
"tags": {
"value": "goods",
"boost": 6
}
}
},
{
"term": {
"tags": {
"value": "hobby",
"boost": 3
}
}
}
]
}
}
}
}
}
Query result: score_example5.json https://drivenotepad.github.io/app/?state={%22action%22:%22open%22,%22ids%22:[%220B4dv03yigoV2TFZQREgzdHh2NmM%22]}
- Adding boosted should clauses adds one more scoring component; each clause is scored by the same calculation shown above.
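How the extra component enters the total can be sketched as follows. This assumes the classic similarity behaviour for a bool query, where the scores of the matching clauses are summed and multiplied by a coord factor (matching clauses divided by total clauses). The clause scores are hypothetical, and the helper name is illustrative only.

```python
def bool_score(clause_scores):
    """clause_scores: one entry per clause, None when the clause did not match."""
    if not clause_scores:
        return 0.0
    matched = [s for s in clause_scores if s is not None]
    # Assumed coord factor: fraction of clauses that matched.
    coord = len(matched) / len(clause_scores)
    return coord * sum(matched)

# Hypothetical clause scores: [multi_match (must), tags:goods (should), tags:hobby (should)]
print(bool_score([1.2, 0.6, None]))   # must plus one should clause matched
print(bool_score([1.2, None, None]))  # only the must clause matched
```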
Advanced scoring with function_score
DSL format:
{
"function_score": {
"query": {},
"boost": "boost for the whole query",
"functions": [
{
"filter": {},
"FUNCTION": {},
"weight": number
},
{
"FUNCTION": {}
},
{
"filter": {},
"weight": number
}
],
"max_boost": number,
"score_mode": "(multiply|max|...)",
"boost_mode": "(multiply|replace|...)",
"min_score" : number
}
}
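Five types of FUNCTION are supported:
- script_score: a custom scoring script for advanced cases; the fields involved must be numeric
- weight: a fixed multiplier, usually combined with a filter, meaning "multiply the score by this much when the condition holds"
- random_score: generates a random score, for example to shuffle the result order randomly per uid
- field_value_factor: lets the value of a field in the index influence the score, for example sales volume (the field must be numeric)
- decay functions: scores that decay with distance from a reference point, for example scoring places higher the closer they are to the city centre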
Let's run an experiment. First, add a view-count field to the index:
PUT /gino_test/_mapping/tweet
{
"properties": {
"views": {
"type": "long",
"doc_values": true,
"fielddata": {
"format": "doc_values"
}
}
}
}
Set the view count on each of the three documents:
POST gino_test/tweet/1/_update
{
"doc" : {
"views" : 56
}
}
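Only the request for document 1 is shown; the other two follow the same pattern. A small sketch that generates all three request bodies, using the view counts from the final data table (56, 21 and 68). The helper name is hypothetical, and the code only prints the requests rather than sending them.

```python
import json

# View counts for documents 1, 2 and 3, as listed in the final data table.
VIEWS = {1: 56, 2: 21, 3: 68}

def build_update_requests(views):
    """Return (path, body) pairs for the partial-update endpoint used above."""
    return [
        (f"gino_test/tweet/{doc_id}/_update", json.dumps({"doc": {"views": count}}))
        for doc_id, count in sorted(views.items())
    ]

for path, body in build_update_requests(VIEWS):
    print("POST", path, body)
```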
The final data looks like this:
_index | _type | _id | text | fullname | tags | views |
---|---|---|---|---|---|---|
gino_test | tweet | 1 | hello world | gino zhang | new, gino | 56 |
gino_test | tweet | 2 | gino like world cup | gino li | hobby, gino | 21 |
gino_test | tweet | 3 | my cup | jsper li | goods, jasper | 68 |
Run a query:
{
"explain": true,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "gino cup",
"type": "cross_fields",
"fields": [
"text^8",
"fullname^5"
]
}
},
"boost": 2,
"functions": [
{
"field_value_factor": {
"field": "views",
"factor": 1.2,
"modifier": "sqrt",
"missing": 1
}
},
{
"filter": {
"term": {
"tags": {
"value": "goods"
}
}
},
"weight": 4
}
],
"score_mode": "multiply",
"boost_mode": "multiply"
}
}
}
Query result: score_example6.json https://drivenotepad.github.io/app/?state={%22action%22:%22open%22,%22ids%22:[%220B4dv03yigoV2MlRXU0xUUkdDZEU%22]}
Scoring analysis:
score(q,d) = score_query(q,d) * (score_fvf(`views`) * score_filter(`tags:goods`))
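- score_mode sets how the scores of the individual FUNCTIONs are combined; note that different FUNCTIONs can produce results of very different magnitudes;
- boost_mode sets how the function score is combined with the query score; again, watch the magnitudes of the two values.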
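The function part of this formula can be evaluated directly for the three test documents. The sketch below assumes that field_value_factor with the sqrt modifier computes sqrt(factor * value), and that a weight function whose filter does not match is simply skipped under score_mode multiply. The query score passed in is a hypothetical number, and the top-level boost of 2 is left out, as it is in the formula above.

```python
import math

def field_value_factor(value, factor=1.2, missing=1.0):
    # Assumed behaviour of modifier "sqrt": sqrt(factor * field value).
    v = missing if value is None else value
    return math.sqrt(factor * v)

def function_factor(views, tags, weight=4.0):
    # score_mode "multiply": multiply the results of all functions that apply.
    factor = field_value_factor(views)
    if "goods" in tags:  # the weight function only applies when its filter matches
        factor *= weight
    return factor

def final_score(query_score, views, tags):
    # boost_mode "multiply": query score times the combined function score.
    return query_score * function_factor(views, tags)

DOCS = {
    1: (56, ["new", "gino"]),
    2: (21, ["hobby", "gino"]),
    3: (68, ["goods", "jasper"]),
}
for doc_id, (views, tags) in DOCS.items():
    # 1.0 is a placeholder query score, so the output is the function factor itself.
    print(doc_id, round(final_score(1.0, views, tags), 4))
```

With these inputs the function factor of document 3 is several times larger than the others, which illustrates the warning about magnitudes: a single weight or field value can easily dominate the query score.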
Rescoring with rescore
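Rescoring is not applied to every document. For example, to return the top 10 results, each shard first finds its own top 10 using the normal scoring rules and applies the rescore rules to just those documents. The shard then returns them to the coordinating node, which merges and sorts the results from all shards before returning them to the user.
rescore supports multiple rescoring rules, and can combine the new score with the original score (for example as a weighted sum).
Because rescore only scores a small number of documents, it should perform better. It does affect the global ordering, though, so take care when applying it in practice.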
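The per-shard behaviour can be sketched as follows. This is a simplified model, assuming the weighted-sum combination (original score times `query_weight` plus rescore score times `rescore_query_weight`) and a `window_size` that limits how many top hits are rescored; these names mirror the rescore request parameters, while `rescore_top` and `rescore_fn` are hypothetical helpers for illustration.

```python
def rescore_top(hits, rescore_fn, window_size=10,
                query_weight=1.0, rescore_query_weight=1.0):
    """hits: list of (doc_id, original_score). Only the top window is rescored."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * rescore_fn(doc_id))
        for doc_id, score in window
    ]
    rescored.sort(key=lambda h: h[1], reverse=True)
    # Documents outside the window keep their original score and position.
    return rescored + rest

# Hypothetical hits on one shard and a hypothetical second-pass score.
hits = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
extra = {"a": 0.0, "b": 5.0, "c": 9.0}
print(rescore_top(hits, lambda doc_id: extra[doc_id], window_size=2))
```

In this example `c` would have the highest combined score, but it falls outside the window and is never rescored, which is the global-ordering caveat mentioned above.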