Elasticsearch Scoring Mechanism Explained
1. Scoring mechanism in detail
1.1. Scoring with TF/IDF
1.1.1 Algorithm overview
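A relevance score measures how well the text in an indexed document matches the search text.
Elasticsearch's scoring builds on the term frequency/inverse document frequency algorithm, TF/IDF for short: TF is term frequency and IDF is inverse document frequency. Since version 5.0 the default similarity is BM25, a TF/IDF variant, which is why the explain output in 1.1.2 shows BM25-style formulas.
Term frequency: how many times each term of the search text appears in the field text. The more often it appears, the more relevant the document.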
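Example: search request: hello world
doc1 : hello you and me, and world is very good.
doc2 : hello, how are you
doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.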
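As a rough illustration of term frequency, the sketch below counts how often each query term occurs in the two example documents. The `tokenize` helper is a hypothetical stand-in for an analyzer (lowercase, split on non-letters); it is an assumption made for this example and not Elasticsearch's actual analysis chain.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase and split on non-letters.
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(query, doc):
    # Count how many times each query term occurs in the document.
    counts = Counter(tokenize(doc))
    return {term: counts[term] for term in tokenize(query)}

doc1 = "hello you and me, and world is very good."
doc2 = "hello, how are you"

tf_doc1 = term_frequencies("hello world", doc1)
tf_doc2 = term_frequencies("hello world", doc2)
print(tf_doc1)  # {'hello': 1, 'world': 1}
print(tf_doc2)  # {'hello': 1, 'world': 0}
```

doc1 matches both query terms and doc2 only one, so doc1 would rank higher on term frequency alone.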
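Inverse document frequency: how many documents in the whole index contain each term of the search text. The more documents contain a term, the less that term contributes to relevance.
Example: search request: hello world
doc1 : hello, today is very good
doc2 : hi world, how are you
Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.
doc2 is more relevant, because it matches world, the rarer term.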
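The effect can be checked numerically. The sketch below plugs the example's document counts into the idf formula that appears in the explain output in 1.1.2, `log(1 + (N - n + 0.5) / (n + 0.5))`. The function name `bm25_idf` is chosen for this example only.

```python
import math

def bm25_idf(n, N):
    # idf formula as printed in the explain output:
    # log(1 + (N - n + 0.5) / (n + 0.5))
    # n = documents containing the term, N = total documents with the field
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 100_000_000                  # documents in the index
idf_hello = bm25_idf(1_000, N)   # hello appears in 1,000 documents
idf_world = bm25_idf(100, N)     # world appears in 100 documents

print(idf_hello < idf_world)  # True
```

Because world is ten times rarer than hello, a match on world receives the larger idf weight.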
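Field-length norm: the longer the field, the less weight a match in it carries.
Example: search request: hello world
doc1 : {"title":"hello article","content":"balabalabal (10,000 words)"}
doc2 : {"title":"my article","content":"balabalabal (10,000 words), world"}
doc1 is more relevant: hello matches in the short title field, while world matches only in the very long content field.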
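A small sketch of how field length dampens a match, assuming the tf factor printed in the explain output in 1.1.2, `freq / (freq + k1 * (1 - b + b * dl / avgdl))`. The average field length of 100 and the two field lengths are made-up values for illustration; in a real index each field has its own average length.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf factor as printed in the explain output:
    # freq / (freq + k1 * (1 - b + b * dl / avgdl))
    # dl = length of this field, avgdl = average length of the field
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

avgdl = 100.0  # assumed average field length, for illustration only
short_field = bm25_tf(freq=1, dl=2, avgdl=avgdl)       # e.g. a 2-word title
long_field = bm25_tf(freq=1, dl=10_000, avgdl=avgdl)   # e.g. a 10,000-word body

print(short_field > long_field)  # True
```

With the same single occurrence, the match in the short field scores far higher than the match in the long one.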
1.1.2 How _score is calculated
GET /book/_search?explain=true
{
"query": {
"match": {
"description": "java程式設計師"
}
}
}
Response:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 2.137549,
    "hits" : [
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.137549,
        "_source" : {
          "name" : "spring開發基礎",
          "description" : "spring 在java領域非常流行,java程式設計師都在用。",
          "studymodel" : "201001",
          "price" : 88.6,
          "timestamp" : "2019-08-24 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [ "spring", "java" ]
        },
        "_explanation" : {
          "value" : 2.137549,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.7936629,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.7936629,
                  "description" : "score(freq=2.0), product of:",
                  "details" : [
                    { "value" : 2.2, "description" : "boost", "details" : [ ] },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        { "value" : 2, "description" : "n, number of documents containing term", "details" : [ ] },
                        { "value" : 3, "description" : "N, total number of documents with field", "details" : [ ] }
                      ]
                    },
                    {
                      "value" : 0.7675597,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        { "value" : 2.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                        { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                        { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                        { "value" : 12.0, "description" : "dl, length of field", "details" : [ ] },
                        { "value" : 35.333332, "description" : "avgdl, average length of field", "details" : [ ] }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3438859,
              "description" : "weight(description:程式設計師 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.3438859,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    { "value" : 2.2, "description" : "boost", "details" : [ ] },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        { "value" : 1, "description" : "n, number of documents containing term", "details" : [ ] },
                        { "value" : 3, "description" : "N, total number of documents with field", "details" : [ ] }
                      ]
                    },
                    {
                      "value" : 0.6227967,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                        { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                        { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                        { "value" : 12.0, "description" : "dl, length of field", "details" : [ ] },
                        { "value" : 35.333332, "description" : "avgdl, average length of field", "details" : [ ] }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.57961315,
        "_source" : {
          "name" : "java程式設計思想",
          "description" : "java語言是世界第一程式語言,在軟體開發領域使用人數最多。",
          "studymodel" : "201001",
          "price" : 68.6,
          "timestamp" : "2019-08-25 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [ "java", "dev" ]
        },
        "_explanation" : {
          "value" : 0.57961315,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.57961315,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.57961315,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    { "value" : 2.2, "description" : "boost", "details" : [ ] },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        { "value" : 2, "description" : "n, number of documents containing term", "details" : [ ] },
                        { "value" : 3, "description" : "N, total number of documents with field", "details" : [ ] }
                      ]
                    },
                    {
                      "value" : 0.56055,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                        { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                        { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                        { "value" : 19.0, "description" : "dl, length of field", "details" : [ ] },
                        { "value" : 35.333332, "description" : "avgdl, average length of field", "details" : [ ] }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
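To see how the pieces of the explanation combine, the sketch below recomputes the score of document 3 from the values in the response: each term's score is boost * idf * tf, and the document score is the sum over the matched terms. The function names are chosen for this example; the formulas and numbers are copied from the explain output. Elasticsearch computes in single precision, so the last digits may differ slightly.

```python
import math

def idf(n, N):
    # log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(boost, n, N, freq, dl, avgdl):
    # "score(freq=...), product of:" boost, idf and tf
    return boost * idf(n, N) * tf(freq, dl, avgdl)

# Values copied from the explanation of document 3.
java = term_score(boost=2.2, n=2, N=3, freq=2.0, dl=12.0, avgdl=35.333332)
programmer = term_score(boost=2.2, n=1, N=3, freq=1.0, dl=12.0, avgdl=35.333332)
total = java + programmer  # "sum of:" the per-term weights

print(round(java, 4), round(programmer, 4), round(total, 4))
```

The three values agree with 0.7936629, 1.3438859 and 2.137549 in the response to within rounding.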
1.1.3 Analyzing how a single document was matched
GET /book/_explain/3
{
"query": {
"match": {
"description": "java程式設計師"
}
}
}
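1.2. Doc values
Searching relies on the inverted index. Sorting relies on a forward index, which exposes every field of every document so the values can be compared. This forward index is what Elasticsearch calls doc values.
When a document is indexed, Elasticsearch builds the inverted index for searching and also builds the forward index, that is, doc values, for sorting, aggregations, filtering, and similar operations.
Doc values are stored on disk. If enough memory is available, the OS caches them automatically, so performance stays high; if memory is short, the OS evicts them from the cache and they are read back from disk.
Inverted index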
doc1: hello world you and me
doc2: hi, world, how are you
term | doc1 | doc2 |
---|---|---|
hello | * | |
world | * | * |
you | * | * |
and | * | |
me | * | |
hi | | * |
how | | * |
are | | * |
At search time:
hello you --> hello, you
hello --> doc1
you --> doc1,doc2
doc1: hello world you and me
doc2: hi, world, how are you
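Sorting (sort by) is a problem here: the inverted index maps terms to documents, so it cannot efficiently return a field's value for a given document.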
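The lookup above can be sketched in a few lines. This is a minimal in-memory model, not how Lucene stores its index; the regular-expression tokenizer is an assumption standing in for a real analyzer.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}

def build_inverted_index(docs):
    # Map each term to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Return every document that matches at least one query term.
    hits = set()
    for term in re.findall(r"[a-z]+", query.lower()):
        hits |= index.get(term, set())
    return sorted(hits)

index = build_inverted_index(docs)
print(sorted(index["hello"]))      # ['doc1']
print(sorted(index["you"]))        # ['doc1', 'doc2']
print(search(index, "hello you"))  # ['doc1', 'doc2']
```

The structure answers "which documents contain this term" directly, but offers no direct way to ask "what is the value of this field in doc1", which is what sorting needs.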
Forward index
doc1: { "name": "jack", "age": 27 }
doc2: { "name": "tom", "age": 30 }
document | name | age |
---|---|---|
doc1 | jack | 27 |
doc2 | tom | 30 |
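A minimal sketch of why a document-to-value mapping makes sorting straightforward. The dictionary-based `build_doc_values` helper is hypothetical and only models the idea; real doc values are a compressed on-disk columnar structure.

```python
docs = {
    "doc1": {"name": "jack", "age": 27},
    "doc2": {"name": "tom", "age": 30},
}

def build_doc_values(docs, field):
    # Column-oriented view: document ID -> value of one field.
    return {doc_id: source[field] for doc_id, source in docs.items()}

age_values = build_doc_values(docs, "age")
hits = ["doc1", "doc2"]  # document IDs returned by the inverted-index lookup

# Sorting only needs one lookup per hit in the age column.
by_age_desc = sorted(hits, key=lambda doc_id: age_values[doc_id], reverse=True)
print(by_age_desc)  # ['doc2', 'doc1']
```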
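1.3. Query phase
1.3.1 Query phase workflow
(1) The search request is sent to a coordinating node, which builds a priority queue of length from + size, as set by the paging parameters; size defaults to 10.
(2) The coordinating node forwards the request to one copy of every shard. Each shard runs the search locally and builds its own local priority queue.
(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.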
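The three steps above can be sketched as follows. This is a simplified model under stated assumptions: shards are plain lists of hypothetical (doc_id, score) pairs, and the function names are invented for the example.

```python
import heapq

def shard_top_hits(shard_docs, from_, size):
    # Step 2: each shard keeps only its local top (from + size) hits by score.
    return heapq.nlargest(from_ + size, shard_docs, key=lambda hit: hit[1])

def query_phase(shards, from_=0, size=10):
    # Step 3: the coordinating node merges the per-shard queues into a
    # global ranking and keeps the requested page [from, from + size).
    merged = []
    for shard_docs in shards:
        merged.extend(shard_top_hits(shard_docs, from_, size))
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[from_:from_ + size]

# Hypothetical (doc_id, score) pairs on two shards.
shards = [
    [("a", 0.9), ("b", 0.4), ("c", 0.7)],
    [("d", 0.8), ("e", 0.1), ("f", 0.6)],
]
page = query_phase(shards, from_=1, size=2)
print(page)  # [('d', 0.8), ('c', 0.7)]
```

Each shard has to return from + size entries rather than just size, because the coordinating node cannot know in advance which shard holds the hits for the requested page. This is also why deep paging becomes expensive.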
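1.3.2 How replica shards improve search throughput
A single request must hit one copy (primary or replica) of every shard. If each shard has several replicas, concurrent search requests can be spread across the different copies at the same time.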
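1.4. Fetch phase
1.4.1 Fetch phase workflow
(1) Once the coordinating node has built the priority queue, it sends mget (multi-get) requests to the shards to fetch the corresponding documents.
(2) Each shard returns its documents to the coordinating node.
(3) The coordinating node merges the documents and returns the result to the client.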
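A simplified sketch of the fetch steps above, assuming the query phase has produced a ranked list of (shard, doc_id) pairs. The `shard_store` dictionary and the function name are hypothetical stand-ins for the shards and the multi-get round trip.

```python
def fetch_phase(winners, shard_store):
    # Step 1: group the winning doc IDs by shard, one multi-get per shard.
    ids_by_shard = {}
    for shard_id, doc_id in winners:
        ids_by_shard.setdefault(shard_id, []).append(doc_id)

    # Step 2: each shard returns the full documents for its IDs.
    fetched = {}
    for shard_id, doc_ids in ids_by_shard.items():
        for doc_id in doc_ids:
            fetched[(shard_id, doc_id)] = shard_store[shard_id][doc_id]

    # Step 3: the coordinating node returns the documents in ranked order.
    return [fetched[key] for key in winners]

shard_store = {
    0: {"a": {"title": "A"}, "c": {"title": "C"}},
    1: {"d": {"title": "D"}},
}
winners = [(1, "d"), (0, "c")]  # ranked (shard, doc_id) pairs from the query phase
results = fetch_phase(winners, shard_store)
print(results)  # [{'title': 'D'}, {'title': 'C'}]
```

Only the documents that made it onto the final page are fetched, which keeps the amount of data moved between nodes small.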
1.4.2 By default, a search without from and size returns the top 10 hits, sorted by _score
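1.5. Summary of search parameters
1) preference
Determines which shard copies are used to execute the search.
_primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3 (the accepted values depend on the Elasticsearch version; _primary and _primary_first were removed in 7.0)
The bouncing results problem: two documents have the same value in the sort field, and different shard copies may order them differently. Requests are sent round-robin to different replica shards, so the order of the results on the page changes from one request to the next. This is called bouncing results.
At search time, requests are distributed round-robin across the copies of each shard (the primary and its replicas), and the order of documents can differ from copy to copy.
The fix is to set preference to an arbitrary string, such as the user_id, so that every search by the same user is executed on the same shard copies and the results no longer bounce.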
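The idea behind a custom preference string can be modeled as below. This is an illustrative sketch only: the copy names are hypothetical, and `zlib.crc32` is a stand-in hash chosen for the example, not the function Elasticsearch uses internally.

```python
import zlib
from itertools import count

copies = ["primary", "replica-1", "replica-2"]  # hypothetical copies of one shard

_round_robin = count()

def pick_copy_round_robin():
    # Without a preference, successive requests rotate across the copies.
    return copies[next(_round_robin) % len(copies)]

def pick_copy_with_preference(preference):
    # With a custom preference string, the same string maps to the same copy.
    # crc32 is a stand-in hash, not what Elasticsearch uses internally.
    return copies[zlib.crc32(preference.encode("utf-8")) % len(copies)]

rotating = [pick_copy_round_robin() for _ in range(3)]
sticky = {pick_copy_with_preference("user_42") for _ in range(3)}
print(rotating)     # ['primary', 'replica-1', 'replica-2']
print(len(sticky))  # 1
```

Three requests without a preference touch three different copies, while three requests with the same preference string always land on one copy, so ties are broken the same way every time.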
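2) timeout
Limits how long the search may run: when the time is up, the partial results gathered so far are returned instead of letting the query run too long.
3) routing
Document routing. By default a document is routed by its _id; with routing=user_id, all data belonging to the same user lands on the same shard.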
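Routing boils down to hashing the routing value and taking it modulo the number of primary shards. The sketch below shows that idea with `zlib.crc32` as a stand-in hash; this is an assumption for illustration, since Elasticsearch uses its own hash function, and the shard count of 5 is made up.

```python
import zlib

def route_to_shard(routing, num_primary_shards):
    # shard = hash(routing) % number_of_primary_shards
    # crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# Two documents indexed with the same routing value (a hypothetical user ID).
shard_for_order_1 = route_to_shard("user_42", 5)
shard_for_order_2 = route_to_shard("user_42", 5)
print(shard_for_order_1 == shard_for_order_2)  # True
```

Because the result depends on the number of primary shards, that number cannot be changed freely after the index is created without re-routing the existing documents.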
4) search_type
default: query_then_fetch
dfs_query_then_fetch: can improve the accuracy of relevance sorting, because term statistics are collected from all shards before scoring
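A small numeric sketch of why global statistics matter, using the idf formula from the explain output in 1.1.2. The per-shard counts are invented for the example: the same term is rare on one shard and common on the other.

```python
import math

def idf(n, N):
    # log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical per-shard statistics for one term:
# (documents containing the term, total documents on the shard)
shard_stats = [(1, 10), (8, 10)]

# query_then_fetch: each shard scores with its own local statistics.
local_idfs = [idf(n, N) for n, N in shard_stats]

# dfs_query_then_fetch: statistics are gathered from all shards first,
# so every shard scores with the same global numbers.
global_n = sum(n for n, _ in shard_stats)
global_N = sum(N for _, N in shard_stats)
global_idf = idf(global_n, global_N)

print(local_idfs[0] > global_idf > local_idfs[1])  # True
```

With local statistics, an identical document would score differently depending on which shard holds it. The extra round trip of dfs_query_then_fetch removes that skew at the cost of some latency; with large, evenly distributed indices the local statistics are usually close enough.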