Elasticsearch Scoring Mechanism Explained


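1. Scoring mechanism in detail

1.1. Scoring mechanism: TF/IDF

1.1.1 Algorithm overview

Put simply, the relevance score algorithm computes how closely the text stored in an index matches the search text.

Elasticsearch's relevance scoring is rooted in the term frequency/inverse document frequency algorithm, TF/IDF for short: TF is term frequency and IDF is inverse document frequency. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF built from the same ingredients.

Term frequency: how many times each term of the search text occurs in the field's text. The more occurrences, the more relevant the document.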

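Example: search request: hello world

doc1 : hello you and me,and world is very good.

doc2 : hello,how are you

doc1 contains both hello and world, while doc2 contains only hello, so doc1 is more relevant.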
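
The term-frequency comparison above can be sketched in a few lines of Python. This is only an illustration: `tokenize` and `term_frequency` are hypothetical helpers, and the naive lowercase word splitter is an assumption that stands in for Elasticsearch's real analyzers.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive analyzer (an assumption): lowercase, then keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(query, field_text):
    """Count how often each query term occurs in the field text."""
    counts = Counter(tokenize(field_text))
    return {term: counts[term] for term in tokenize(query)}

doc1 = "hello you and me,and world is very good."
doc2 = "hello,how are you"

tf_doc1 = term_frequency("hello world", doc1)
tf_doc2 = term_frequency("hello world", doc2)
print("doc1:", tf_doc1)
print("doc2:", tf_doc2)
```

Only doc1 has a non-zero count for both terms, which is why it ranks higher on term frequency alone.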

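Inverse document frequency: how many documents across the whole index contain each term of the search text. The more documents contain a term, the less that term contributes to relevance.

Example: search request: hello world

doc1 : hello ,today is very good

doc2 : hi world ,how are you

Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.

doc2 is more relevant: world is the rarer term, so a match on it carries more weight than a match on hello.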
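
To put rough numbers on this example, the sketch below uses the idf formula that the explain output in section 1.1.2 prints, `log(1 + (N - n + 0.5) / (n + 0.5))`. The function name `idf` is illustrative, and the document counts are the ones assumed in the example above.

```python
import math

def idf(total_docs, docs_with_term):
    """idf in the form shown by the explain output:
    log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

N = 100_000_000              # documents in the index (from the example)
idf_hello = idf(N, 1000)     # hello occurs in 1,000 documents
idf_world = idf(N, 100)      # world occurs in 100 documents

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
```

The rarer term gets the larger idf, so the document that matches world is favoured.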

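Field-length norm: the longer the field, the less weight a match inside it carries.


Example: search request: hello world

doc1 : {"title":"hello article","content":"balabalabal ... (10,000 words)"}

doc2 : {"title":"my article","content":"balabalabal ... (10,000 words),world"}

Assuming hello and world are equally rare in the index, doc1 is more relevant: hello matches in the short title field, whereas world matches in the much longer content field.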
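
A rough sketch of how field length dampens a match, using the tf formula from the explain output in section 1.1.2. The field lengths and the shared average length below are hypothetical; in a real index each field has its own average length.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf component as printed by explain:
    freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

avgdl = 100.0  # hypothetical average field length, in terms

# One occurrence in a 2-term field versus one occurrence in a 10,000-term field.
short_field = bm25_tf(freq=1, dl=2, avgdl=avgdl)
long_field = bm25_tf(freq=1, dl=10_000, avgdl=avgdl)

print(f"short field: {short_field:.4f}")
print(f"long field:  {long_field:.4f}")
```

The same single occurrence scores far lower in the long field, which is the field-length norm at work.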

1.1.2 How _score is calculated

GET /book/_search?explain=true
{
  "query": {
    "match": {
      "description": "java程式設計師"
    }
  }
}


{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.137549,
    "hits" : [
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.137549,
        "_source" : {
          "name" : "spring開發基礎",
          "description" : "spring 在java領域非常流行,java程式設計師都在用。",
          "studymodel" : "201001",
          "price" : 88.6,
          "timestamp" : "2019-08-24 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [ ... ]
        },
        "_explanation" : {
          "value" : 2.137549,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.7936629,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.7936629,
                  "description" : "score(freq=2.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.7675597,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 2.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 12.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3438859,
              "description" : "weight(description:程式設計師 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.3438859,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.6227967,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 12.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.57961315,
        "_source" : {
          "name" : "java程式設計思想",
          "description" : "java語言是世界第一程式語言,在軟體開發領域使用人數最多。",
          "studymodel" : "201001",
          "price" : 68.6,
          "timestamp" : "2019-08-25 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [ ... ]
        },
        "_explanation" : {
          "value" : 0.57961315,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.57961315,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.57961315,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.56055,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 19.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

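The numbers in the explain output can be checked by hand. The sketch below plugs the reported values (boost, n, N, freq, k1, b, dl, avgdl) into the two formulas quoted in the output and multiplies the parts as the "product of" lines describe. The function and variable names are illustrative, and small differences in the last digits are expected because Elasticsearch computes in single precision.

```python
import math

K1, B = 1.2, 0.75     # k1 and b as reported by explain
BOOST = 2.2           # boost as reported by explain
AVGDL = 35.333332     # avgdl as reported by explain

def idf(N, n):
    """log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, dl, avgdl):
    """freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + K1 * (1 - B + B * dl / avgdl))

def term_score(freq, n, N, dl, avgdl=AVGDL):
    """Per-term score: boost * idf * tf."""
    return BOOST * idf(N, n) * tf(freq, dl, avgdl)

# Document 3: the term "java" occurs twice, the term "程式設計師" once.
doc3_java = term_score(freq=2.0, n=2, N=3, dl=12.0)
doc3_second_term = term_score(freq=1.0, n=1, N=3, dl=12.0)
doc3_total = doc3_java + doc3_second_term   # "sum of:" in the output

# Document 2: only "java" matches, once, in a longer field.
doc2_total = term_score(freq=1.0, n=2, N=3, dl=19.0)

print(f"doc 3: {doc3_java:.7f} + {doc3_second_term:.7f} = {doc3_total:.6f}")
print(f"doc 2: {doc2_total:.8f}")
```

Document 3 wins on three counts: it matches both terms, java occurs twice in it, and its description field is shorter (dl 12 versus 19).
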
1.1.3 Analyzing how a single document was matched

GET /book/_explain/3
{
  "query": {
    "match": {
      "description": "java程式設計師"
    }
  }
}

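1.2. Doc values

Searching relies on the inverted index. Sorting relies on a forward index: Elasticsearch has to look at the value of a field for every matching document and then sort on it. That forward index is what doc values are.

When a document is indexed, Elasticsearch builds the inverted index for searching and also builds the forward index, the doc values, for sorting, aggregations, filtering, and similar operations.

Doc values are stored on disk. If there is enough memory, the OS caches them automatically and performance stays high; if memory runs short, the OS evicts them from its cache and they are read back from disk when needed.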


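Inverted index:

doc1: hello world you and me

doc2: hi, world, how are you

term    doc1  doc2
hello   *
world   *     *
you     *     *
and     *
me      *
hi            *
how           *
are           *


The search text hello you is analyzed into the terms hello and you:

hello --> doc1

you --> doc1,doc2

Matching documents:

doc1: hello world you and me

doc2: hi, world, how are you

sort by is where this structure falls short: the inverted index maps terms to documents, so it cannot directly answer what value a given document holds in a given field.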


Forward index (doc values):

doc1: { "name": "jack", "age": 27 }

doc2: { "name": "tom", "age": 30 }

document  name  age
doc1      jack  27
doc2      tom   30
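
The two structures can be contrasted with a toy model. This is a conceptual sketch only: plain Python dictionaries stand in for Lucene's on-disk formats, and the names `inverted` and `doc_values` are illustrative.

```python
from collections import defaultdict

docs = {
    "doc1": {"name": "jack", "age": 27},
    "doc2": {"name": "tom", "age": 30},
}

# Inverted index: term -> set of documents containing it (used for search).
inverted = defaultdict(set)
for doc_id, source in docs.items():
    inverted[source["name"]].add(doc_id)

# Forward index / doc values: field -> document -> value (used for sorting).
doc_values = defaultdict(dict)
for doc_id, source in docs.items():
    for field, value in source.items():
        doc_values[field][doc_id] = value

# Search with the inverted index, then sort the hits with the doc values.
hits = inverted["jack"] | inverted["tom"]
sorted_hits = sorted(hits, key=lambda d: doc_values["age"][d], reverse=True)
print(sorted_hits)
```

Sorting needs a per-document lookup of the age value, which only the forward index provides without scanning every term.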

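1.3. Query phase

1.3.1 Query phase workflow

(1) The search request arrives at a coordinating node, which builds a priority queue whose length is from + size, as set by the paging parameters. With the defaults (from 0, size 10) the length is 10.

(2) The coordinating node forwards the request to one copy (primary or replica) of every shard. Each shard searches locally and builds its own local priority queue.

(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.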
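
The three steps above can be simulated in a few lines. This is a simplified sketch: each shard is a hypothetical in-memory list of `(score, doc_id)` pairs, and `shard_query` and `coordinate` are illustrative names, not Elasticsearch APIs.

```python
import heapq

def shard_query(shard_hits, from_, size):
    """Step 2: a shard keeps only its local top (from + size) hits."""
    return heapq.nlargest(from_ + size, shard_hits)

def coordinate(shards, from_=0, size=10):
    """Steps 1 and 3: collect the local queues and merge them globally."""
    local_queues = [shard_query(hits, from_, size) for hits in shards]
    merged = heapq.nlargest(
        from_ + size, (hit for queue in local_queues for hit in queue)
    )
    # Only the requested page survives; earlier pages are discarded.
    return merged[from_:from_ + size]

# Hypothetical (score, doc_id) hits on three shards.
shards = [
    [(2.1, "a"), (0.4, "b"), (1.7, "c")],
    [(3.0, "d"), (0.9, "e")],
    [(1.2, "f"), (2.5, "g"), (0.1, "h")],
]

page = coordinate(shards, from_=1, size=3)
print(page)
```

Every shard has to return from + size entries because the coordinating node cannot know in advance which shard holds the globally best hits. This is also why deep paging is expensive.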

1.3.2 How replica shards improve search throughput


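1.4. Fetch phase

1.4.1 Fetch phase workflow

(1) Once the coordinating node has built the global priority queue, it sends mget (multi-get) requests to the shards to retrieve the corresponding documents.

(2) Each shard returns its documents to the coordinating node.

(3) The coordinating node merges the documents and returns the result to the client.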
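
A matching sketch for the fetch phase, under the same simplifying assumptions: `shard_store` is a hypothetical dictionary standing in for the shards, and `fetch` is an illustrative name.

```python
def fetch(global_queue, shard_store):
    """global_queue: [(score, shard_id, doc_id), ...] already in final order.
    shard_store: {shard_id: {doc_id: source}}, a stand-in for the shards."""
    # Step 1: group the wanted IDs by shard, one multi-get per shard.
    ids_by_shard = {}
    for _score, shard_id, doc_id in global_queue:
        ids_by_shard.setdefault(shard_id, []).append(doc_id)

    # Step 2: each shard returns the documents it was asked for.
    fetched = {}
    for shard_id, doc_ids in ids_by_shard.items():
        for doc_id in doc_ids:
            fetched[(shard_id, doc_id)] = shard_store[shard_id][doc_id]

    # Step 3: reassemble in the order fixed by the query phase.
    return [
        {"_id": doc_id, "_score": score, "_source": fetched[(shard_id, doc_id)]}
        for score, shard_id, doc_id in global_queue
    ]

shard_store = {
    0: {"3": {"name": "doc three"}},
    1: {"2": {"name": "doc two"}, "7": {"name": "doc seven"}},
}
global_queue = [(2.1, 0, "3"), (1.4, 1, "7"), (0.6, 1, "2")]

results = fetch(global_queue, shard_store)
for hit in results:
    print(hit)
```

The query phase moves only IDs and scores; full documents travel once, and only for the hits on the requested page.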


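1.5. Summary of search parameters



preference: determines which shard copies execute the search. Values: _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3 (the _primary and _primary_first values were removed in Elasticsearch 7.0).

The bouncing results problem: two documents have the same value in the sort field, and different shard copies may order them differently. Because successive requests are spread round-robin across different replica shards, the order of the results on the page can change with every request. These are bouncing results.

At search time, requests are distributed round-robin over the copies of each shard (the replicas and the primary), and the same documents may be ordered differently on different copies.

The fix is to set preference to an arbitrary string, such as the user_id, so that every search by the same user runs on the same shard copy and the user no longer sees bouncing results.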
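
The effect of a custom preference string can be modeled as follows. This is a toy model, not Elasticsearch's routing code: `ShardCopyPicker` is hypothetical, and `zlib.crc32` is an assumed stand-in for whatever hash Elasticsearch applies to the preference value.

```python
import itertools
import zlib

class ShardCopyPicker:
    """Chooses which copy of a shard serves a search request."""

    def __init__(self, copies):
        self.copies = list(copies)
        self._round_robin = itertools.cycle(self.copies)

    def pick(self, preference=None):
        if preference is None:
            # No preference: rotate through the copies.
            return next(self._round_robin)
        # Custom string: a stable hash always maps to the same copy.
        index = zlib.crc32(preference.encode("utf-8")) % len(self.copies)
        return self.copies[index]

picker = ShardCopyPicker(["primary", "replica1", "replica2"])

round_robin_picks = [picker.pick() for _ in range(4)]
sticky_picks = [picker.pick("user_42") for _ in range(4)]

print("no preference:", round_robin_picks)
print("preference=user_42:", sticky_picks)
```

Without a preference the serving copy changes from request to request; with one, the same user always lands on the same copy, so ties are broken the same way every time.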







search_type: setting it to dfs_query_then_fetch can improve the accuracy of relevance sorting.
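
Why this helps can be sketched with the idf formula from section 1.1.2. In the default mode each shard scores with its own local term statistics; dfs_query_then_fetch first gathers the statistics from all shards so that every shard scores with the same global values. The shard names and counts below are hypothetical.

```python
import math

def idf(N, n):
    """log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical per-shard statistics for one term:
# (documents in the shard, documents containing the term)
shard_stats = {"shard0": (1000, 5), "shard1": (1000, 200)}

# Default mode: each shard uses only its own statistics.
local_idf = {name: idf(N, n) for name, (N, n) in shard_stats.items()}

# dfs mode: statistics are summed across shards before scoring.
global_N = sum(N for N, _ in shard_stats.values())
global_n = sum(n for _, n in shard_stats.values())
global_idf = idf(global_N, global_n)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With skewed shards the same term is weighted very differently from shard to shard, so scores are not comparable when merged. Using the global value removes that skew at the cost of an extra round trip.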