Elasticsearch Scoring Mechanism Explained

阿新 • Published: 2020-11-02

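1. Scoring mechanism explained

1.1. Scoring mechanism: TF/IDF

1.1.1 Algorithm overview

Put simply, the relevance score algorithm computes how closely the text stored in an index matches the search text.

Elasticsearch's relevance scoring is built on the term frequency/inverse document frequency algorithm, TF/IDF for short: TF stands for term frequency and IDF for inverse document frequency. In Elasticsearch 5.0 and later the default similarity is BM25, a refinement of TF/IDF, which is why the explain output in 1.1.2 shows the parameters k1 and b.

Term frequency: how many times each term of the search text appears in the field text. The more often it appears, the more relevant the document.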

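Example: search request: hello world

doc1 : hello you and me,and world is very good.

doc2 : hello,how are you

doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.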
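The term-frequency idea can be sketched in a few lines of Python using the two example documents. The `tokenize` helper is a hypothetical stand-in for an Elasticsearch analyzer (it only lowercases and splits on non-letters), so treat this as an illustration rather than what the engine actually runs.

```python
import re
from collections import Counter

def tokenize(text):
    """Toy analyzer (an assumption): lowercase and split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(query, doc):
    """Count how often each query term occurs in the document."""
    counts = Counter(tokenize(doc))
    return {term: counts[term] for term in tokenize(query)}

doc1 = "hello you and me,and world is very good."
doc2 = "hello,how are you"

tf_doc1 = term_frequencies("hello world", doc1)
tf_doc2 = term_frequencies("hello world", doc2)
print(tf_doc1)  # {'hello': 1, 'world': 1}
print(tf_doc2)  # {'hello': 1, 'world': 0}
```

Only doc1 has a non-zero count for both terms, which is why it ranks higher on term frequency alone.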

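Inverse document frequency: how many documents in the whole index contain each term of the search text. The more documents contain a term, the less that term contributes to relevance.

Example: search request: hello world

doc1 : hello ,today is very good

doc2 : hi world ,how are you

Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.

doc2 is more relevant, because world is the rarer term and therefore carries more weight.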
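To put numbers on this example, the sketch below applies the idf formula that appears in the explain output of section 1.1.2, log(1 + (N - n + 0.5) / (n + 0.5)), to the document counts given above. The function name `bm25_idf` is made up for illustration.

```python
import math

def bm25_idf(total_docs, docs_with_term):
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

N = 100_000_000                 # documents in the index
idf_hello = bm25_idf(N, 1000)   # hello appears in 1,000 documents
idf_world = bm25_idf(N, 100)    # world appears in 100 documents

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
```

The rarer term (world) gets the larger idf, so a match on world contributes more to the score than a match on hello.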

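Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match in it.

Example: search request: hello world

doc1 : {"title":"hello article","content":"blah blah ... (10,000 words)"}

doc2 : {"title":"my article","content":"blah blah ... (10,000 words), world"}

doc1 is more relevant: hello matches in the short title field, while world matches in the 10,000-word content field.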
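A rough way to see the effect of field length is to reuse the tf formula printed in the explain output of section 1.1.2, freq / (freq + k1 * (1 - b + b * dl / avgdl)). In the sketch below the average field length `AVGDL` is an assumed value, and both fields are measured against the same average for simplicity; in a real index each field has its own statistics.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed in the explain output; dl is the field length in terms."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

AVGDL = 5000.0  # assumed average field length, for illustration only

short_field = bm25_tf(freq=1, dl=2, avgdl=AVGDL)       # one match in a two-word title
long_field = bm25_tf(freq=1, dl=10_001, avgdl=AVGDL)   # one match in a ~10,000-word body

print(f"short field tf = {short_field:.3f}")
print(f"long field tf  = {long_field:.3f}")
```

With the same single occurrence, the short field yields the larger tf component, which is the field-length norm at work.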

1.1.2 How _score is calculated

GET /book/_search?explain=true
{
  "query": {
    "match": {
      "description": "java程式設計師"
    }
  }
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.137549,
    "hits" : [
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.137549,
        "_source" : {
          "name" : "spring開發基礎",
          "description" : "spring 在java領域非常流行，java程式設計師都在用。",
          "studymodel" : "201001",
          "price" : 88.6,
          "timestamp" : "2019-08-24 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [
            "spring",
            "java"
          ]
        },
        "_explanation" : {
          "value" : 2.137549,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.7936629,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.7936629,
                  "description" : "score(freq=2.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.7675597,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 2.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 12.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3438859,
              "description" : "weight(description:程式設計師 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.3438859,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.6227967,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 12.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.57961315,
        "_source" : {
          "name" : "java程式設計思想",
          "description" : "java語言是世界第一程式語言，在軟體開發領域使用人數最多。",
          "studymodel" : "201001",
          "price" : 68.6,
          "timestamp" : "2019-08-25 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [
            "java",
            "dev"
          ]
        },
        "_explanation" : {
          "value" : 0.57961315,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.57961315,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.57961315,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.56055,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 19.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
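The numbers in this explain output can be checked by hand. The sketch below plugs the values reported for each term (boost, n, N, freq, dl, avgdl, k1, b) into the two formulas quoted in the output and multiplies the parts together. The function name is invented for illustration, and the boost of 2.2 is simply taken from the output as given.

```python
import math

def bm25_term_score(boost, n, N, freq, dl, avgdl, k1=1.2, b=0.75):
    """score = boost * idf * tf, using the formulas shown in the explain output."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

AVGDL = 35.333332  # avgdl reported in the explain output

# Document 3: both query terms match, so the per-term scores are summed.
java_doc3 = bm25_term_score(boost=2.2, n=2, N=3, freq=2.0, dl=12.0, avgdl=AVGDL)
second_term_doc3 = bm25_term_score(boost=2.2, n=1, N=3, freq=1.0, dl=12.0, avgdl=AVGDL)
score_doc3 = java_doc3 + second_term_doc3

# Document 2: only the term java matches.
score_doc2 = bm25_term_score(boost=2.2, n=2, N=3, freq=1.0, dl=19.0, avgdl=AVGDL)

print(round(score_doc3, 4))  # about 2.1375, in line with _score 2.137549
print(round(score_doc2, 4))  # about 0.5796, in line with _score 0.57961315
```

Small differences in the last digits are expected because the engine works with single-precision floats.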

1.1.3 Analyzing how a document is matched

GET /book/_explain/3
{
  "query": {
    "match": {
      "description": "java程式設計師"
    }
  }
}
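An explanation is a nested tree of value, description and details, which is hard to read at a glance. As a small aid, the sketch below flattens such a tree into indented lines. The `sample` dictionary is a hand-trimmed fragment of the explanation shown in section 1.1.2, and it is an assumption that the tree returned by this request has the same shape.

```python
def flatten_explanation(node, depth=0):
    """Yield (depth, value, description) for every node, depth-first."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)

# Hand-trimmed fragment of an explanation tree (descriptions shortened).
sample = {
    "value": 0.57961315,
    "description": "score(freq=1.0), product of:",
    "details": [
        {"value": 2.2, "description": "boost", "details": []},
        {"value": 0.47000363, "description": "idf", "details": [
            {"value": 2, "description": "n, number of documents containing term", "details": []},
            {"value": 3, "description": "N, total number of documents with field", "details": []},
        ]},
        {"value": 0.56055, "description": "tf", "details": []},
    ],
}

rows = list(flatten_explanation(sample))
for depth, value, description in rows:
    print("  " * depth + f"{value}  {description}")
```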

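1.2. Doc values

Searching relies on the inverted index. Sorting relies on a forward index, which lets Elasticsearch look up each field value of each document and then sort on it. This forward index is what Elasticsearch calls doc values.

At index time, Elasticsearch builds the inverted index for searching and also builds the forward index (doc values) for sorting, aggregations, filtering, and similar operations.

Doc values are stored on disk. If there is enough memory, the OS automatically caches them in memory (the filesystem cache) and performance stays high. If memory is insufficient, the OS evicts them from the cache and they are read from disk again when needed.

Inverted index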

doc1: hello world you and me

doc2: hi, world, how are you

term	doc1	doc2
hello	*
world	*	*
you	*	*
and	*
me	*
hi		*
how		*
are		*

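At search time:

hello you --> hello, you

hello --> doc1

you --> doc1,doc2

Both doc1 and doc2 match. Sorting them by a field value is where the inverted index falls short: it maps terms to documents, not documents to their field values.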

Forward index

doc1: { "name": "jack", "age": 27 }

doc2: { "name": "tom", "age": 30 }

document	name	age
doc1	jack	27
doc2	tom	30
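The two structures can be mimicked with plain Python dictionaries. This is only a conceptual sketch: the tokenizer is a toy, the `texts` and `fields` dictionaries pair up the two examples from this section under the same document ids (an assumption made for the demo), and real doc values are a compressed on-disk columnar format rather than a dict.

```python
import re
from collections import defaultdict

texts = {"doc1": "hello world you and me", "doc2": "hi, world, how are you"}
fields = {"doc1": {"name": "jack", "age": 27}, "doc2": {"name": "tom", "age": 30}}

# Inverted index: term -> ids of the documents that contain it.
inverted = defaultdict(set)
for doc_id, text in texts.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        inverted[term].add(doc_id)

# Forward index (one column per field): field -> {doc id -> value}.
forward = defaultdict(dict)
for doc_id, source in fields.items():
    for field, value in source.items():
        forward[field][doc_id] = value

def search(query):
    """Return ids of documents containing any query term."""
    hits = set()
    for term in re.findall(r"[a-z]+", query.lower()):
        hits |= inverted.get(term, set())
    return hits

def sort_by(doc_ids, field, reverse=False):
    """Sort matching ids by a field value looked up in the forward index."""
    return sorted(doc_ids, key=lambda d: forward[field][d], reverse=reverse)

hits = search("hello you")
print(sorted(hits))                        # ['doc1', 'doc2']
print(sort_by(hits, "age", reverse=True))  # ['doc2', 'doc1']
```

Matching uses only `inverted`; ordering uses only `forward`, which is the division of labor described above.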

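1.3. Query phase

1.3.1 Query phase

(1) The search request is sent to a coordinating node, which builds a priority queue whose length is from + size as given by the paging parameters; with the defaults (from=0, size=10) the length is 10.

(2) The coordinating node forwards the request to one copy (primary or replica) of every shard. Each shard searches locally and builds its own local priority queue.

(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.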
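The three steps above can be simulated with Python's `heapq`. The shard contents and scores below are invented for the demo, and the functions model only the bookkeeping (per-shard top from + size, then a global merge), not how Elasticsearch is implemented internally.

```python
import heapq

def shard_top_hits(shard_docs, from_, size):
    """Each shard keeps only its own best from + size hits as (score, doc_id)."""
    return heapq.nlargest(from_ + size, shard_docs)

def query_phase(shards, from_=0, size=10):
    """Coordinating-node side: merge the per-shard queues, then apply paging."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(shard_top_hits(shard_docs, from_, size))
    merged = heapq.nlargest(from_ + size, candidates)
    return merged[from_:from_ + size]

# Hypothetical (score, doc_id) pairs held by three shards.
shards = [
    [(2.1, "a"), (0.5, "b"), (1.7, "c")],
    [(1.9, "d"), (0.2, "e")],
    [(3.0, "f"), (0.9, "g"), (0.8, "h")],
]

page = query_phase(shards, from_=2, size=3)
print(page)  # [(1.9, 'd'), (1.7, 'c'), (0.9, 'g')]
```

Every shard has to return from + size candidates because the coordinating node cannot know in advance which shard holds the hits for the requested page. This is also why deep paging gets expensive.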

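1.3.2 How replica shards increase search throughput

Each request has to reach one copy (replica or primary) of every shard. If every shard has several replicas, search requests arriving concurrently can be served by different copies at the same time.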

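1.4. Fetch phase

1.4.1 Fetch phase workflow

(1) Once the coordinating node has built the priority queue, it sends mget requests to the shards that hold the matching documents to fetch them.

(2) Each shard returns its documents to the coordinating node.

(3) The coordinating node merges the documents and returns the result to the client.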
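A minimal simulation of these steps follows. The `shard_stores` dictionary and the ranked hit list are hypothetical demo data; the point is only that ids are grouped per shard, fetched, and then put back into ranking order.

```python
from collections import defaultdict

def fetch_phase(ranked_hits, shard_stores):
    """ranked_hits: (shard_id, doc_id) pairs in final ranking order.
    shard_stores: shard_id -> {doc_id -> source document}."""
    # Group ids by shard so each shard receives a single multi-get style request.
    ids_by_shard = defaultdict(list)
    for shard_id, doc_id in ranked_hits:
        ids_by_shard[shard_id].append(doc_id)

    # Each shard returns the documents it was asked for.
    fetched = {}
    for shard_id, doc_ids in ids_by_shard.items():
        store = shard_stores[shard_id]
        for doc_id in doc_ids:
            fetched[(shard_id, doc_id)] = store[doc_id]

    # The coordinating node reassembles the documents in ranking order.
    return [
        {"_id": doc_id, "_source": fetched[(shard_id, doc_id)]}
        for shard_id, doc_id in ranked_hits
    ]

shard_stores = {
    0: {"a": {"name": "doc a"}, "c": {"name": "doc c"}},
    1: {"d": {"name": "doc d"}},
}
ranked_hits = [(1, "d"), (0, "c"), (0, "a")]

results = fetch_phase(ranked_hits, shard_stores)
print([hit["_id"] for hit in results])  # ['d', 'c', 'a']
```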

1.4.2 Without from and size, a search returns the top 10 hits by default, sorted by _score

1.5. Search parameter summary

1) preference

Determines which shard copies are used to execute the search.

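_primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3 (the available values depend on the Elasticsearch version; _primary and _primary_first were removed in 7.0)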

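The bouncing results problem: when two documents have the same value in the sort field, their relative order may differ between copies of a shard. Because requests are sent round-robin to different replica shards, the order of the results on the page can change with every request. This is what "bouncing results" means.

At search time, requests are distributed round-robin across the copies of each shard (replicas and primary), and the same documents may be ordered differently on different copies.

The fix is to set preference to a string, such as the user_id, so that every search by the same user is executed on the same shard copies and the user no longer sees bouncing results.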
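Why a fixed preference string removes the bouncing can be illustrated with a toy copy selector. This is not Elasticsearch's actual selection logic: the copy names, the round-robin counter and the SHA-256 based hashing are all assumptions chosen to show the principle that the same string always maps to the same copy.

```python
import hashlib
import itertools

COPIES = ["primary", "replica-1", "replica-2"]  # hypothetical copies of one shard

_round_robin = itertools.cycle(range(len(COPIES)))

def pick_copy(preference=None):
    """Without a preference, rotate through the copies; with one, hash it to a fixed copy."""
    if preference is None:
        return COPIES[next(_round_robin)]
    digest = hashlib.sha256(preference.encode("utf-8")).digest()
    return COPIES[int.from_bytes(digest[:4], "big") % len(COPIES)]

no_pref = [pick_copy() for _ in range(3)]
with_pref = [pick_copy("user_42") for _ in range(3)]

print(no_pref)    # ['primary', 'replica-1', 'replica-2']
print(with_pref)  # the same copy three times
```

Because the user always lands on the same copy, ties between documents are broken the same way on every request.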

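2) timeout

Limits how long the search may run: when the time is up, the partial results gathered so far are returned directly, so that a query does not take too long.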

3) routing

Document routing. By default a document is routed by its _id; specifying routing=user_id instead sends all data belonging to the same user to the same shard.
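Routing is commonly described as shard = hash(routing) % number_of_primary_shards. The sketch below follows that description with CRC32 as a stand-in hash (Elasticsearch uses its own hash function), and the shard count and document ids are assumed values.

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    """Map a routing value to a shard number; CRC32 is a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

NUM_SHARDS = 5  # assumed number of primary shards

# Hypothetical (document id, user id) pairs, routed by user id.
docs = [("order-1", "user_1"), ("order-2", "user_1"), ("order-3", "user_2")]
placement = {doc_id: shard_for(user_id, NUM_SHARDS) for doc_id, user_id in docs}

print(placement)
```

Both documents of user_1 end up on the same shard, so a search that passes the same routing value only needs to visit that one shard.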

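4) search_type

default: query_then_fetch

dfs_query_then_fetch: can improve the accuracy of relevance sorting, because term statistics are first collected from all shards and scoring then uses the global values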
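The difference between the two search types comes down to which term statistics feed the idf. The sketch below compares a per-shard idf with one computed from combined statistics, using the idf formula from the explain output in section 1.1.2. The shard statistics are invented and deliberately skewed to make the effect visible.

```python
import math

def idf(N, n):
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical statistics for one term: shard -> (documents on shard, documents containing the term).
shard_stats = {"shard0": (1000, 10), "shard1": (1000, 400)}

# query_then_fetch: each shard scores with its own local statistics.
local_idf = {shard: idf(N, n) for shard, (N, n) in shard_stats.items()}

# dfs_query_then_fetch: statistics are summed across shards before scoring.
total_N = sum(N for N, _ in shard_stats.values())
total_n = sum(n for _, n in shard_stats.values())
global_idf = idf(total_N, total_n)

for shard, value in local_idf.items():
    print(f"{shard}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With local statistics the same term is weighted very differently on the two shards, so scores from different shards are not directly comparable; the global value removes that skew at the cost of an extra round trip.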
