Elasticsearch Tutorial: The cross-fields Strategy

阿新 • Published: 2018-11-19

For an index of the whole series, see: ElasticSearch Tutorial: Series Index

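A cross-fields search looks for a single identifier whose value is spread across several fields. A person is identified by a name, a building by its address. The name may be split across fields such as first_name and last_name, and the address across country, province, and city.

Searching for one such identifier across multiple fields, for example a person's name or an address, is a cross-fields search.

As a first attempt, most_fields looks like a better fit than best_fields. best_fields favors the doc whose single best-matching field scores highest, whereas a cross-fields search is by definition not about a single field.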

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }

GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query":       "Peter Smith",
      "type":        "most_fields",
      "fields":      [ "author_first_name", "author_last_name" ]
    }
  }
}

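Searching for Peter Smith matches Smith in the author_first_name of doc 2, and that match scores very high. Why?
Because its IDF is high. IDF is high when the matched term (Smith) is rare across all docs in that field, and in author_first_name Smith appears only once.
Doc 1 is the actual Peter Smith, with Smith in author_last_name. But Smith appears twice in author_last_name (docs 1 and 5), so the IDF contribution for doc 1 is lower.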

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article",
          "sub_title": "learning more courses",
          "author_first_name": "Peter",
          "author_last_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.51623213,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      }
    ]
  }
}

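Problem 1: most_fields only finds docs in which as many fields as possible match, not docs in which one field matches completely.

Problem 2: with most_fields there is no way to use minimum_should_match to trim the long tail, that is, results that match only a small part of the query.

Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents in that field is low and its score is high. Smith Williams may therefore rank ahead of Peter Smith.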
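The IDF imbalance behind problem 3 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: it treats the five authors as a single-shard corpus, tokenizes by lowercasing and splitting on whitespace, and uses the BM25-style IDF formula ln(1 + (N - n + 0.5) / (n + 0.5)). Real scores also depend on per-shard statistics, term frequency, and field length, so these numbers are not expected to match the response above. The names AUTHORS, doc_freq, and bm25_idf are made up for this example.

```python
import math

# The five authors from the bulk request: doc id -> (first_name, last_name).
AUTHORS = {
    "1": ("Peter", "Smith"),
    "2": ("Smith", "Williams"),
    "3": ("Jack", "Ma"),
    "4": ("Robbin", "Li"),
    "5": ("Tonny", "Peter Smith"),
}


def doc_freq(term, field_index):
    """Count docs whose chosen field contains the term (assumed tokenization)."""
    return sum(
        1
        for names in AUTHORS.values()
        if term.lower() in names[field_index].lower().split()
    )


def bm25_idf(n, total):
    """BM25-style IDF: the fewer docs contain the term, the larger the value."""
    return math.log(1 + (total - n + 0.5) / (n + 0.5))


total = len(AUTHORS)
idf_first = bm25_idf(doc_freq("Smith", 0), total)  # Smith as a first name
idf_last = bm25_idf(doc_freq("Smith", 1), total)   # Smith as a last name

print(f"idf(Smith, author_first_name) = {idf_first:.3f}")
print(f"idf(Smith, author_last_name)  = {idf_last:.3f}")
```

Under these assumptions Smith gets a larger IDF in author_first_name than in author_last_name, which is why a doc matching Smith only as a first name can outrank the real Peter Smith.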

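First approach: use copy_to to combine several fields into one
The root of the problem is that there are multiple fields. If an identifier that spans several fields can be merged into a single field, the problem goes away. For a person's name, for example, first_name and last_name can be merged into one full_name field.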

PUT /forum/_mapping/article
{
  "properties": {
      "new_author_first_name": {
          "type":     "string",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_last_name": {
          "type":     "string",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_full_name": {
          "type":     "string"
      }
  }
}

With copy_to, the values of several fields are copied into one field, and an inverted index is built on that field.

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"new_author_first_name" : "Peter", "new_author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"new_author_first_name" : "Smith", "new_author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"new_author_first_name" : "Jack", "new_author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"new_author_first_name" : "Robbin", "new_author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"new_author_first_name" : "Tonny", "new_author_last_name" : "Peter Smith"} }

The resulting new_author_full_name contents are: doc 1: Peter Smith; doc 2: Smith Williams; doc 3: Jack Ma; doc 4: Robbin Li; doc 5: Tonny Peter Smith.

GET /forum/article/_search
{
  "query": {
    "match": {
      "new_author_full_name":       "Peter Smith"
    }
  }
}

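Problem 1: most_fields only finds docs in which as many fields as possible match, not docs in which one field matches completely --> Solved: the best-matching document is returned first.

Problem 2: with most_fields there is no way to use minimum_should_match to trim the long tail, that is, results that match only a small part of the query
--> Solved: minimum_should_match can now be used to trim the long tail.

Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents in that field is low and its score is high, and Smith Williams may rank ahead of Peter Smith --> Solved: Smith and Peter now live in the same field, so their frequencies across all documents are balanced and there is no extreme skew.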
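To make the effect of the combined field concrete, here is a small simulation of what matching against one merged name field looks like. It is a sketch, not Elasticsearch itself: it assumes lowercase, whitespace-split tokens, ignores scoring entirely, and models minimum_should_match as a plain count of matched query terms. The helper names full_name_tokens and match_full_name are hypothetical.

```python
# Doc id -> (first_name, last_name), as indexed by the bulk request.
AUTHORS = {
    "1": ("Peter", "Smith"),
    "2": ("Smith", "Williams"),
    "3": ("Jack", "Ma"),
    "4": ("Robbin", "Li"),
    "5": ("Tonny", "Peter Smith"),
}


def full_name_tokens(first, last):
    """Mimic the effect of copy_to: both values are searchable in one field."""
    return set((first + " " + last).lower().split())


def match_full_name(query, min_should_match):
    """Return ids of docs whose merged field has enough of the query terms."""
    terms = query.lower().split()
    hits = []
    for doc_id, (first, last) in AUTHORS.items():
        tokens = full_name_tokens(first, last)
        matched = sum(1 for term in terms if term in tokens)
        if matched >= min_should_match:
            hits.append(doc_id)
    return hits


# Requiring only one term keeps the long tail (doc 2 matches on Smith alone).
print(match_full_name("Peter Smith", 1))
# Requiring both terms trims it.
print(match_full_name("Peter Smith", 2))
```

With one required term, docs 1, 2, and 5 match; with both terms required, only docs 1 and 5 remain, which is the long-tail trimming described above.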

Second approach: multi_match + cross_fields

GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields", 
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}

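Problem 1: most_fields only finds docs in which as many fields as possible match, not docs in which one field matches completely --> Solved: every term is required to appear in at least one of the fields.

The query is split into the terms Peter and Smith.

Peter must appear in author_first_name or author_last_name.
Smith must appear in author_first_name or author_last_name.

Peter Smith may span several fields, so each term is required to appear in some field; only together do they form the identifier we want, the full name.

With most_fields, a doc like Smith Williams could also be returned, because most_fields only requires any one field to match, and the more fields match, the higher the score.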
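The difference between the two matching rules can be sketched as plain boolean logic. This is an illustration only, assuming lowercase, whitespace-split tokens and ignoring scoring; cross_fields_and and most_fields_or are hypothetical helper names, not Elasticsearch APIs.

```python
FIELDS = ("author_first_name", "author_last_name")

DOCS = {
    "1": {"author_first_name": "Peter", "author_last_name": "Smith"},
    "2": {"author_first_name": "Smith", "author_last_name": "Williams"},
    "3": {"author_first_name": "Jack", "author_last_name": "Ma"},
    "4": {"author_first_name": "Robbin", "author_last_name": "Li"},
    "5": {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}


def tokens(value):
    """Assumed analysis: lowercase and split on whitespace."""
    return set(value.lower().split())


def cross_fields_and(query):
    """Term-centric rule: every term must appear in at least one field."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, doc in DOCS.items()
        if all(any(term in tokens(doc[f]) for f in FIELDS) for term in terms)
    ]


def most_fields_or(query):
    """Field-centric rule: any field containing any term is enough."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, doc in DOCS.items()
        if any(term in tokens(doc[f]) for f in FIELDS for term in terms)
    ]


print(cross_fields_and("Peter Smith"))  # Smith Williams (doc 2) is excluded
print(most_fields_or("Peter Smith"))    # Smith Williams (doc 2) is included
```

Under the term-centric rule, doc 2 drops out because Peter appears in none of its fields, while the field-centric rule still returns it.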

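Problem 2: with most_fields there is no way to use minimum_should_match to trim the long tail, that is, results that match only a small part of the query --> Solved: since every term is required to appear, the long tail is necessarily removed.

For the query java hadoop spark, each of the 3 terms must appear in at least one field.

A document in which only one field contains java, for example, is dropped, so that long-tail result disappears.

Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents in that field is low and its score is high, and Smith Williams may rank ahead of Peter Smith --> Solved: when IDF is calculated, the IDF of each query term in each field is taken and the smallest value is used, so an extreme high value cannot occur.

The query Peter Smith is split into two terms:

Peter
Smith

In author_first_name, Smith occurs rarely across all docs, which gives it a very high IDF. A second IDF is computed from the frequency of Smith in author_last_name across all docs; Smith is generally more common as a last name, so that IDF is normal, not too high. For Smith, the smaller of the two IDF values is then used, so an excessively high IDF no longer occurs.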
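The "take the smaller IDF" rule described above can be written out as a short calculation. This is a simplified sketch of the idea as the text states it, not the exact internal behavior of Elasticsearch: it assumes a single shard with the five sample docs, the document frequencies counted from the sample data, and a BM25-style IDF formula. The names SMITH_DOC_FREQ, bm25_idf, and blended_idf are hypothetical.

```python
import math

TOTAL_DOCS = 5

# Number of sample docs containing "smith" in each field:
# first name only in doc 2; last name in docs 1 and 5.
SMITH_DOC_FREQ = {"author_first_name": 1, "author_last_name": 2}


def bm25_idf(n, total):
    """BM25-style IDF: rarer terms (smaller n) get a larger value."""
    return math.log(1 + (total - n + 0.5) / (n + 0.5))


# One IDF per field for the same term.
per_field_idf = {
    field: bm25_idf(n, TOTAL_DOCS) for field, n in SMITH_DOC_FREQ.items()
}

# Use the smallest per-field IDF for the term, so a field where the term
# happens to be rare cannot inflate the score.
blended_idf = min(per_field_idf.values())

for field, idf in per_field_idf.items():
    print(f"{field}: {idf:.3f}")
print(f"blended: {blended_idf:.3f}")
```

In this sketch the blended value equals the author_last_name IDF, so the rare first-name occurrence of Smith no longer gets an outsized weight.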
