Big Data Study Notes [17]: Using Field Collapsing in Elasticsearch 5.x [Repost]

Title: Using Field Collapsing in Elasticsearch 5.x
Author: medcl
URL:https://elasticsearch.cn/article/132


Elasticsearch 5.x has a very interesting feature called field collapsing (Field Collapsing, #22337), and I would like to share it here.

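Field collapsing is a long-standing request. See issue #256: it was first filed in July 2010 and is one of the most discussed threads (240+ comments). The feature took six years to land. Impressive, isn't it? Haha.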

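So, what is field collapsing? You can think of it as merging and de-duplicating results by a specific field. Say we have a recipe search and I want to collapse on each recipe's "cuisine" field, so that every cuisine contributes only one result; in other words, the results are de-duplicated by cuisine. When I search for the keyword "魚" (fish), I want the results to span different cuisines: 湘菜 (Hunan), 粵菜 (Cantonese), Chinese, Western, and not Hunan dishes only. That is the idea: collapsing on a specific field makes the search results more diverse.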
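To make the idea concrete, here is a minimal Python sketch of collapsing an already-ranked hit list. It only illustrates the concept and is not how Elasticsearch implements it; the helper name `collapse_by_field` is made up for this example, and the documents are three of the sample recipes used later in this article.

```python
def collapse_by_field(ranked_hits, field):
    """Keep only the first (best-ranked) hit for each distinct value of `field`."""
    seen = set()
    collapsed = []
    for hit in ranked_hits:
        key = hit[field]
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Three sample recipes, already ranked by rating (highest first).
ranked = [
    {"name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜"},
    {"name": "廣式鯽魚湯", "rating": 5, "type": "粵菜"},
    {"name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜"},
]

for hit in collapse_by_field(ranked, "type"):
    print(hit["type"], hit["name"])
```

The second 湘菜 dish is dropped because a better-ranked hit with the same `type` has already been kept.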

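At this point someone will surely think of using a terms agg + top hits agg. Combining these two aggregations does achieve the behavior above, but it has limitations: it cannot paginate (#4915); the results are not precise enough (top terms + top hits; ES's aggregation implementation trades accuracy for speed); and on large data sets aggregations are slow, which hurts the search experience.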

So how is the new field collapsing implemented? Here are the key points:

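  1. Collapsing and fetching inner_hits run in two phases (the combined-aggregation approach has only one phase), so the top hits are always exact.
  2. Field collapsing runs only at the top hits level. It does not need to compute the actual doc values for every collapse key over the complete result set each time; it only works on the small set of top hits, which saves a lot of memory compared with a terms agg.
  3. Because collapsing happens only on the top hits, it is much faster than the combined-aggregation approach.
  4. Collapsing top docs does not need global ordinals to convert strings, which also saves a lot of memory compared with aggs.
  5. Pagination becomes possible, with the same limitations as a regular search: fetch the from+size content first, then merge.
  6. search_after and scroll are not implemented yet, but they are feasible.
  7. Collapsing only affects the search hits, not aggregations. The total in the search result is the number of all matching documents; the number of de-duplicated results is unknown (it cannot be computed).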
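The points above can be sketched as a toy, single-process model in Python. This is only an illustration under simplifying assumptions (one in-memory list, no shards, no scoring); the function name `collapsed_search` and its parameters are made up for this sketch and are not Elasticsearch APIs. The data is the nine sample recipes indexed later in this article.

```python
from collections import defaultdict


def collapsed_search(matches, field, sort_key, from_=0, size=10, inner_size=0):
    """Toy model of a collapsed search over an in-memory list of matching docs."""
    # sorted() is stable, so ties keep their original relative order.
    ranked = sorted(matches, key=lambda d: d[sort_key], reverse=True)

    # Phase 1: walk the ranked hits and keep the first one per collapse key,
    # stopping once from_ + size groups have been collected.
    top, seen = [], set()
    for doc in ranked:
        if doc[field] not in seen:
            seen.add(doc[field])
            top.append(doc)
            if len(top) == from_ + size:
                break
    page = top[from_:from_ + size]

    # Phase 2: only for the groups on this page, pick their best inner hits.
    groups = defaultdict(list)
    for doc in ranked:
        groups[doc[field]].append(doc)
    hits = [
        {"doc": doc, "inner_hits": groups[doc[field]][:inner_size]}
        for doc in page
    ]

    # total counts every match, not the number of distinct groups.
    return {"total": len(matches), "hits": hits}


recipes = [
    {"name": "清蒸魚頭", "rating": 1, "type": "湘菜"},
    {"name": "剁椒魚頭", "rating": 2, "type": "湘菜"},
    {"name": "紅燒鯽魚", "rating": 3, "type": "湘菜"},
    {"name": "鯽魚湯(辣)", "rating": 3, "type": "湘菜"},
    {"name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜"},
    {"name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜"},
    {"name": "廣式鯽魚湯", "rating": 5, "type": "粵菜"},
    {"name": "魚香肉絲", "rating": 2, "type": "川菜"},
    {"name": "奶油鮑魚湯", "rating": 2, "type": "西菜"},
]

result = collapsed_search(recipes, "type", "rating", size=2, inner_size=2)
for hit in result["hits"]:
    print(hit["doc"]["type"], [d["name"] for d in hit["inner_hits"]])
```

Note that `total` stays at the number of matching documents however many groups come back, which mirrors point 7.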

Let's walk through a concrete example to see how it works. It is very simple to use.

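  • First prepare the index and the data. We use recipes as the example: name is the recipe name, type is the cuisine, and rating is the cumulative average user rating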
DELETE recipes
PUT recipes
POST recipes/type/_mapping
{
  "properties": {
    "name": {
      "type": "text"
    },
    "rating": {
      "type": "float"
    },
    "type": {
      "type": "keyword"
    }
  }
}
POST recipes/type/
{
  "name":"清蒸魚頭","rating":1,"type":"湘菜"
}

POST recipes/type/
{
  "name":"剁椒魚頭","rating":2,"type":"湘菜"
}

POST recipes/type/
{
  "name":"紅燒鯽魚","rating":3,"type":"湘菜"
}

POST recipes/type/
{
  "name":"鯽魚湯(辣)","rating":3,"type":"湘菜"
}

POST recipes/type/
{
  "name":"鯽魚湯(微辣)","rating":4,"type":"湘菜"
}

POST recipes/type/
{
  "name":"鯽魚湯(變態辣)","rating":5,"type":"湘菜"
}

POST recipes/type/
{
  "name":"廣式鯽魚湯","rating":5,"type":"粵菜"
}

POST recipes/type/
{
  "name":"魚香肉絲","rating":2,"type":"川菜"
}

POST recipes/type/
{
  "name":"奶油鮑魚湯","rating":2,"type":"西菜"
} 
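The documents above are indexed one request at a time. As an alternative that the original article does not use, they could be sent in a single `_bulk` request. The sketch below only builds the newline-delimited request body in Python; the helper name `bulk_body` is made up for this example, and sending the body to the cluster is left out.

```python
import json

# The first three sample documents from above.
recipes = [
    {"name": "清蒸魚頭", "rating": 1, "type": "湘菜"},
    {"name": "剁椒魚頭", "rating": 2, "type": "湘菜"},
    {"name": "紅燒鯽魚", "rating": 3, "type": "湘菜"},
]


def bulk_body(docs, index="recipes", doc_type="type"):
    """Build an NDJSON payload: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


print(bulk_body(recipes))
```

The printed text is what would go after a `POST _bulk` line in the console.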
  • Now let's see what an ordinary query gives us: search for dishes whose name contains the keyword "魚" (fish) and return 3 documents
POST recipes/type/_search
{
  "query": {
    "match": {
      "name": "魚"
    }
  },
  "size": 3
}
All 湘菜 (Hunan cuisine). Good grief. I am avoiding spicy food at the moment, so this first page of results is useless to me:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 0.26742277,
    "hits":  [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYF_OA-dG63Txsd",
        "_score": 0.26742277,
        "_source": {
          "name": "鯽魚湯(變態辣)",
          "rating": 5,
          "type": "湘菜"
        }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHXO_OA-dG63Txsa",
        "_score": 0.19100356,
        "_source": {
          "name": "紅燒鯽魚",
          "rating": 3,
          "type": "湘菜"
        }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHWy_OA-dG63TxsZ",
        "_score": 0.19100356,
        "_source": {
          "name": "剁椒魚頭",
          "rating": 2,
          "type": "湘菜"
        }
      }
    ]
  }
}
Let's try again. This time I add a sort on rating to see which dishes everyone likes and whether any of them appeals to me. Run the query:
POST recipes/type/_search
{
  "query": {
    "match": {
      "name": "魚"
    }
  },
  "sort": [
    {
      "rating": {
        "order": "desc"
      }
    }
  ],
  "size": 3
}
The result is slightly better, but 2 of the 3 hits are still 湘菜 (Hunan), which is not quite what I want:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits":  [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYF_OA-dG63Txsd",
        "_score": null,
        "_source": {
          "name": "鯽魚湯(變態辣)",
          "rating": 5,
          "type": "湘菜"
        },
        "sort":  [
          5
        ]
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYW_OA-dG63Txse",
        "_score": null,
        "_source": {
          "name": "廣式鯽魚湯",
          "rating": 5,
          "type": "粵菜"
        },
        "sort":  [
          5
        ]
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHX7_OA-dG63Txsc",
        "_score": null,
        "_source": {
          "name": "鯽魚湯(微辣)",
          "rating": 4,
          "type": "湘菜"
        },
        "sort":  [
          4
        ]
      }
    ]
  }
}
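Now I know what I want: show me the other cuisines. This place also serves Western, Cantonese and other cuisines, doesn't it? So give me one dish from each cuisine. Switch to a terms agg to get one bucket per unique term, then combine it with a top_hits agg that returns the top-rated hit in each bucket. It is a little involved, but the query below makes it clear: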
GET recipes/type/_search
{
  "query": {
    "match": {
      "name": "魚"
    }
  },
  "sort":  [
    {
      "rating": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "type": {
      "terms": {
        "field": "type",
        "size": 10
      },
      "aggs": {
        "rated": {
          "top_hits": {
            "sort":  [{
              "rating": {"order": "desc"}
            }], 
            "size": 1
          }
        }
      }
    }
  }, 
  "size": 0,
  "from": 0
} 
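Look at the result below. The JSON structure is a bit complex, but it is finally what we want: 湘菜 (Hunan), 粵菜 (Cantonese), 川菜 (Sichuan) and 西菜 (Western) all show up, one dish each, with no repeats: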
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 0,
    "hits":  []
  },
  "aggregations": {
    "type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets":  [
        {
          "key": "湘菜",
          "doc_count": 6,
          "rated": {
            "hits": {
              "total": 6,
              "max_score": null,
              "hits":  [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYF_OA-dG63Txsd",
                  "_score": null,
                  "_source": {
                    "name": "鯽魚湯(變態辣)",
                    "rating": 5,
                    "type": "湘菜"
                  },
                  "sort":  [
                    5
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "川菜",
          "doc_count": 1,
          "rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits":  [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYr_OA-dG63Txsf",
                  "_score": null,
                  "_source": {
                    "name": "魚香肉絲",
                    "rating": 2,
                    "type": "川菜"
                  },
                  "sort":  [
                    2
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "粵菜",
          "doc_count": 1,
          "rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits":  [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYW_OA-dG63Txse",
                  "_score": null,
                  "_source": {
                    "name": "廣式鯽魚湯",
                    "rating": 5,
                    "type": "粵菜"
                  },
                  "sort":  [
                    5
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "西菜",
          "doc_count": 1,
          "rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits":  [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHY3_OA-dG63Txsg",
                  "_score": null,
                  "_source": {
                    "name": "奶油鮑魚湯",
                    "rating": 2,
                    "type": "西菜"
                  },
                  "sort":  [
                    2
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
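Since the aggregation response is deeply nested, a small helper can flatten it into one row per bucket. A minimal Python sketch, assuming the response has already been parsed into a dict; the function name `top_hit_per_bucket` is made up for this example, and the sample dict is a trimmed copy of the first two buckets above.

```python
def top_hit_per_bucket(response, agg_name="type", sub_agg="rated"):
    """Return (bucket key, doc_count, name, rating) for each top hit in each bucket."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        for hit in bucket[sub_agg]["hits"]["hits"]:
            src = hit["_source"]
            rows.append((bucket["key"], bucket["doc_count"], src["name"], src["rating"]))
    return rows


# Trimmed copy of the response above (first two buckets only).
response = {
    "aggregations": {
        "type": {
            "buckets": [
                {
                    "key": "湘菜",
                    "doc_count": 6,
                    "rated": {"hits": {"total": 6, "hits": [
                        {"_source": {"name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜"}}
                    ]}},
                },
                {
                    "key": "川菜",
                    "doc_count": 1,
                    "rated": {"hits": {"total": 1, "hits": [
                        {"_source": {"name": "魚香肉絲", "rating": 2, "type": "川菜"}}
                    ]}},
                },
            ]
        }
    }
}

for row in top_hit_per_bucket(response):
    print(row)
```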
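As mentioned earlier, the approach above works but has limitations. So how does the new field collapsing handle it? The query is below: add a collapse parameter and specify which field to de-duplicate on. Here that is of course the cuisine field "type":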
GET recipes/type/_search
{
  "query": {
    "match": {
      "name": "魚"
    }
  },
  "collapse": {
    "field": "type"
  },
  "size": 3,
  "from": 0
}

The result is just what we wanted, and the hits have the familiar shape (they look the same as ordinary query results):

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits":  [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRJ_OA-dG63TxpW",
        "_score": 0.018980097,
        "_source": {
          "name": "鯽魚湯(微辣)",
          "rating": 4,
          "type": "湘菜"
        },
        "fields": {
          "type":  [
            "湘菜"
          ]
        }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRk_OA-dG63TxpZ",
        "_score": 0.013813315,
        "_source": {
          "name": "魚香肉絲",
          "rating": 2,
          "type": "川菜"
        },
        "fields": {
          "type":  [
            "川菜"
          ]
        }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRb_OA-dG63TxpY",
        "_score": 0.0125863515,
        "_source": {
          "name": "廣式鯽魚湯",
          "rating": 5,
          "type": "粵菜"
        },
        "fields": {
          "type":  [
            "粵菜"
          ]
        }
      }
    ]
  }
}
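Each collapsed hit carries its collapse key in the `fields` section, so it is easy to check that no key appears twice. A small Python sketch over a trimmed copy of the response above; the helper name `collapse_keys` is made up for this example.

```python
def collapse_keys(response, field):
    """Read the collapse key of every hit from its `fields` section."""
    return [hit["fields"][field][0] for hit in response["hits"]["hits"]]


# Trimmed copy of the collapsed response above.
response = {
    "hits": {
        "total": 9,
        "hits": [
            {"_source": {"name": "鯽魚湯(微辣)"}, "fields": {"type": ["湘菜"]}},
            {"_source": {"name": "魚香肉絲"}, "fields": {"type": ["川菜"]}},
            {"_source": {"name": "廣式鯽魚湯"}, "fields": {"type": ["粵菜"]}},
        ],
    }
}

keys = collapse_keys(response, "type")
print(keys, "unique:", len(keys) == len(set(keys)))
```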

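Now let me try paging. The previous request returned 3 hits, so I change from to 3 and leave the rest of the request unchanged. The response is as follows: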

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits":  [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRw_OA-dG63Txpa",
        "_score": 0.012546891,
        "_source": {
          "name": "奶油鮑魚湯",
          "rating": 2,
          "type": "西菜"
        },
        "fields": {
          "type":  [
            "西菜"
          ]
        }
      }
    ]
  }
}
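Paging through collapsed results only changes `from`. A minimal Python sketch that builds the request body for each page, assuming the same query and collapse field as in the request shown earlier; the helper name `collapse_request` is made up for this example and is not part of any Elasticsearch client.

```python
import json


def collapse_request(keyword, field, from_=0, size=3):
    """Build a collapsed search body for one page of results."""
    return {
        "query": {"match": {"name": keyword}},
        "collapse": {"field": field},
        "size": size,
        "from": from_,
    }


# Bodies for the first two pages (from = 0 and from = 3).
for page in range(2):
    body = collapse_request("魚", "type", from_=page * 3, size=3)
    print(json.dumps(body, ensure_ascii=False))
```

Because the total number of collapsed groups is not reported, a client would typically keep paging until a page comes back with fewer than `size` hits.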

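This time there is only one hit. After de-duplication there are only 4 results in total, so this is working as expected. But only one dish per cuisine? I am not happy with that: give me a few more dishes from each cuisine so I can choose. Add the inner_hits parameter to control how many documents are returned per group, here 2, and sort them by rating as well. The new query looks like this: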


GET recipes/type/_search
{
  "query": {
    "match": {
      "name": "魚"
    }
  },
  "collapse": {
    "field": "type",
    "inner_hits": {
      "name": "top_rated",
      "size": 2,
      "sort":  [
        {
          "rating": "desc"
        }
      ]
    }
  },
  "sort":  [
    {
      "rating": {
        "order": "desc"
      }
    }
  ],
  "size": 2,
  "from": 0
}

The query result is below. Perfect:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits":  [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYF_OA-dG63Txsd",
        "_score": null,
        "_source": {
          "name": "鯽魚湯(變態辣)",
          "rating": 5,
          "type": "湘菜"
        },
        "fields": {
          "type":  [
            "湘菜"
          ]
        },
        "sort":  [
          5
        ],
        "inner_hits": {
          "top_rated": {
            "hits": {
              "total": 6,
              "max_score": null,
              "hits":  [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYF_OA-dG63Txsd",
                  "_score": null,
                  "_source": {
                    "name": "鯽魚湯(變態辣)",
                    "rating": 5,
                    "type": "湘菜"
                  },
                  "sort":  [
                    5
                  ]
                },
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHX7_OA-dG63Txsc",
                  "_score": null,
                  "_source": {
                    "name": "鯽魚湯(微辣)",
                    "rating": 4,
                    "type": "湘菜"
                  },
                  "sort":  [
                    4
                  ]
                }
              ]
            }
          }
        }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYW_OA-dG63Txse",
        "_score": null,
        "_source": {
          "name": "廣式鯽魚湯",
          "rating": 5,
          "type": "粵菜"
        },
        "fields": {
          "type":  [
            "粵菜"
          ]
        },
        "sort":  [
          5
        ],
        "inner_hits": {
          "top_rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits":  [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYW_OA-dG63Txse",
                  "_score": null,
                  "_source": {
                    "name": "廣式鯽魚湯",
                    "rating": 5,
                    "type": "粵菜"
                  },
                  "sort":  [
                    5
                  ]
                }
              ]
            }
          }
        }
      }
    ]
  }
}
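To read this response in application code, the inner hits can be pulled out per group. A minimal Python sketch over a trimmed copy of the response above; the helper name `groups_with_inner_hits` is made up for this example.

```python
def groups_with_inner_hits(response, field, inner_name):
    """Map each collapse key to its group size and the names of its inner hits."""
    result = {}
    for hit in response["hits"]["hits"]:
        key = hit["fields"][field][0]
        inner = hit["inner_hits"][inner_name]["hits"]
        result[key] = {
            "group_total": inner["total"],
            "names": [h["_source"]["name"] for h in inner["hits"]],
        }
    return result


# Trimmed copy of the response above.
response = {
    "hits": {
        "total": 9,
        "hits": [
            {
                "fields": {"type": ["湘菜"]},
                "inner_hits": {"top_rated": {"hits": {
                    "total": 6,
                    "hits": [
                        {"_source": {"name": "鯽魚湯(變態辣)", "rating": 5}},
                        {"_source": {"name": "鯽魚湯(微辣)", "rating": 4}},
                    ],
                }}},
            },
            {
                "fields": {"type": ["粵菜"]},
                "inner_hits": {"top_rated": {"hits": {
                    "total": 1,
                    "hits": [{"_source": {"name": "廣式鯽魚湯", "rating": 5}}],
                }}},
            },
        ],
    }
}

for cuisine, info in groups_with_inner_hits(response, "type", "top_rated").items():
    print(cuisine, info["group_total"], info["names"])
```

The inner `total` is the size of the whole group (6 for 湘菜), even though only `size` documents are returned for it.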

That's all for this introduction to field collapsing.