Big Data Study [17] -- Using Field Collapsing in Elasticsearch 5.x [Repost]
Author: medcl
URL: https://elasticsearch.cn/article/132
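Elasticsearch 5.x has a very interesting feature called Field Collapsing (#22337), and I'd like to share it here.
Field collapsing is a long-standing request. See issue #256, which was first opened in July 2010 and is one of the most-discussed threads (240+ comments). The feature took six years to land, which is quite something, haha.
So, what is field collapsing? Think of it as merging and de-duplicating results by a specific field. Say we have a recipe search and I want to collapse on the recipe's "cuisine" field, so that every cuisine contributes one result, i.e. de-duplicate by cuisine. When I search for the keyword "魚" (fish), I want the results to cover all kinds of cuisines: Hunan (湘菜), Cantonese (粵菜), Chinese, Western, and not Hunan dishes only. That's the idea: collapsing on a specific field makes the search results more diverse.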
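To make the idea concrete, here is a minimal Python sketch of "keep the best-ranked hit per cuisine". It is only an illustration of the concept, not how Elasticsearch implements it; the `collapse` helper and the order of the `ranked` list are assumptions made up for this example.

```python
def collapse(hits, field):
    """Keep only the first (best-ranked) hit for each distinct value of `field`."""
    seen = set()
    result = []
    for hit in hits:
        key = hit[field]
        if key not in seen:
            seen.add(key)
            result.append(hit)
    return result

# Hits already ordered by relevance (this order is illustrative only).
ranked = [
    {"name": "鯽魚湯(變態辣)", "type": "湘菜"},
    {"name": "紅燒鯽魚", "type": "湘菜"},
    {"name": "剁椒魚頭", "type": "湘菜"},
    {"name": "廣式鯽魚湯", "type": "粵菜"},
    {"name": "魚香肉絲", "type": "川菜"},
]

collapsed = collapse(ranked, "type")
print([hit["type"] for hit in collapsed])
```

Three Hunan dishes shrink to one, and the other cuisines move up into view.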
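At this point someone will surely think of combining a terms agg with a top_hits agg. Combining the two aggregations can indeed do the above, but it has some limitations: no paging (#4915); results that are not exact (top terms + top hits; ES aggregations trade accuracy for speed); and with large data volumes the aggregation is slow, which hurts the search experience.
So how is the new field collapsing implemented? The key points are: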
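- Collapsing and fetching inner_hits run in two phases (the combined-aggregation approach has only one phase), so the top hits are always exact.
- Collapsing is performed only at the top hits level. There is no need to compute the actual doc values for every collapse key over the full result set each time; operating on the small set of top hits is enough, which saves a lot of memory compared with a terms agg.
- Because collapsing is done only on the top hits, it is much faster than the combined-aggregation approach.
- Collapsing top docs does not need global ordinals to convert strings, which also saves a lot of memory compared with aggs.
- Paging becomes possible, with the same limitation as regular search: the first from+size entries are fetched first and then merged.
- search_after and scroll are not implemented yet, but they are feasible.
- Collapsing only affects the search hits, not aggregations. The total in the search response is the number of all matching documents; the number of results after de-duplication is unknown (it cannot be computed).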
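The paging and total points above can be sketched in Python. This is a simplified model under my own assumptions, not Elasticsearch's actual code: the two-shard layout, the `search` and `collapse` helpers, and sorting by `rating` are all made up for illustration. Each shard returns at most from+size collapsed hits, the coordinator merges and collapses again, and `total` still counts every matching document.

```python
import heapq

def collapse(hits, field, limit):
    """Keep the first hit per distinct `field` value, stopping at `limit` groups."""
    seen, kept = set(), []
    for hit in hits:
        if hit[field] in seen:
            continue
        seen.add(hit[field])
        kept.append(hit)
        if len(kept) == limit:
            break
    return kept

def search(shards, field, from_, size):
    limit = from_ + size
    by_rating = lambda hit: -hit["rating"]
    # Each shard sorts its own hits and returns its top from+size collapsed hits.
    per_shard = [collapse(sorted(s, key=by_rating), field, limit) for s in shards]
    # The coordinator merges the sorted shard lists and collapses once more.
    merged = collapse(heapq.merge(*per_shard, key=by_rating), field, limit)
    # `total` counts every matching document, not the number of groups.
    total = sum(len(s) for s in shards)
    return {"total": total, "hits": merged[from_:from_ + size]}

# Hypothetical distribution of six matching documents over two shards.
shards = [
    [{"type": "湘菜", "rating": 5}, {"type": "湘菜", "rating": 4}, {"type": "川菜", "rating": 2}],
    [{"type": "粵菜", "rating": 5}, {"type": "湘菜", "rating": 3}, {"type": "西菜", "rating": 2}],
]

page1 = search(shards, "type", from_=0, size=3)
page2 = search(shards, "type", from_=3, size=3)
print(page1["total"], [h["type"] for h in page1["hits"]])
print(page2["total"], [h["type"] for h in page2["hits"]])
```

Note that `total` stays the same on both pages, and nothing in the response tells you there are only four groups until you page past the end.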
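Let's look at a concrete example to see how it works. It is very easy to use.
- First prepare the index and the data. We use recipes as the example: name is the recipe name, type is the cuisine, and rating is the cumulative average user rating.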
DELETE recipes
PUT recipes
POST recipes/type/_mapping
{
"properties": {
"name":{
"type": "text"
},
"rating":{
"type": "float"
},
"type":{
"type": "keyword"
}
}
}
POST recipes/type/
{
"name":"清蒸魚頭","rating":1,"type":"湘菜"
}
POST recipes/type/
{
"name":"剁椒魚頭","rating":2,"type":"湘菜"
}
POST recipes/type/
{
"name":"紅燒鯽魚","rating":3,"type":"湘菜"
}
POST recipes/type/
{
"name":"鯽魚湯(辣)","rating":3,"type":"湘菜"
}
POST recipes/type/
{
"name":"鯽魚湯(微辣)","rating":4,"type":"湘菜"
}
POST recipes/type/
{
"name":"鯽魚湯(變態辣)","rating":5,"type":"湘菜"
}
POST recipes/type/
{
"name":"廣式鯽魚湯","rating":5,"type":"粵菜"
}
POST recipes/type/
{
"name":"魚香肉絲","rating":2,"type":"川菜"
}
POST recipes/type/
{
"name":"奶油鮑魚湯","rating":2,"type":"西菜"
}
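As a side note, the nine single-document requests above could probably be folded into one `_bulk` call. Here is a hedged Python sketch that only builds the newline-delimited body (one action line plus one source line per document) and sends nothing. The `bulk_body` helper is my own name, and I am assuming the same `recipes` index and `type` mapping type as above.

```python
import json

# The same nine recipes as above: (name, rating, cuisine).
recipes = [
    ("清蒸魚頭", 1, "湘菜"),
    ("剁椒魚頭", 2, "湘菜"),
    ("紅燒鯽魚", 3, "湘菜"),
    ("鯽魚湯(辣)", 3, "湘菜"),
    ("鯽魚湯(微辣)", 4, "湘菜"),
    ("鯽魚湯(變態辣)", 5, "湘菜"),
    ("廣式鯽魚湯", 5, "粵菜"),
    ("魚香肉絲", 2, "川菜"),
    ("奶油鮑魚湯", 2, "西菜"),
]

def bulk_body(index, doc_type, rows):
    """Build an NDJSON body: an action line followed by a source line per document."""
    lines = []
    for name, rating, cuisine in rows:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(
            {"name": name, "rating": rating, "type": cuisine},
            ensure_ascii=False,
        ))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"

body = bulk_body("recipes", "type", recipes)
print(body)
```

The printed body would be posted to `_bulk` with an `application/x-ndjson` content type; auto-generated IDs are used, just like the single POSTs.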
- Now let's see what an ordinary query gives us: search for dishes whose name contains the keyword "魚" and return 3 results.
POST recipes/type/_search
{
"query": {
"match": {
"name": "魚"
}
},
"size": 3
}
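All Hunan dishes. Good grief, I've been feeling overheated lately and don't want anything spicy, so this first page of results is useless to me. Here it is:
{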
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0.26742277,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYF_OA-dG63Txsd",
"_score": 0.26742277,
"_source": {
"name": "鯽魚湯(變態辣)",
"rating": 5,
"type": "湘菜"
}
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHXO_OA-dG63Txsa",
"_score": 0.19100356,
"_source": {
"name": "紅燒鯽魚",
"rating": 3,
"type": "湘菜"
}
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHWy_OA-dG63TxsZ",
"_score": 0.19100356,
"_source": {
"name": "剁椒魚頭",
"rating": 2,
"type": "湘菜"
}
}
]
}
}
Let's try again. This time I add a sort by rating to see what everyone likes and whether there is something I'd enjoy. Run the query:
POST recipes/type/_search
{
"query": {
"match": {
"name": "魚"
}
},
"sort": [
{
"rating": {
"order": "desc"
}
}
],
"size": 3
}
Slightly better, but 2 of the 3 are still Hunan dishes, which is still not quite right. The results:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYF_OA-dG63Txsd",
"_score": null,
"_source": {
"name": "鯽魚湯(變態辣)",
"rating": 5,
"type": "湘菜"
},
"sort": [
5
]
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYW_OA-dG63Txse",
"_score": null,
"_source": {
"name": "廣式鯽魚湯",
"rating": 5,
"type": "粵菜"
},
"sort": [
5
]
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHX7_OA-dG63Txsc",
"_score": null,
"_source": {
"name": "鯽魚湯(微辣)",
"rating": 4,
"type": "湘菜"
},
"sort": [
4
]
}
]
}
}
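Now I get it: I want to see the other cuisines. This place also has Western food, Cantonese food, and so on, right? So give me one dish from each cuisine. Switch to a terms agg to get one bucket per unique term, then nest a top_hits agg that returns the first top hit sorted by rating. It's a bit complicated, but the query below makes it clear:
GET recipes/type/_search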
{
"query": {
"match": {
"name": "魚"
}
},
"sort": [
{
"rating": {
"order": "desc"
}
}
],
"aggs": {
"type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"rated": {
"top_hits": {
"sort": [{
"rating": {"order": "desc"}
}],
"size": 1
}
}
}
}
},
"size": 0,
"from": 0
}
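Look at the results below. The JSON structure is a bit complex, but it is finally what we want: Hunan (湘菜), Cantonese (粵菜), Sichuan (川菜), and Western (西菜) all show up, one each, with no repeats:
{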
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"type": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "湘菜",
"doc_count": 6,
"rated": {
"hits": {
"total": 6,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYF_OA-dG63Txsd",
"_score": null,
"_source": {
"name": "鯽魚湯(變態辣)",
"rating": 5,
"type": "湘菜"
},
"sort": [
5
]
}
]
}
}
},
{
"key": "川菜",
"doc_count": 1,
"rated": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYr_OA-dG63Txsf",
"_score": null,
"_source": {
"name": "魚香肉絲",
"rating": 2,
"type": "川菜"
},
"sort": [
2
]
}
]
}
}
},
{
"key": "粵菜",
"doc_count": 1,
"rated": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYW_OA-dG63Txse",
"_score": null,
"_source": {
"name": "廣式鯽魚湯",
"rating": 5,
"type": "粵菜"
},
"sort": [
5
]
}
]
}
}
},
{
"key": "西菜",
"doc_count": 1,
"rated": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHY3_OA-dG63Txsg",
"_score": null,
"_source": {
"name": "奶油鮑魚湯",
"rating": 2,
"type": "西菜"
},
"sort": [
2
]
}
]
}
}
}
]
}
}
}
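What the terms + top_hits combination computes can be mimicked in a few lines of Python. This is a rough sketch of the semantics only, not of how Elasticsearch executes aggregations; the `terms_top_hits` helper is a made-up name, and ordering ties by key is my assumption based on the response above.

```python
from collections import defaultdict

# All nine sample recipes match "魚", so every one of them lands in a bucket.
docs = [
    {"name": "清蒸魚頭", "rating": 1, "type": "湘菜"},
    {"name": "剁椒魚頭", "rating": 2, "type": "湘菜"},
    {"name": "紅燒鯽魚", "rating": 3, "type": "湘菜"},
    {"name": "鯽魚湯(辣)", "rating": 3, "type": "湘菜"},
    {"name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜"},
    {"name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜"},
    {"name": "廣式鯽魚湯", "rating": 5, "type": "粵菜"},
    {"name": "魚香肉絲", "rating": 2, "type": "川菜"},
    {"name": "奶油鮑魚湯", "rating": 2, "type": "西菜"},
]

def terms_top_hits(docs, field, sort_field, size=1):
    """Bucket every document by `field`, then keep the top `size` docs per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    # Order buckets by doc_count descending, breaking ties by key.
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {
            "key": key,
            "doc_count": len(members),
            "top": sorted(members, key=lambda d: -d[sort_field])[:size],
        }
        for key, members in ordered
    ]

buckets = terms_top_hits(docs, "type", "rating")
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"], bucket["top"][0]["name"])
```

The point to notice is that every matching document has to be visited and bucketed before any top hit can be picked, which is where the cost on large result sets comes from.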
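As noted earlier, the approach above works but has limitations. So how does the new field collapsing do it? The query is below: just add a collapse parameter that names the field to de-duplicate on. Here, of course, we de-duplicate on the cuisine field "type":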
GET recipes/type/_search
{
"query": {
"match": {
"name": "魚"
}
},
"collapse": {
"field": "type"
},
"size": 3,
"from": 0
}
The results look great, and the hits still have that familiar shape (they look just like ordinary query results):
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoDNlRJ_OA-dG63TxpW",
"_score": 0.018980097,
"_source": {
"name": "鯽魚湯(微辣)",
"rating": 4,
"type": "湘菜"
},
"fields": {
"type": [
"湘菜"
]
}
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoDNlRk_OA-dG63TxpZ",
"_score": 0.013813315,
"_source": {
"name": "魚香肉絲",
"rating": 2,
"type": "川菜"
},
"fields": {
"type": [
"川菜"
]
}
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoDNlRb_OA-dG63TxpY",
"_score": 0.0125863515,
"_source": {
"name": "廣式鯽魚湯",
"rating": 5,
"type": "粵菜"
},
"fields": {
"type": [
"粵菜"
]
}
}
]
}
}
Now let me try paging. The previous query returned 3 results, so I rerun the same query with from changed to 3. The response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoDNlRw_OA-dG63Txpa",
"_score": 0.012546891,
"_source": {
"name": "奶油鮑魚湯",
"rating": 2,
"type": "西菜"
},
"fields": {
"type": [
"西菜"
]
}
}
]
}
}
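Only one result this time. After de-duplication there are only 4 results in total, so this is working as expected. But only one dish per cuisine? I'm not happy with that; give me a few more per cuisine so I can choose. Add the inner_hits parameter to control how many are returned per group. Here we return 2, also sorted by rating. The new query: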
GET recipes/type/_search
{
"query": {
"match": {
"name": "魚"
}
},
"collapse": {
"field": "type",
"inner_hits": {
"name": "top_rated",
"size": 2,
"sort": [
{
"rating": "desc"
}
]
}
},
"sort": [
{
"rating": {
"order": "desc"
}
}
],
"size": 2,
"from": 0
}
The results are below. Perfect:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYF_OA-dG63Txsd",
"_score": null,
"_source": {
"name": "鯽魚湯(變態辣)",
"rating": 5,
"type": "湘菜"
},
"fields": {
"type": [
"湘菜"
]
},
"sort": [
5
],
"inner_hits": {
"top_rated": {
"hits": {
"total": 6,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYF_OA-dG63Txsd",
"_score": null,
"_source": {
"name": "鯽魚湯(變態辣)",
"rating": 5,
"type": "湘菜"
},
"sort": [
5
]
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHX7_OA-dG63Txsc",
"_score": null,
"_source": {
"name": "鯽魚湯(微辣)",
"rating": 4,
"type": "湘菜"
},
"sort": [
4
]
}
]
}
}
}
},
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYW_OA-dG63Txse",
"_score": null,
"_source": {
"name": "廣式鯽魚湯",
"rating": 5,
"type": "粵菜"
},
"fields": {
"type": [
"粵菜"
]
},
"sort": [
5
],
"inner_hits": {
"top_rated": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "recipes",
"_type": "type",
"_id": "AVoESHYW_OA-dG63Txse",
"_score": null,
"_source": {
"name": "廣式鯽魚湯",
"rating": 5,
"type": "粵菜"
},
"sort": [
5
]
}
]
}
}
}
}
]
}
}
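The two-phase flow behind this response (collapse first, then fetch inner hits per group) can be sketched roughly in Python. Again this is only a model of the behaviour seen above, not Elasticsearch internals; `collapse_with_inner_hits` and its parameters are names invented for this sketch.

```python
docs = [
    {"name": "清蒸魚頭", "rating": 1, "type": "湘菜"},
    {"name": "剁椒魚頭", "rating": 2, "type": "湘菜"},
    {"name": "紅燒鯽魚", "rating": 3, "type": "湘菜"},
    {"name": "鯽魚湯(辣)", "rating": 3, "type": "湘菜"},
    {"name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜"},
    {"name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜"},
    {"name": "廣式鯽魚湯", "rating": 5, "type": "粵菜"},
    {"name": "魚香肉絲", "rating": 2, "type": "川菜"},
    {"name": "奶油鮑魚湯", "rating": 2, "type": "西菜"},
]

def collapse_with_inner_hits(docs, field, sort_field, size, inner_size):
    ranked = sorted(docs, key=lambda d: -d[sort_field])
    # Phase 1: walk the ranked hits and keep the first hit of each group.
    seen, groups = set(), []
    for doc in ranked:
        if doc[field] not in seen:
            seen.add(doc[field])
            groups.append(doc)
            if len(groups) == size:
                break
    # Phase 2: for each collapsed hit, look up that group's own top hits.
    results = []
    for top in groups:
        members = [d for d in ranked if d[field] == top[field]]
        results.append({
            "hit": top,
            "inner_total": len(members),
            "inner_hits": members[:inner_size],
        })
    return results

results = collapse_with_inner_hits(docs, "type", "rating", size=2, inner_size=2)
for group in results:
    names = [d["name"] for d in group["inner_hits"]]
    print(group["hit"]["type"], group["inner_total"], names)
```

Because phase 2 runs per group key rather than inside a single aggregation pass, the inner hits for each returned group are exact.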
That's it for this introduction to field collapsing.