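Aggregation queries on Elasticsearch data
The aggregation framework returns aggregated data for the documents matched by a search query. Aggregation is a core database feature, and Elasticsearch (ES), being both a search engine and a data store, provides powerful aggregation and analysis capabilities. An aggregation takes the documents selected by the query, groups them into buckets, and computes values over them, much like GROUP BY combined with aggregate functions in SQL. Aggregations can be nested to build complex operations (a bucketing aggregation can contain sub-aggregations).
The value being aggregated can come from a field or from the result of a script. Aggregations are defined in the request body under the aggregations node, using the following syntax:
"aggregations" : {                                       // can be abbreviated to aggs
    "<aggregation_name>" : {                             // name of the aggregation
        "<aggregation_type>" : {                         // type of the aggregation
            <aggregation_body>                           // aggregation body: which fields to aggregate on
        }
        [,"meta" : { [<meta_data_body>] } ]?             // metadata
        [,"aggregations" : { [<sub_aggregation>]+ } ]?   // sub-aggregations defined inside this aggregation
    }
    [,"<aggregation_name_2>" : { ... } ]*                // further aggregations
}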
1. Preparing the data
(1) Create the employee index
PUT employee
{
    "mappings": {
        "properties": {
            "id": { "type": "integer" },
            "name": { "type": "keyword" },
            "job": { "type": "keyword" },
            "age": { "type": "integer" },
            "gender": { "type": "keyword" }
        }
    },
    "settings": {
        "index": {
            "number_of_shards": 3,    # number of shards
            "number_of_replicas": 2   # number of replicas
        }
    }
}
(2) Insert the data
POST employee/_bulk
{"index": {"_id": 1}}
{"id": 1, "name": "Bob", "job": "java", "age": 21, "sal": 8000, "gender": "male"}
{"index": {"_id": 2}}
{"id": 2, "name": "Rod", "job": "html", "age": 31, "sal": 18000, "gender": "female"}
{"index": {"_id": 3}}
{"id": 3, "name": "Gaving", "job": "java", "age": 24, "sal": 12000, "gender": "male"}
{"index": {"_id": 4}}
{"id": 4, "name": "King", "job": "dba", "age": 26, "sal": 15000, "gender": "female"}
{"index": {"_id": 5}}
{"id": 5, "name": "Jonhson", "job": "dba", "age": 29, "sal": 16000, "gender": "male"}
{"index": {"_id": 6}}
{"id": 6, "name": "Douge", "job": "java", "age": 41, "sal": 20000, "gender": "female"}
{"index": {"_id": 7}}
{"id": 7, "name": "cutting", "job": "dba", "age": 27, "sal": 7000, "gender": "male"}
{"index": {"_id": 8}}
{"id": 8, "name": "Bona", "job": "html", "age": 22, "sal": 14000, "gender": "female"}
{"index": {"_id": 9}}
{"id": 9, "name": "Shyon", "job": "dba", "age": 20, "sal": 19000, "gender": "female"}
{"index": {"_id": 10}}
{"id": 10, "name": "James", "job": "html", "age": 18, "sal": 22000, "gender": "male"}
{"index": {"_id": 11}}
{"id": 11, "name": "Golsling", "job": "java", "age": 32, "sal": 23000, "gender": "female"}
{"index": {"_id": 12}}
{"id": 12, "name": "Lily", "job": "java", "age": 24, "sal": 2000, "gender": "male"}
{"index": {"_id": 13}}
{"id": 13, "name": "Jack", "job": "html", "age": 23, "sal": 3000, "gender": "female"}
{"index": {"_id": 14}}
{"id": 14, "name": "Rose", "job": "java", "age": 36, "sal": 6000, "gender": "female"}
{"index": {"_id": 15}}
{"id": 15, "name": "Will", "job": "dba", "age": 38, "sal": 4500, "gender": "male"}
{"index": {"_id": 16}}
{"id": 16, "name": "smith", "job": "java", "age": 32, "sal": 23000, "gender": "male"}
# The bulk body must end with a newline after the last document.
About the data: the documents are employee records, where name is the employee's name, job is the job role, age is the age, sal is the salary, and gender is the gender.
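Metric aggregations
A metric aggregation computes a metric over a set of documents (for example the maximum, minimum, sum, or average of a field across all documents). The output is usually the metric value itself, which effectively adds statistical information about the documents.
It computes numeric metrics over the aggregated documents from a specific field or from script-generated values. Numeric metric aggregations that output a single value are called single-value numeric metrics; those that produce several values (for example stats) are called multi-value numeric metrics. This classification applies only to numeric metric aggregations.
max min sum avg
- Max Aggregation: returns the maximum. It takes a value from each document (a specific numeric field, or a value computed by a script) and returns the largest of those values across the aggregated documents.
- Min Aggregation: returns the minimum. Otherwise as above.
- Sum Aggregation: returns the sum. Otherwise as above.
- Avg Aggregation: returns the average. Otherwise as above.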
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"max_sal": {
"max": { "field": "sal"}
}
}
}
Response
{
"took": 40,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"max_sal": {
"value": 23000
}
}
}
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"min_sal": {
"min": { "field": "sal"}
}
}
}
Response
{
"took": 40,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"min_sal": {
"value": 2000
}
}
}
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"sum_sal": {
"sum": { "field": "sal"}
}
}
}
Response
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"sum_sal": {
"value": 212500
}
}
}
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"avg_sal": {
"avg": { "field": "sal"}
}
}
}
Response
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"avg_sal": {
"value": 13281.25
}
}
}
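The four responses above can be cross-checked outside ES. The following is a minimal Python sketch, not ES code; the SALARIES list simply copies the sal values from the bulk request, and the function name is made up for this illustration.

```python
# sal values of the 16 sample employees, in _id order (copied from the bulk request).
SALARIES = [8000, 18000, 12000, 15000, 16000, 20000, 7000, 14000,
            19000, 22000, 23000, 2000, 3000, 6000, 4500, 23000]


def single_value_metrics(values):
    """Compute the four single-value metrics over a list of numbers."""
    return {
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }


metrics = single_value_metrics(SALARIES)
print(metrics)
```

The printed values should match max_sal, min_sal, sum_sal, and avg_sal in the responses above.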
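Value count
Value count aggregation. It counts the values extracted from the aggregated documents (from a specific field, or computed by a script). It is typically used together with other single-value aggregations; for example, when computing the average of a field, you may also want to know how many values that average was computed from.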
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_count": {
"value_count": { "field": "age"}
}
}
}
Response
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_count": {
"value": 16
}
}
}
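Distinct (cardinality) aggregation
Cardinality aggregation. It is a single-value metric that, based on a value from each document (a specific field, or computed by a script), counts the number of distinct values, comparable to COUNT(DISTINCT ...) in SQL. ES computes this count approximately, so on high-cardinality fields the result can deviate slightly from the exact number.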
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
},
"job_count": {
"cardinality": {
"field": "job"
}
}
}
}
Response
{
"took": 32,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"job_count": {
"value": 3
},
"age_count": {
"value": 14
}
}
}
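The difference between value_count and cardinality is easiest to see side by side. This is a small Python sketch over the sample data; it counts exactly with a set, whereas ES uses an approximate algorithm, and the function names are invented for the illustration.

```python
# age and job values of the 16 sample employees, in _id order.
AGES = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]
JOBS = ["java", "html", "java", "dba", "dba", "java", "dba", "html",
        "dba", "html", "java", "java", "html", "java", "dba", "java"]


def value_count(values):
    """Number of values, duplicates included (what value_count reports)."""
    return len(values)


def exact_cardinality(values):
    """Number of distinct values (an exact stand-in for cardinality)."""
    return len(set(values))


print(value_count(AGES), exact_cardinality(AGES), exact_cardinality(JOBS))
```

The ages 24 and 32 each occur twice, which is why 16 values collapse to 14 distinct ones.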
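Stats aggregation
Stats aggregation. It is a multi-value metric that, based on a value from each document (a specific numeric field, or computed by a script), computes five statistics: min, max, sum, count, and avg.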
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
Response
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_stats": {
"count": 16,
"min": 18,
"max": 41,
"avg": 27.75,
"sum": 444
}
}
}
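Extended stats aggregation
Extended stats aggregation. It is a multi-value metric that adds four statistics to those of stats: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.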
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_stats": {
"extended_stats": {
"field": "age"
}
}
}
}
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_stats": {
"count": 16,
"min": 18,
"max": 41,
"avg": 27.75,
"sum": 444,
"sum_of_squares": 13006,
"variance": 42.8125,
"variance_population": 42.8125,
"variance_sampling": 45.666666666666664,
"std_deviation": 6.5431261641512,
"std_deviation_population": 6.5431261641512,
"std_deviation_sampling": 6.757711644237764,
"std_deviation_bounds": {
"upper": 40.8362523283024,
"lower": 14.6637476716976,
"upper_population": 40.8362523283024,
"lower_population": 14.6637476716976,
"upper_sampling": 41.26542328847553,
"lower_sampling": 14.234576711524472
}
}
}
}
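The extra fields in the response above follow from textbook formulas. Here is a hedged Python sketch that recomputes them from the sample ages; the function name and dictionary layout are chosen for this illustration, and the default of two standard deviations mirrors the description above.

```python
import math

# age values of the 16 sample employees, in _id order.
AGES = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]


def extended_stats(values, sigma=2.0):
    """Recompute the extended statistics for a list with at least two values."""
    n = len(values)
    avg = sum(values) / n
    sum_of_squares = sum(v * v for v in values)
    # Population variance divides by n, sampling variance by n - 1.
    variance_population = sum_of_squares / n - avg * avg
    variance_sampling = (sum_of_squares - n * avg * avg) / (n - 1)
    std_population = math.sqrt(variance_population)
    return {
        "sum_of_squares": sum_of_squares,
        "variance_population": variance_population,
        "variance_sampling": variance_sampling,
        "std_deviation_population": std_population,
        "std_deviation_sampling": math.sqrt(variance_sampling),
        "upper_population": avg + sigma * std_population,
        "lower_population": avg - sigma * std_population,
    }


stats = extended_stats(AGES)
print(stats)
```

In the response, the unsuffixed variance, std_deviation, upper, and lower fields carry the same values as their population counterparts.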
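Percentiles aggregation
Percentiles aggregation. It is a multi-value metric. The values of the given field (or script) are ordered from smallest to largest, and for each requested percentage the aggregation returns the value at or below which that share of the matched documents falls. By default it returns the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles.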
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_percents": {
"percentiles": {
"field": "age"
}
}
}
}
Response
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_percents": {
"values": {
"1.0": 18,
"5.0": 18.6,
"25.0": 22.5,
"50.0": 26.5, // 50% of the documents have age <= 26.5; put the other way, documents with age <= 26.5 make up 50% of all matched documents
"75.0": 32,
"95.0": 40.099999999999994,
"99.0": 41
}
}
}
}
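To make the numbers above less opaque, here is a Python sketch that interpolates percentiles from the sorted ages. ES computes percentiles with an approximation algorithm, so the interpolation rule below is an assumption: it treats every value as the centre of its own slot, which happens to reproduce the response for these 16 documents and is not a description of the ES implementation.

```python
import math

# age values of the 16 sample employees, in _id order.
AGES = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]


def percentile(values, percent):
    """Interpolate the requested percentile (assumed midpoint rule)."""
    ordered = sorted(values)
    n = len(ordered)
    # The value at sorted index i is treated as sitting at rank i + 0.5.
    position = percent / 100 * n - 0.5
    if position <= 0:
        return float(ordered[0])
    if position >= n - 1:
        return float(ordered[-1])
    lower = int(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


for p in (1, 5, 25, 50, 75, 95, 99):
    print(p, percentile(AGES, p))
```

On larger data sets the values returned by ES can differ from any exact interpolation, because the underlying algorithm trades accuracy for memory.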
Specifying the percentiles
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_percents": {
"percentiles": {
"field": "age",
"percents": [95,99,99.9]
}
}
}
}
Response
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_percents": {
"values": {
"95.0": 40.099999999999994,
"99.0": 41,
"99.9": 41
}
}
}
}
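Percentile ranks aggregation
Percentile ranks answer the reverse question: what percentage of the documents falls below a given value. Use this aggregation, for example, to find the share of documents with age below 25 and the share with age below 30.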
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"gge_perc_rank": {
"percentile_ranks": {
"field": "age",
"values": [25,30]
}
}
}
}
Response
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"gge_perc_rank": {
"values": { // 43.75% of the documents have age below 25, and 62.5% have age below 30
"25.0": 43.75,
"30.0": 62.5
}
}
}
}
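The two percentages can be verified by counting. The Python sketch below uses exact counting as a simplification; ES estimates these ranks with the same approximation machinery as percentiles, so treat the function as an illustration of the meaning rather than of the implementation.

```python
# age values of the 16 sample employees, in _id order.
AGES = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]


def percentile_rank(values, threshold):
    """Percentage of values strictly below the threshold."""
    below = sum(1 for v in values if v < threshold)
    return 100.0 * below / len(values)


print(percentile_rank(AGES, 25), percentile_rank(AGES, 30))
```

Seven of the 16 ages are below 25 and ten are below 30, which gives 43.75 and 62.5.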
Top Hits
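Top hits aggregation. It returns the top n documents of each group, comparable to taking the top n rows per group after a GROUP BY in SQL. It keeps track of the most relevant documents being aggregated and is generally used as a sub-aggregation, so that each bucket returns its best-matching documents. It is one of the more commonly used aggregations.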
POST employee/_doc/_search
{
"size":0,
"query": {
"match_all": {}
},
"aggs": {
"group_by_job": {
"terms": {
"field": "job",
"size": 2 // number of buckets to return
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"size": 5 // maximum number of documents to return per bucket
}
}
}
}
}
}
Response
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"group_by_job": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 9,
"buckets": [
{
"key": "java",
"doc_count": 7,
"top_tag_hits": {
"hits": {
"total": {
"value": 7,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "employee",
"_type": "_doc",
"_id": "3",
"_score": 1,
"_source": {
"id": 3,
"name": "Gaving",
"job": "java",
"age": 24,
"sal": 12000,
"gender": "male"
}
}
]
}
}
}
]
}
}
}
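The "top n per group" idea can be sketched in plain Python. One assumption is added here that the request above does not make: hits inside a bucket are ordered by sal descending, as if the top_hits aggregation carried a sort on sal. With match_all every document has the same score, so without a sort the order inside a bucket is not meaningful. The function name and tuple layout are invented for the illustration.

```python
from collections import defaultdict

# (name, job, sal) of the 16 sample employees, in _id order.
EMPLOYEES = [
    ("Bob", "java", 8000), ("Rod", "html", 18000), ("Gaving", "java", 12000),
    ("King", "dba", 15000), ("Jonhson", "dba", 16000), ("Douge", "java", 20000),
    ("cutting", "dba", 7000), ("Bona", "html", 14000), ("Shyon", "dba", 19000),
    ("James", "html", 22000), ("Golsling", "java", 23000), ("Lily", "java", 2000),
    ("Jack", "html", 3000), ("Rose", "java", 6000), ("Will", "dba", 4500),
    ("smith", "java", 23000),
]


def top_hits_per_job(docs, bucket_count, hits_per_bucket):
    """Group by job, keep the largest buckets, then keep the best-paid docs in each."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[1]].append(doc)
    # Like terms "size": keep only the buckets with the most documents.
    largest = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    largest = largest[:bucket_count]
    # Like top_hits "size" with an assumed sort on sal descending.
    return {
        job: sorted(members, key=lambda d: d[2], reverse=True)[:hits_per_bucket]
        for job, members in largest
    }


result = top_hits_per_job(EMPLOYEES, bucket_count=2, hits_per_bucket=2)
for job, hits in result.items():
    print(job, [name for name, _, _ in hits])
```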
Geo Bounds Aggregation
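Geo bounds aggregation. Based on a geo_point field, it computes the bounding box that contains all of the field's coordinates (returned as the top-left and bottom-right points).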
POST region/_doc/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"viewport": {
"geo_bounds": {
"field": "location",
"wrap_longitude": true // whether the bounding box is allowed to overlap the international date line
}
}
}
}
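No sample data for the region index is given, so the following Python sketch uses three made-up coordinates to show what the top-left and bottom-right corners mean. It deliberately ignores the date-line handling controlled by wrap_longitude, and both the points and the function name are assumptions for illustration only.

```python
# Hypothetical (lat, lon) points; the region index content is not shown in this article.
POINTS = [(48.86, 2.33), (52.37, 4.91), (51.22, 4.40)]


def geo_bounds(points):
    """Bounding box of (lat, lon) points that do not cross the date line."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        # Top-left: northernmost latitude, westernmost longitude.
        "top_left": {"lat": max(lats), "lon": min(lons)},
        # Bottom-right: southernmost latitude, easternmost longitude.
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


bounds = geo_bounds(POINTS)
print(bounds)
```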
Geo Centroid Aggregation
Geo centroid aggregation. Based on a geo_point field, it computes the weighted centroid of all coordinates.
POST region/_doc/_search
{
"query" : {
"match" : { "crime" : "burglary" }
},
"aggs" : {
"centroid" : {
"geo_centroid" : {
"field" : "location"
}
}
}
}
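Bucket aggregations
A bucket aggregation groups documents (similar to GROUP BY in SQL): documents that share a given characteristic are placed in the same bucket. The output is usually a set of buckets, each containing multiple documents (one bucket is one group).
Each bucket aggregation has a key (a field or a script) and a set of bucketing criteria. When the aggregation runs, every document is tested against each criterion, and a document that matches a criterion falls into the corresponding bucket.
Bucket aggregations do not compute metrics. They group documents according to the criteria given in the aggregation request (for example {"from":0, "to":100}), and additionally return the number of documents in each bucket.
They can contain sub-aggregations (metric aggregations cannot contain sub-aggregations, but they can be used as sub-aggregations). A sub-aggregation is applied to every bucket produced by its parent aggregation.
Depending on the aggregation, the output can be a single bucket, a fixed number of buckets (multi-bucket), or a number of buckets determined dynamically while aggregating (for example the terms aggregation).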
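The relationship between a bucket aggregation and a metric sub-aggregation can be sketched in a few lines of Python. The example groups the sample employees by job and computes an average salary per bucket; the function name and the output layout are illustrative choices, not an ES response format.

```python
from collections import defaultdict

# (job, sal) of the 16 sample employees, in _id order.
DOCS = [
    ("java", 8000), ("html", 18000), ("java", 12000), ("dba", 15000),
    ("dba", 16000), ("java", 20000), ("dba", 7000), ("html", 14000),
    ("dba", 19000), ("html", 22000), ("java", 23000), ("java", 2000),
    ("html", 3000), ("java", 6000), ("dba", 4500), ("java", 23000),
]


def avg_sal_by_job(docs):
    """Bucket documents by job, then apply an avg metric to every bucket."""
    buckets = defaultdict(list)
    for job, sal in docs:
        buckets[job].append(sal)
    return {
        job: {"doc_count": len(sals), "avg_sal": sum(sals) / len(sals)}
        for job, sals in buckets.items()
    }


by_job = avg_sal_by_job(DOCS)
print(by_job)
```

The bucketing step only decides membership and counts documents; the metric runs once per bucket, which is exactly the parent and sub-aggregation split described above.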
Terms Aggregation
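Terms aggregation. Based on a field, each unique term in that field becomes a bucket, and the number of documents in each bucket is counted. By default the buckets are ordered by document count, largest first. It is a multi-bucket aggregation. When not all buckets are returned (controlled by size), the document counts may be inaccurate.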
POST employee/_doc/_search
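{
"size": 0, // return no hits; typical for statistics and aggregations, where the actual document list is not needed
"aggs": {
"age_terms": {
"terms": {
"field": "job", // field to aggregate on
"size": 10, // maximum number of buckets to return (prevents returning too many); defaults to 10
"order": {"_count": "asc"}, // sort by document count; to sort by bucket key use { "_key" : "asc" }
"min_doc_count": 1, // only return buckets with at least this many documents
"include": ".*dba.*", // include filter: keep only terms matching this regular expression
"exclude": "html.*", // exclude filter: drop terms matching this regular expression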
"missing": "N/A"
}
}
}
}
Response
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "dba",
"doc_count": 5
}
]
}
}
}
Specifying how many buckets each shard returns
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_terms": {
"terms": {
"field": "job",
"size": 10,
"shard_size": 20, // number of buckets each shard returns; default: size when the index has a single shard, size * 1.5 + 10 with multiple shards
"show_term_doc_count_error": true // show the error bound on each bucket
}
}
}
}
Response
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
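},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0, // upper bound of the error on the document counts
"sum_other_doc_count": 0, // total document count of the buckets that were not returned
"buckets": [ // by default the top 10 buckets are returned, ordered by document count from high to low
{
"key": "java", // 7 documents have job java
"doc_count": 7,
"doc_count_error_upper_bound": 0
},
{
"key": "dba", // 5 documents have job dba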
"doc_count": 5,
"doc_count_error_upper_bound": 0
},
{
"key": "html",
"doc_count": 4,
"doc_count_error_upper_bound": 0
}
]
}
}
}
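As a cross-check of the buckets above, the following Python sketch counts terms with a Counter and evaluates the default shard_size rule quoted in the request comment. It runs on a single in-memory list, so it shows the counting and the arithmetic only, not the per-shard merging that makes counts approximate; the function names are invented for the illustration.

```python
from collections import Counter

# job values of the 16 sample employees, in _id order.
JOBS = ["java", "html", "java", "dba", "dba", "java", "dba", "html",
        "dba", "html", "java", "java", "html", "java", "dba", "java"]


def terms_buckets(values, size=10):
    """Return up to `size` (term, doc_count) pairs, largest count first."""
    return Counter(values).most_common(size)


def default_shard_size(size, shard_count):
    """Default shard_size as stated above: size for one shard, else size * 1.5 + 10."""
    return size if shard_count == 1 else int(size * 1.5 + 10)


print(terms_buckets(JOBS))
print(default_shard_size(10, 3))
```

With size 10 on the three-shard employee index, each shard would by default be asked for 25 candidate buckets.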
Filter Aggregation
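Filter aggregation. It narrows the current set of documents to those matching a single condition, and aggregates over that subset.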
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"args_term": {
"filter": {
"match": {
"job": "java"
}
},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"args_term": {
"doc_count": 7,
"avg_age": {
"value": 30
}
}
}
}
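The same result can be reproduced with a filter followed by an average. This is a plain Python sketch over the sample data, with a function name made up for the illustration.

```python
# (job, age) of the 16 sample employees, in _id order.
DOCS = [
    ("java", 21), ("html", 31), ("java", 24), ("dba", 26), ("dba", 29),
    ("java", 41), ("dba", 27), ("html", 22), ("dba", 20), ("html", 18),
    ("java", 32), ("java", 24), ("html", 23), ("java", 36), ("dba", 38),
    ("java", 32),
]


def filter_then_avg_age(docs, job):
    """Keep only documents with the given job, then average their age."""
    ages = [age for doc_job, age in docs if doc_job == job]
    return {"doc_count": len(ages), "avg_age": sum(ages) / len(ages)}


java_bucket = filter_then_avg_age(DOCS, "java")
print(java_bucket)
```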
Filters Aggregation
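Filters aggregation. It defines several filter conditions, and each filter gets a bucket containing all documents that match it (so the same document can appear in more than one bucket); the filtering happens first, then the aggregation. It is a multi-bucket aggregation.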
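Since no request is shown for this aggregation, here is a hedged Python sketch of the idea using the sample employees. The two filter names and conditions (java, female) are chosen purely for illustration; the point is that one document can land in several buckets.

```python
# (_id, job, gender) of the 16 sample employees.
DOCS = [
    (1, "java", "male"), (2, "html", "female"), (3, "java", "male"),
    (4, "dba", "female"), (5, "dba", "male"), (6, "java", "female"),
    (7, "dba", "male"), (8, "html", "female"), (9, "dba", "female"),
    (10, "html", "male"), (11, "java", "female"), (12, "java", "male"),
    (13, "html", "female"), (14, "java", "female"), (15, "dba", "male"),
    (16, "java", "male"),
]

# Hypothetical named filters: each one defines its own bucket.
FILTERS = {
    "java": lambda doc: doc[1] == "java",
    "female": lambda doc: doc[2] == "female",
}


def filters_aggregation(docs, filters):
    """Build one bucket per named filter; buckets may overlap."""
    return {name: [d for d in docs if matches(d)] for name, matches in filters.items()}


buckets = filters_aggregation(DOCS, FILTERS)
overlap = {d[0] for d in buckets["java"]} & {d[0] for d in buckets["female"]}
print({name: len(members) for name, members in buckets.items()}, sorted(overlap))
```

Documents 6, 11, and 14 match both filters, so they are counted in both buckets.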
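Range aggregation
Range aggregation. Based on a value (a field or a script), it buckets documents by [value ranges]. Each range includes its from value and excludes its to value (closed on the left, open on the right). It is a multi-bucket aggregation.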
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{
"to": 25
},
{
"from": 25,
"to": 35
},
{
"from": 35
}
]
},
"aggs": {
"bmax": {
"max": {
"field": "sal"
}
}
}
}
}
}
Response
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_range": {
"buckets": [
{
"key": "*-25.0",
"to": 25,
"doc_count": 7,
"bmax": {
"value": 22000
}
},
{
"key": "25.0-35.0",
"from": 25,
"to": 35,
"doc_count": 6,
"bmax": {
"value": 23000
}
},
{
"key": "35.0-*",
"from": 35,
"doc_count": 3,
"bmax": {
"value": 20000
}
}
]
}
}
}
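The half-open range rule and the max sub-aggregation can be checked with a short Python sketch. None stands for an open end of a range; the function name and output layout are illustrative only.

```python
# (age, sal) of the 16 sample employees, in _id order.
DOCS = [
    (21, 8000), (31, 18000), (24, 12000), (26, 15000), (29, 16000),
    (41, 20000), (27, 7000), (22, 14000), (20, 19000), (18, 22000),
    (32, 23000), (24, 2000), (23, 3000), (36, 6000), (38, 4500),
    (32, 23000),
]

# (from, to) pairs; from is inclusive, to is exclusive, None means unbounded.
RANGES = [(None, 25), (25, 35), (35, None)]


def range_buckets(docs, ranges):
    """Bucket by age range and report doc_count plus the maximum sal per bucket."""
    result = []
    for low, high in ranges:
        sals = [
            sal for age, sal in docs
            if (low is None or age >= low) and (high is None or age < high)
        ]
        result.append({"from": low, "to": high, "doc_count": len(sals), "bmax": max(sals)})
    return result


age_range = range_buckets(DOCS, RANGES)
for bucket in age_range:
    print(bucket)
```

An employee aged exactly 25 would fall into the second bucket, because from is inclusive and to is exclusive.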
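Date range aggregation
Date range aggregation. Based on a date-typed value, it buckets documents by [date ranges]. The ranges can be written with Date Math expressions. As with the range aggregation, each range includes its from value and excludes its to value. It is a multi-bucket aggregation.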
POST employee/_doc/_search
{
"size": 0,
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{
"to": "now-10M/M"
},
{
"from": "now-10M/M"
}
]
}
}
}
}
Response (the employee documents have no date field, so both buckets are empty)
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"range": {
"buckets": [
{
"key": "*-01-2021",
"to": 1609459200000,
"to_as_string": "01-2021",
"doc_count": 0
},
{
"key": "01-2021-*",
"from": 1609459200000,
"from_as_string": "01-2021",
"doc_count": 0
}
]
}
}
}
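The expression now-10M/M means "subtract ten months from the current time, then round down to the start of that month". The Python sketch below evaluates that rule for an assumed current time of 15 November 2021 (UTC); the response above does not state when the request ran, but any moment in November 2021 yields the same boundary. The function name is made up for the illustration.

```python
from datetime import datetime, timezone


def months_ago_floor(now, months):
    """Evaluate a date-math expression of the form now-<months>M/M in UTC."""
    # Count months since year 0, subtract, then convert back to year and month.
    index = now.year * 12 + (now.month - 1) - months
    year, month_zero_based = divmod(index, 12)
    # Rounding down to the month means day 1 at midnight.
    return datetime(year, month_zero_based + 1, 1, tzinfo=timezone.utc)


now = datetime(2021, 11, 15, 12, 0, tzinfo=timezone.utc)  # assumed "now"
boundary = months_ago_floor(now, 10)
epoch_millis = int(boundary.timestamp() * 1000)
print(boundary.strftime("%m-%Y"), epoch_millis)
```

The boundary corresponds to the to and from values of the two buckets in the response.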
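Histogram aggregation
Histogram aggregation. Based on a [numeric] field of the documents, it builds buckets dynamically at fixed intervals computed from the values. It is a multi-bucket aggregation.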
POST employee/_doc/_search
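{
"size": 0,
"aggs": {
"prices": {
"histogram": {
"field": "sal", // field to aggregate on; must be numeric
"interval": 50, // bucket width
"min_doc_count": 1, // minimum document count: only buckets with at least this many documents are returned
"extended_bounds": { // extends the range of buckets; only takes effect when min_doc_count is 0
"min": 0,
"max": 500
},
"order": {
"_count": "desc" // bucket ordering; if the histogram has a metric aggregation as a direct sub-aggregation, the buckets can be ordered by that sub-aggregation's result
},
"keyed": true, // return the buckets as a hash keyed by bucket key; by default they are returned as an array
"missing": 0 // value to use for documents that are missing the field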
}
}
}
}
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 16,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"prices": {
"buckets": {
"23000.0": {
"key": 23000,
"doc_count": 2
},
"2000.0": {
"key": 2000,
"doc_count": 1
},
"3000.0": {
"key": 3000,
"doc_count": 1
},
"4500.0": {
"key": 4500,
"doc_count": 1
},
"6000.0": {
"key": 6000,
"doc_count": 1
},
"7000.0": {
"key": 7000,
"doc_count": 1
},
"8000.0": {
"key": 8000,
"doc_count": 1
},
"12000.0": {
"key": 12000,
"doc_count": 1
},
"14000.0": {
"key": 14000,
"doc_count": 1
},
"15000.0": {
"key": 15000,
"doc_count": 1
},
"16000.0": {
"key": 16000,
"doc_count": 1
},
"18000.0": {
"key": 18000,
"doc_count": 1
},
"19000.0": {
"key": 19000,
"doc_count": 1
},
"20000.0": {
"key": 20000,
"doc_count": 1
},
"22000.0": {
"key": 22000,
"doc_count": 1
}
}
}
}
}
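Each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval. The Python sketch below applies that rule to the sample salaries; the function name is invented, and buckets are returned in key order rather than in the count order requested above. With interval 50 every salary is already a multiple of the interval, which is why the response shows one bucket per distinct salary.

```python
import math
from collections import Counter

# sal values of the 16 sample employees, in _id order.
SALARIES = [8000, 18000, 12000, 15000, 16000, 20000, 7000, 14000,
            19000, 22000, 23000, 2000, 3000, 6000, 4500, 23000]


def histogram(values, interval, min_doc_count=1):
    """Count values per bucket, where bucket key = floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return {key: n for key, n in sorted(counts.items()) if n >= min_doc_count}


fine = histogram(SALARIES, 50)
coarse = histogram(SALARIES, 5000)
print(len(fine), fine[23000])
print(coarse)
```

A wider interval such as 5000 groups several salaries into each bucket, which is usually closer to what a histogram is meant to show.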