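Elastic Stack Notes (7): Elasticsearch 5.6 Aggregation Analysis
Blog: http://www.moonxy.com
1. Introduction
Elasticsearch is a distributed full-text search engine whose core functions are indexing and search. Its aggregations feature is also very powerful and supports complex statistical analysis over the data. Elasticsearch provides four families of aggregations: metric, bucket, pipeline, and matrix. The first two, metric and bucket aggregations, are the ones that most need to be mastered.
Official documentation for aggregations: Aggregations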
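2. Aggregation Analysis
2.1 Metric Aggregations
Official documentation for metric aggregations: Metric
Metric aggregations include the following: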
- Avg Aggregation
- Cardinality Aggregation
- Extended Stats Aggregation
- Geo Bounds Aggregation
- Geo Centroid Aggregation
- Max Aggregation
- Min Aggregation
- Percentiles Aggregation
- Percentile Ranks Aggregation
- Scripted Metric Aggregation
- Stats Aggregation
- Sum Aggregation
- Top Hits Aggregation
- Value Count Aggregation
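The most commonly used metric aggregations are min, max, sum, avg, stats, extended_stats, and value_count.
The official documentation describes this family as "Aggregations that keep track and compute metrics over a set of documents."
The following sections walk through several of them, starting with the max aggregation: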
Max Aggregation
Official documentation for the max aggregation: Max Aggregation
The max aggregation returns the maximum value of a field, much like the max() aggregate function in SQL. Here "max_price" is a user-defined name for the aggregation.
# Max Aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "max_price": {
      "max": { "field": "price" }
    }
  }
}
The response is as follows:
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "max_price": { "value": 81.4 }
  }
}
Cardinality Aggregation
Official documentation for the cardinality aggregation: Cardinality Aggregation
The Cardinality Aggregation counts distinct values. It behaves like a SQL DISTINCT followed by a count: duplicate values are removed from the set, and the size of the deduplicated set is returned.
# Cardinality Aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "all_language": {
      "cardinality": { "field": "language" }
    }
  }
}
The response is as follows:
{
  "took": 41,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "all_language": { "value": 3 }
  }
}
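The deduplicate-then-count idea can be sketched locally in Python. This is only an illustration of the semantics: the sample values are an assumption chosen to match the terms response shown later in this post, and Elasticsearch itself computes an approximate count (based on HyperLogLog++) rather than building an exact set.

```python
# Hypothetical sample: language values chosen to match the terms response
# shown later in this post (2 java, 2 python, 1 javascript).
languages = ["java", "java", "python", "python", "javascript"]


def cardinality(values):
    """Exact distinct count: deduplicate, then count what is left."""
    return len(set(values))


distinct_languages = cardinality(languages)
print(distinct_languages)
```

On small data like this the exact count and the approximate count agree; on high-cardinality fields the Elasticsearch result may deviate slightly.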
Stats Aggregation
Official documentation for the stats aggregation: Stats Aggregation
The Stats Aggregation computes basic statistics and returns five metrics in a single request: count, max, min, avg, and sum. For example:
# Stats Aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "stats_pirce": {
      "stats": { "field": "price" }
    }
  }
}
The response is as follows:
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319
    }
  }
}
Extended Stats Aggregation
Official documentation for the extended stats aggregation: Extended Stats Aggregation
The Extended Stats Aggregation works like the Stats Aggregation but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus or minus two standard deviations.
# Extended Stats Aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "extend_stats_pirce": {
      "extended_stats": { "field": "price" }
    }
  }
}
The response is as follows:
{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "extend_stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319,
      "sum_of_squares": 21095.46,
      "variance": 148.65199999999967,
      "std_deviation": 12.19229264740638,
      "std_deviation_bounds": {
        "upper": 88.18458529481276,
        "lower": 39.41541470518724
      }
    }
  }
}
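To see how the extra fields relate to the basic ones, the following Python sketch recomputes them by hand. The five prices are an assumption: the post never lists the documents, so these values were inferred to be consistent with the responses shown in this post. The variance here is the population variance (mean of squares minus squared mean), which matches the numbers in the response above up to floating-point rounding.

```python
import math

# Assumed sample prices, inferred to reproduce the responses in this post;
# the actual documents in the books index are not shown.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]


def extended_stats(values, sigma=2.0):
    """Recompute the extended_stats fields for a list of numbers."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of squares minus square of the mean.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats(prices)
for name, value in stats.items():
    print(name, value)
```

The `sigma` argument mirrors the two-standard-deviation interval described above; the printed values should agree with the response to within rounding error.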
Value Count Aggregation
Official documentation for the value count aggregation: Value Count Aggregation
The Value Count Aggregation counts the number of values of a given field across the matching documents.
# Value Count Aggregation
GET books/_search
{
  "size": 0,
  "aggs": {
    "doc_count": {
      "value_count": { "field": "author" }
    }
  }
}
The response is as follows:
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "doc_count": { "value": 5 }
  }
}
Note:
By default, text fields cannot be used for sorting or aggregations, because fielddata is disabled on them; use a keyword field instead, or enable fielddata at the cost of memory. For example, aggregating on the title field, which is mapped as text, fails:
GET books/_search
{
  "size": 0,
  "aggs": {
    "doc_count": {
      "value_count": { "field": "title" }
    }
  }
}
The response is as follows:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "books",
        "node": "6n3douACShiPmlA9j2soBw",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}
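One common way to follow the error message's advice ("use a keyword field instead") is a multi-field mapping. The Python sketch below only builds the JSON bodies; it does not contact a cluster. The sub-field name `keyword` and the idea that the books index could be remapped this way are assumptions for illustration, not something this post's index is known to have.

```python
import json

# Assumed mapping fragment: "title" stays full-text searchable, while a
# hypothetical "keyword" sub-field stores the exact value for aggregations.
title_properties = {
    "title": {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword"}
        },
    }
}

# The aggregation then targets the sub-field instead of the text field.
value_count_request = {
    "size": 0,
    "aggs": {
        "doc_count": {
            "value_count": {"field": "title.keyword"}
        }
    },
}

print(json.dumps(title_properties, indent=2))
print(json.dumps(value_count_request, indent=2))
```

Enabling fielddata on the text field is the other option named in the error, but as the message warns, it can use significant memory.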
2.2 Bucket Aggregations
Official documentation for bucket aggregations: Bucket Aggregations
Bucket aggregations include the following:
- Adjacency Matrix Aggregation
- Children Aggregation
- Composite Aggregation
- Date Histogram Aggregation
- Date Range Aggregation
- Diversified Sampler Aggregation
- Filter Aggregation
- Filters Aggregation
- Geo Distance Aggregation
- GeoHash grid Aggregation
- Global Aggregation
- Histogram Aggregation
- IP Range Aggregation
- Missing Aggregation
- Nested Aggregation
- Range Aggregation
- Reverse nested Aggregation
- Sampler Aggregation
- Significant Terms Aggregation
- Significant Text Aggregation
- Terms Aggregation
A bucket can be thought of as a container: the aggregation walks over the documents and places each one that meets a given criterion into the corresponding bucket. Bucketing is comparable to GROUP BY in SQL.
The terms aggregation groups documents by field value. The following counts the books for each programming language:
GET books/_search
{
  "size": 0,
  "aggs": {
    "terms_count": {
      "terms": { "field": "language" }
    }
  }
}
The response is as follows:
{
  "took": 31,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "java", "doc_count": 2 },
        { "key": "python", "doc_count": 2 },
        { "key": "javascript", "doc_count": 1 }
      ]
    }
  }
}
On top of the terms buckets, a metric aggregation can be computed for each bucket. For example, to get the average price of each category of book, first run a Terms Aggregation on the language field and then nest an Avg Aggregation inside it:
GET books/_search
{
  "size": 0,
  "aggs": {
    "terms_count": {
      "terms": { "field": "language" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
The response is as follows:
{
  "took": 8,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "java", "doc_count": 2, "avg_price": { "value": 58.35 } },
        { "key": "python", "doc_count": 2, "avg_price": { "value": 67.95 } },
        { "key": "javascript", "doc_count": 1, "avg_price": { "value": 66.4 } }
      ]
    }
  }
}
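The bucket-then-metric pattern above is essentially a group-by followed by a per-group average. A minimal Python sketch of that logic follows. The documents are an assumption: language and price pairs inferred to be consistent with the responses in this post, since the actual books are not listed. Sorting by descending count with ties broken by key is chosen here because it reproduces the bucket order in the response.

```python
from collections import defaultdict

# Assumed documents, inferred to reproduce the responses in this post.
books = [
    {"language": "java", "price": 46.5},
    {"language": "java", "price": 70.2},
    {"language": "python", "price": 81.4},
    {"language": "python", "price": 54.5},
    {"language": "javascript", "price": 66.4},
]


def terms_with_avg(docs, term_field, metric_field):
    """Group docs by term_field, then average metric_field per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[term_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "avg_price": {"value": sum(values) / len(values)},
        }
        for key, values in groups.items()
    ]
    # Largest buckets first; ties are ordered by key.
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets


buckets = terms_with_avg(books, "language", "price")
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"], round(bucket["avg_price"]["value"], 2))
```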
Range Aggregation
The Range Aggregation buckets documents by value range and is useful for showing how data is distributed. For example, to group the books in the books index by price into the ranges below 50, 50 to 80, and 80 and above:
GET books/_search
{
  "size": 0,
  "aggs": {
    "price_range": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 80 },
          { "from": 80 }
        ]
      }
    }
  }
}
The response is as follows:
{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "price_range": {
      "buckets": [
        { "key": "*-50.0", "to": 50, "doc_count": 1 },
        { "key": "50.0-80.0", "from": 50, "to": 80, "doc_count": 3 },
        { "key": "80.0-*", "from": 80, "doc_count": 1 }
      ]
    }
  }
}
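A detail worth making explicit is how the boundaries are treated: in a range aggregation the from value is included and the to value is excluded, so a price of exactly 50 would land in the 50.0-80.0 bucket. The Python sketch below mimics that rule. The prices are assumed sample values inferred to be consistent with the responses in this post.

```python
# Assumed sample prices, inferred to reproduce the responses in this post.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]

ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def range_buckets(values, range_specs):
    """Count values per range; 'from' is inclusive, 'to' is exclusive."""
    buckets = []
    for spec in range_specs:
        low = spec.get("from")
        high = spec.get("to")
        count = sum(
            1
            for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        # Build a key in the same style as the response, e.g. "50.0-80.0".
        low_label = "*" if low is None else str(float(low))
        high_label = "*" if high is None else str(float(high))
        buckets.append({"key": f"{low_label}-{high_label}", "doc_count": count})
    return buckets


price_buckets = range_buckets(prices, ranges)
for bucket in price_buckets:
    print(bucket["key"], bucket["doc_count"])
```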
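The range aggregation works not only on numeric fields but also on date fields. The Date Range Aggregation is dedicated to range aggregation on dates; it differs from the Range Aggregation in that the from and to values can be written as date math expressions.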
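As a rough illustration of date math in a Date Range Aggregation, the Python sketch below builds a request body and prints it as JSON. The field name `publish_time` and the aggregation name `publish_range` are hypothetical (this post does not show a date field in the books index); `now-2y/y` means "two years ago, rounded down to the start of the year".

```python
import json

# Hypothetical request body: "publish_time" is an assumed date field.
date_range_request = {
    "size": 0,
    "aggs": {
        "publish_range": {
            "date_range": {
                "field": "publish_time",
                "format": "yyyy-MM-dd",
                "ranges": [
                    # Everything before the start of the year two years ago.
                    {"to": "now-2y/y"},
                    # Everything from that point onward.
                    {"from": "now-2y/y"},
                ],
            }
        }
    },
}

print(json.dumps(date_range_request, indent=2))
```

The printed JSON could be pasted after a `GET <index>/_search` line in the console, with the field name adjusted to a real date field.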
2.3 Pipeline Aggregations
Official documentation for pipeline aggregations: Pipeline Aggregations
- Avg Bucket Aggregation
- Derivative Aggregation
- Max Bucket Aggregation
- Min Bucket Aggregation
- Sum Bucket Aggregation
- Stats Bucket Aggregation
- Extended Stats Bucket Aggregation
- Percentiles Bucket Aggregation
- Moving Average Aggregation
- Cumulative Sum Aggregation
- Bucket Script Aggregation
- Bucket Selector Aggregation
- Bucket Sort Aggregation
- Serial Differencing Aggregation
Pipeline aggregations operate on the output of other aggregations rather than on documents.
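To make "operating on the output of other aggregations" concrete, here is a sketch using Avg Bucket. The first part builds a request body in which an `avg_bucket` aggregation sits next to the terms aggregation from section 2.2 and points at it through `buckets_path`; the name `avg_of_bucket_avgs` is made up for this example. The second part reproduces the calculation locally from the per-bucket averages already shown in the terms response, so no cluster is needed.

```python
import json

# Request body sketch: avg_bucket is a sibling of terms_count and reads
# the avg_price metric of each of its buckets.
pipeline_request = {
    "size": 0,
    "aggs": {
        "terms_count": {
            "terms": {"field": "language"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        },
        "avg_of_bucket_avgs": {  # hypothetical aggregation name
            "avg_bucket": {"buckets_path": "terms_count>avg_price"}
        },
    },
}

# Buckets copied from the terms + avg response in section 2.2.
buckets = [
    {"key": "java", "doc_count": 2, "avg_price": {"value": 58.35}},
    {"key": "python", "doc_count": 2, "avg_price": {"value": 67.95}},
    {"key": "javascript", "doc_count": 1, "avg_price": {"value": 66.4}},
]


def avg_bucket(bucket_list, metric_name):
    """Average one metric across buckets; the input is buckets, not documents."""
    values = [b[metric_name]["value"] for b in bucket_list]
    return sum(values) / len(values)


avg_of_avgs = avg_bucket(buckets, "avg_price")
print(json.dumps(pipeline_request, indent=2))
print(round(avg_of_avgs, 4))
```

Because each bucket counts once regardless of its size, this average of averages differs from the overall average price of 63.8 reported by the stats aggregation.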
2.4 Matrix Aggregations
Official documentation for matrix aggregations: Matrix Aggregations
- Matrix Stats
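The Matrix Stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:
- Count: the number of samples per field included in the calculation;
- Mean: the average value of each field;
- Variance: how far the samples of each field spread from the mean;
- Skewness: a measure of the asymmetry of each field's distribution around the mean;
- Kurtosis: a measure of the shape of each field's distribution;
- Covariance: a matrix that quantifies how changes in one field are associated with changes in another;
- Correlation: the covariance matrix scaled to the range [-1, 1], describing the relationship between the distributions of two fields.
It is mainly used to examine the relationship between two numeric fields, for example between log record size and HTTP status code:
GET /_search
{
  "aggs": {
    "statistics": {
      "matrix_stats": {
        "fields": ["log_size", "status_code"]
      }
    }
  }
}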
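The covariance and correlation entries can be illustrated with a small Python sketch. The sample values are entirely hypothetical, and the covariance here divides by n (the population formula); whether Elasticsearch normalizes the same way is not verified here. The correlation is unaffected by that choice because the normalization cancels out.

```python
import math

# Hypothetical samples; not taken from any real log index.
fields = {
    "log_size": [512.0, 1024.0, 2048.0, 4096.0],
    "status_code": [200.0, 200.0, 404.0, 500.0],
}


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance scaled by both standard deviations, giving a value in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Build both matrices keyed by field name, as matrix_stats reports them per field pair.
cov_matrix = {a: {b: covariance(fields[a], fields[b]) for b in fields} for a in fields}
corr_matrix = {a: {b: correlation(fields[a], fields[b]) for b in fields} for a in fields}

for name, row in corr_matrix.items():
    print(name, {other: round(value, 4) for other, value in row.items()})
```

The diagonal of the correlation matrix is 1 (each field correlates perfectly with itself), and the matrix is symmetric.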