Elasticsearch Usage Summary

阿新 • Published: 2018-12-22

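I first encountered Elasticsearch while building an ELK logging system. As we consumed more and more log data, we were drawn to its powerful search and analytics capabilities. Later, we used Elasticsearch as the core data store and real-time aggregation engine in a user-behavior data collection system, and after that we built our product's search service on it. So far, Elasticsearch has proven flexible and performant in all three systems and has not let us down. Along the way we also ran into plenty of problems, ranging from confusion over basic concepts to operations, deployment, and performance. This article summarizes how Elasticsearch is used at the application level, focusing on two questions: WHAT it is and HOW to use it.
NOTE: The concepts and methods described in this article apply to Elasticsearch 2.3.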

Basic Concepts

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

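This is how Elasticsearch is officially positioned. Put simply, Elasticsearch is a document-oriented NoSQL database that uses JSON as its document serialization format. What sets it apart is that it uses Lucene at its core to implement all indexing and search functionality, so the content of every document can be indexed, searched, sorted, and filtered. It also offers rich aggregations for multi-dimensional analysis of data. All external communication goes through a uniform REST API, that is, clients talk to the server over HTTP.
Let's start with the basic storage concepts. The table below compares them with their MySQL counterparts to make the meaning of each concept clearer.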

Elasticsearch	MySQL
index (noun)	database
doc type	table
document	row
field	column
mapping	schema
query DSL	SQL

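Next, consider the inverted index (see the official documentation). The inverted index is the cornerstone of search engines and the fundamental reason Elasticsearch can perform fast full-text search. In short, it involves two steps applied to a document's content: tokenizing it, and building a "term-to-document" list. For example, suppose we have the following two documents: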

1. {"content": "The quick brown fox jumped over the lazy dog"}
2. {"content": "Quick brown foxes leap over lazy dogs in summer"}

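Elasticsearch uses an analyzer to split the content field into terms, and then builds the list shown below based on whether each term appears in each document; √ means the term appears in that document. To search for "quick brown", we only need to find which documents each term appears in. If several documents match, they can be scored by how well they match, so that the most relevant ones are found. Note that the table is a simplified illustration that splits on whitespace only, so Quick and quick are separate entries; the standard analyzer would also lowercase the terms.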

Term	Doc_1	Doc_2
Quick		√
The	√
brown	√	√
dog	√
dogs		√
fox	√
foxes		√
in		√
jumped	√
lazy	√	√
leap		√
over	√	√
quick	√
summer		√
the	√
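
To make the two steps concrete, here is a minimal Python sketch that rebuilds the table above and runs the "quick brown" search against it. It assumes naive whitespace tokenization with no lowercasing or stemming, and the function names build_inverted_index and search are made up for this illustration; the scoring (number of matching terms) is a stand-in, not Elasticsearch's actual relevance formula.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace tokenization, with no lowercasing or stemming.
        for term in text.split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return doc ids ordered by how many query terms they contain."""
    hits = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Most matching terms first; ties broken by doc id.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


index = build_inverted_index(docs)
print(sorted(index["brown"]))        # documents containing "brown"
print(search(index, "quick brown"))  # best match first
```

Doc_1 contains both terms and Doc_2 only "brown", so Doc_1 ranks first in this toy scoring.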

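Finally, let's return to the concept of mapping mentioned above. Just as MySQL declares each column's data type, index type, and so on in the database schema, Elasticsearch uses a mapping for this purpose. Typically, a mapping declares each field's data type, whether an inverted index is built for it, and which analyzer is used when building that index. By default, Elasticsearch builds an inverted index for every string field using the standard analyzer.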

View the mapping: GET http://localhost:9200/<index name>/_mapping
NOTE: Here the index is blog and the doc type is test.
{
	"blog": {
		"mappings": {
			"test": {
				"properties": {
					"activity_type": {
						"type": "string",
						"index": "not_analyzed"
					},
					"address": {
						"type": "string",
						"analyzer": "ik_smart"
					},
					"happy_party_id": {
						"type": "integer"
					},
					"last_update_time": {
						"type": "date",
						"format": "yyyy-MM-dd HH:mm:ss"
					}
				}
			}
		}
	}
}
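
A mapping like the one above can also be declared up front when the index is created. The Python sketch below only assembles such a request (method, URL, and JSON body) without sending it; build_create_index_request is a hypothetical helper, the host is the local default used throughout this article, and the ik_smart analyzer assumes the IK analysis plugin is installed.

```python
import json


def build_create_index_request(host, index, doc_type, properties):
    """Return (method, url, body) for creating an index with an explicit mapping."""
    body = {"mappings": {doc_type: {"properties": properties}}}
    return "PUT", "{}/{}".format(host, index), json.dumps(body, indent=2)


# Field definitions copied from the mapping shown above.
properties = {
    "activity_type": {"type": "string", "index": "not_analyzed"},
    "address": {"type": "string", "analyzer": "ik_smart"},  # needs the IK plugin
    "happy_party_id": {"type": "integer"},
    "last_update_time": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
}

method, url, body = build_create_index_request(
    "http://localhost:9200", "blog", "test", properties
)
print(method, url)
print(body)
```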

Inserting Data

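In MySQL, we must create the database and tables and declare the schema before inserting data. In Elasticsearch, we can insert data directly: the system automatically creates any missing index and doc type and builds a mapping for the fields. The data structure of semi-structured data usually changes dynamically, and we cannot know in advance which fields a document will contain. If we had to create the index, type, and mapping before every insert, Elasticsearch would lose its advantage as a NoSQL store.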

Insert data directly: POST http://localhost:9200/blog/test
{
	"count": 5,
	"desc": "hello world"
}

View the mapping: GET http://localhost:9200/blog/_mapping
{
	"blog": {
		"mappings": {
			"test": {
				"properties": {
					"count": {
						"type": "long"
					},
					"desc": {
						"type": "string"
					}
				}
			}
		}
	}
}
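
The mapping Elasticsearch derived here follows from the JSON types of the inserted values. The Python sketch below mimics that inference for the simple cases only; it is a rough approximation, assuming 2.x type names, and it ignores features such as date detection on strings and nested objects. The helper names are invented for this example.

```python
def infer_field_type(value):
    """Guess the 2.x field type that dynamic mapping would pick for a JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError("unsupported value in this sketch: {!r}".format(value))


def infer_properties(document):
    """Build the 'properties' section of a mapping for a flat document."""
    return {
        field: {"type": infer_field_type(value)}
        for field, value in document.items()
    }


print(infer_properties({"count": 5, "desc": "hello world"}))
```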

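This flexibility has limits, however. As mentioned above, by default Elasticsearch builds an inverted index for every string field using the standard analyzer. What if we do not want some fields to be analyzed? Elasticsearch provides index templates, which set a default mapping for a group of indices and can contain dynamic templates for mapping new fields: whenever the name of a newly created index matches the template's pattern, the mapping defined in that template is applied.
The example below creates a template named blog that automatically matches indices whose names start with "blog_" and creates their mapping. For every string field in a document, it adds a .raw sub-field that is indexed but not analyzed (not_analyzed), so the original value is kept as a single term. This is also what ELK does: if you look at the Elasticsearch templates in an ELK system, you will find one named logstash.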

Create the template: POST http://localhost:9200/_template/blog
{
	"template": "blog_*",
	"mappings": {
		"_default_": {
			"dynamic_templates": [{
				"string_fields": {
					"mapping": {
						"type": "string",
						"fields": {
							"raw": {
								"index": "not_analyzed",
								"ignore_above": 256,
								"type": "string"
							}
						}
					},
					"match_mapping_type": "string"
				}
			}],
			"properties": {
				"timestamp": {
					"doc_values": true,
					"type": "date"
				}
			},
			"_all": {
				"enabled": false
			}
		}
	}
}

Insert data directly: POST http://localhost:9200/blog_2016-12-25/test
{
	"count": 5,
	"desc": "hello world"
}
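
Whether the template applies depends only on the index name. As a rough illustration, the Python sketch below checks names against the "blog_*" pattern using the standard library's fnmatch; this is an approximation of the wildcard matching (fnmatch also understands ? and [] patterns, which are not needed here), and template_applies is a made-up helper.

```python
from fnmatch import fnmatchcase

TEMPLATE_PATTERN = "blog_*"  # the "template" value from the request above


def template_applies(index_name, pattern=TEMPLATE_PATTERN):
    """Return True if an index with this name would pick up the template."""
    return fnmatchcase(index_name, pattern)


for name in ("blog_2016-12-25", "blog", "user_action"):
    print(name, template_applies(name))
```

Note that the plain blog index used earlier does not match the pattern, so it keeps the default dynamic mapping.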

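Another topic related to insertion is bulk insertion. Elasticsearch provides the bulk API for batch operations: you can freely combine the operations and data you need in a single request and send them to Elasticsearch in one go. The format is as follows.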

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

For example:
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

If all operations target the same index and doc type, you can specify the index and type in the REST API URL instead. An example of bulk insertion:

Bulk insert: POST http://localhost:9200/blog_2016-12-24/test/_bulk
{"index": {}}
{"count": 5, "desc": "hello world 111"}
{"index": {}}
{"count": 6, "desc": "hello world 222"}
{"index": {}}
{"count": 7, "desc": "hello world 333"}
{"index": {}}
{"count": 8, "desc": "hello world 444"}

View the inserted results: GET http://localhost:9200/blog_2016-12-24/test/_search
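
The bulk body is newline-delimited JSON rather than a single JSON document, which is easy to get wrong when building it by hand. The Python sketch below generates the body for the bulk insert shown above; build_bulk_body is a hypothetical helper, and it only produces the string to send, it does not contact a server.

```python
import json


def build_bulk_body(docs):
    """Serialize documents as bulk 'index' actions, one JSON object per line."""
    lines = []
    for doc in docs:
        # Empty metadata: index and type are taken from the request URL.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"count": 5, "desc": "hello world 111"},
    {"count": 6, "desc": "hello world 222"},
    {"count": 7, "desc": "hello world 333"},
    {"count": 8, "desc": "hello world 444"},
]
body = build_bulk_body(docs)
print(body, end="")
```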

Querying Data

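Elasticsearch's query syntax (query DSL) has two parts, query and filter; the difference is whether results must match exactly or by relevance. A filter asks "does the field value in this document equal the given value?", and the answer is either yes or no. A query asks "how well does the field value in this document match the given value?"; it computes a relevance score for each document against the given value and uses that score to rank the matching documents.
In practice, keep two points in mind. First, exact-value filters (such as term) on strings should run against fields that are not analyzed, that is, the .raw sub-field added in the mapping above. Second, filters are usually used to narrow the scope and queries to perform the search, so the two are used together. Consider the following examples and note how the three queries differ in how they are written: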

1. Using only query:
POST http://localhost:9200/user_action/_search
The result is all documents whose page_name field contains wechat.
size specifies the number of documents to return; by default Elasticsearch returns the first 10.
{
	"query": {
		"bool": {
			"must": [{
				"match": {
					"page_name": "wechat"
				}
			},
			{
				"range": {
					"timestamp": {
						"gte": 1481218631,
						"lte": 1481258231,
						"format": "epoch_second"
					}
				}
			}]
		}
	},
	"size": 2
}

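2. Using only filter:
POST http://localhost:9200/user_action/_search
The result is all documents whose page_name field equals "example.cn/wechat/view.html".
Note: in 2.x a top-level filter in the search body is treated as a deprecated alias for post_filter, so placing the filter inside a bool query is generally preferable.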
{
	"filter": {
		"bool": {
			"must": [{
				"term": {
					"page_name.raw": "example.cn/wechat/view.html"
				}
			},
			{
				"range": {
					"timestamp": {
						"gte": 1481218631,
						"lte": 1481258231,
						"format": "epoch_second"
					}
				}
			}]
		}
	},
	"size": 2
}

3. Using query and filter together (the filter sits inside a bool query):
POST http://localhost:9200/user_action/_search
The result is all documents whose page_name field equals "example.cn/wechat/view.html".
{
	"query": {
		"bool": {
			"filter": [{
				"bool": {
					"must": [{
						"term": {
							"page_name.raw": "example.cn/wechat/view.html"
						}
					},
					{
						"range": {
							"timestamp": {
								"gte": 1481218631,
								"lte": 1481258231,
								"format": "epoch_second"
							}
						}
					}]
				}
			}]
		}
	},
	"size": 2
}
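
To show the "filter narrows, query searches" combination in a single request, here is a Python sketch that assembles a bool query with a scored match clause under must and an unscored range clause under filter. The helper build_search_body and its parameters are assumptions made for this example; the field names follow the requests above, and the body would be sent to the same user_action/_search endpoint.

```python
import json


def build_search_body(keyword, start, end, size=10):
    """Combine a scored match query with an unscored time-range filter."""
    return {
        "query": {
            "bool": {
                # Contributes to the relevance score.
                "must": [{"match": {"page_name": keyword}}],
                # Yes/no condition only; does not affect scoring.
                "filter": [{
                    "range": {
                        "timestamp": {
                            "gte": start,
                            "lte": end,
                            "format": "epoch_second",
                        }
                    }
                }],
            }
        },
        "size": size,
    }


body = build_search_body("wechat", 1481218631, 1481258231, size=2)
print(json.dumps(body, indent=2))
```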

Aggregations

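Just as aggregation in MySQL consists of grouping and aggregate computation, aggregation in Elasticsearch has two parts: Buckets and Metrics. As a noun, a bucket can be thought of as a container: the aggregated data is placed into individual buckets, which means the data is divided into groups, one bucket per group. As a verb, it corresponds to GROUP BY in SQL. Metrics are measurements: a computation performed over the data in a bucket, equivalent to calling aggregate functions such as COUNT, SUM, and MAX in SQL. Sometimes an analysis needs to group by several fields. In MySQL, "group by c1, c2, c3" is enough, but Elasticsearch has no such syntax. Instead it offers nested buckets, a design that arguably matches how people think about grouping.
The following examples show how this works: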

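1. The simplest aggregation
POST http://localhost:9200/user_action/_search
For simplicity, the query conditions are omitted here.
Matching documents are aggregated by company.
There are two size parameters here: size=0, at the same level as aggs, means the response contains only the aggregation results and no search hits; the size inside terms is the number of buckets to return.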
{
	"aggs": {
		"company_terms": {
			"terms": {
				"field": "company",
				"size": 2
			}
		}
	},
	"size": 0
}

2. Combining Buckets and Metrics
POST http://localhost:9200/user_action/_search
Matching documents are aggregated by company, and the time of the most recent action is retrieved for each company.
{
	"aggs": {
		"company_terms": {
			"terms": {
				"field": "company",
				"size": 2
			},
			"aggs": {
				"latest_record": {
					"max": {
						"field": "timestamp"
					}
				}
			}
		}
	},
	"size": 0
}

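3. Nested Buckets
POST http://localhost:9200/user_action/_search
Matching documents are first aggregated by company, then by store within each company, and the time of the most recent action is retrieved for each store.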
{
	"aggs": {
		"company_terms": {
			"terms": {
				"field": "company",
				"size": 1
			},
			"aggs": {
				"store_terms": {
					"terms": {
						"field": "store",
						"size": 2
					},
					"aggs": {
						"latest_record": {
							"max": {
								"field": "timestamp"
							}
						}
					}
				}
			}
		}
	},
	"size": 0
}
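
Since nesting is the counterpart of "group by c1, c2, c3", the nested structure can be generated from a list of fields. The Python sketch below does this by wrapping the metric in one terms aggregation per field, innermost field last; build_nested_terms and the "<field>_terms" naming are conventions invented for this example, and a single size is applied to every level for brevity.

```python
import json


def build_nested_terms(fields, metric_name, metric, size=10):
    """Nest one terms aggregation per field, with the metric at the innermost level."""
    aggs = {metric_name: metric}
    # Wrap from the inside out so the first field ends up outermost.
    for field in reversed(fields):
        aggs = {
            field + "_terms": {
                "terms": {"field": field, "size": size},
                "aggs": aggs,
            }
        }
    return aggs


body = {
    "aggs": build_nested_terms(
        ["company", "store"],
        "latest_record",
        {"max": {"field": "timestamp"}},
        size=2,
    ),
    "size": 0,
}
print(json.dumps(body, indent=2))
```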

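In the examples above, we used terms for the bucket aggregation, which groups documents by the value of a field. Note that this field must not be analyzed: use a .raw sub-field as shown above, or a non-string field. Elasticsearch provides other interesting bucket aggregations, such as the commonly used date_histogram, which aggregates a time field at a given interval to find the number of documents in each period. This is very helpful for time-series statistics. For example: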

POST http://localhost:9200/user_action/_search
{
    "aggs": {
        "time_histogram": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "day"
            }
        }
    },
    "size": 0
}
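
To illustrate what a day-interval histogram computes, the Python sketch below performs the same bucketing locally on a few epoch-second timestamps. It is only a model of the result, not how Elasticsearch implements it; it assumes UTC day boundaries, and the sample timestamps and the day_histogram helper are made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def day_histogram(epoch_seconds):
    """Count timestamps per UTC calendar day, ordered by day."""
    counts = Counter()
    for ts in epoch_seconds:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        counts[day] += 1
    return dict(sorted(counts.items()))


# Two events on 2016-12-08 and one on 2016-12-09 (UTC).
histogram = day_histogram([1481218631, 1481222231, 1481258231])
print(histogram)
```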
