ELK: An Introduction to Elasticsearch and Its Use in Search

ELK: Elasticsearch + Logstash + Kibana

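Elasticsearch: a distributed, real-time full-text search and analytics engine; fast retrieval is its defining feature.

Logstash: a data collector with a rich set of plugins (input/filter/output).

Kibana: a web-based visualization platform built on Elasticsearch.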

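Basic Elasticsearch concepts

Near real time (Near Realtime): there is a delay of roughly 1 second between a doc being indexed and it becoming searchable.

index: similar to a database in SQL, or a database in NoSQL.

type: an index can contain multiple types; a type is similar to a table in SQL, or a collection in NoSQL.

document: a single record under an index/type, in JSON format. Similar to a row in SQL, or a document in NoSQL.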

MySQL	database	table	row
mongodb	database	collection	document
ES	index	type	document

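shard: an index can have multiple shards; the default is 5. The purpose is to make horizontal scaling easier and to spread the load. Once an index has been created, the number of shards cannot be changed. Each shard is a Lucene index. The maximum number of documents a single Lucene index can hold is 2,147,483,519 (= Integer.MAX_VALUE - 128).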
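As a quick check of the arithmetic above, the per-shard limit and the resulting ceiling for a whole index can be computed directly. This is only a sketch: the helper name max_docs_for_index is made up for illustration, and it assumes the limit applies independently to each shard because each shard is its own Lucene index.

```python
# Java's Integer.MAX_VALUE is 2**31 - 1; the Lucene limit is 128 below it.
INTEGER_MAX_VALUE = 2**31 - 1
LUCENE_MAX_DOCS = INTEGER_MAX_VALUE - 128


def max_docs_for_index(number_of_shards=5):
    """Upper bound on documents for one index (hypothetical helper).

    Each shard is one Lucene index, so the bound scales with the shard count.
    """
    return number_of_shards * LUCENE_MAX_DOCS


print(LUCENE_MAX_DOCS)        # 2147483519
print(max_docs_for_index(5))  # 10737417595
```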

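replica: the number of copies of an index; the default is 1. The purpose is to improve reliability and speed up searches. The number of replicas can be changed at any time. e.g. http://10.16.25.16:9000/_plugin/head/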

node:

cluster: one or more nodes with same cluster name

Schemaless or not

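In a sense, ES is schemaless: when indexing a doc you simply specify the index/type, with no prior configuration of the index/type required. In practice, when the index/type is created, ES guesses the fields in the JSON and automatically generates a mapping (a mapping defines the type of each field in a doc and describes how it is indexed).

Once a mapping exists, the index effectively has a schema, and the types of fields that already exist in the doc can no longer be changed freely.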
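To make the "guessing" step more concrete, here is a rough Python stand-in for it. It is a simplified illustration, not ES's actual rules: the function names are invented, only top-level fields are inspected, and features such as date detection are ignored. The type names (string, long, double, boolean, object) are assumed to match the ES versions this article describes.

```python
import json


def guess_field_type(value):
    """Very rough stand-in for dynamic mapping; not ES's actual rules."""
    # bool must be tested before int: in Python, True is also an int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first element; empty means unknown.
        return guess_field_type(value[0]) if value else None
    if value is None:
        return None  # nothing to guess from yet
    return "string"


def guess_mapping(json_doc):
    """Build a {field: type} guess from one JSON document."""
    doc = json.loads(json_doc)
    return {field: guess_field_type(value) for field, value in doc.items()}


mapping = guess_mapping('{"name": "Smith", "age": 26, "score": 1.5, "public": true}')
print(mapping)
# {'name': 'string', 'age': 'long', 'score': 'double', 'public': 'boolean'}
```

Once such a guess has been recorded, a later doc that puts text into age would conflict with it, which is the sense in which the index stops being schemaless.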

template and mapping

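mapping: describes the type of each field in a doc, and how it is stored and indexed. A mapping belongs to an index, and different types have different mappings. Once a field type in a mapping has been created it cannot be modified, but new fields can be added.

template: supplies the index settings and the required mappings at the moment an index is created. A template exists before the index and only takes effect at index creation; once the index has been created, modifying the template has no effect on it.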
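A template body combines the two pieces just described, settings and mappings. The sketch below assembles one as a plain Python dict. The helper name, the logs-* pattern, and the field names are all hypothetical, and the top-level "template" key for the index name pattern is an assumption based on the ES versions of this article's era.

```python
import json


def build_index_template(pattern, shards, replicas, type_name, properties):
    """Assemble a template body (hypothetical helper; layout is an assumption)."""
    return {
        "template": pattern,  # index name pattern the template applies to
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {type_name: {"properties": properties}},
    }


body = build_index_template(
    pattern="logs-*",
    shards=5,
    replicas=1,
    type_name="my_type",
    properties={"title": {"type": "string"}, "age": {"type": "long"}},
)
print(json.dumps(body, indent=2))
```

Because the template is only consulted when an index is created, changing body later would affect new indices matching the pattern, not existing ones.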

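The three basic ES questions: writing data

ES uses two ports, 9200 and 9300 (the defaults).

9200 handles HTTP requests.

e.g. curl -XPOST http://esnode:9200/index/type -d '{json_doc}'

9300 is a TCP port used for communication between ES nodes.

Clients that write data to ES fall into two groups:

the first is Java, the second is everything else.

Starting with the others (python/go/... etc.): they all use the HTTP REST API, i.e. they talk to port 9200.

The Java client, in turn, comes in two kinds depending on its role.

First, the node client: the Java client joins the ES cluster as a node, but stores no data.

Second, the transport client: the Java client does not join the ES cluster and only forwards data to nodes in the cluster. Both Java clients use port 9300 and speak the native Elasticsearch transport protocol.

The Java client's version must match the version running on the Elasticsearch nodes; otherwise they may fail to understand each other.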
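For the non-Java route, a write is just an HTTP request to port 9200. The following sketch prepares such a request with the Python standard library without sending it. The host name esnode and the index/type path are the placeholders from the example above; build_index_request is an invented helper, and using POST so that ES assigns the doc ID is an assumption of the sketch.

```python
import json
import urllib.request


def build_index_request(host, index, doc_type, doc, port=9200):
    """Prepare (but do not send) an HTTP request that indexes one doc."""
    url = "http://{}:{}/{}/{}".format(host, port, index, doc_type)
    data = json.dumps(doc).encode("utf-8")
    # POST without an ID; the assumption is that ES assigns the doc ID.
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_index_request("esnode", "index", "type", {"last_name": "Smith"})
print(req.get_method(), req.full_url)
# To actually send it against a reachable node: urllib.request.urlopen(req)
```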

The three basic ES questions: reading data

Fetching directly by doc ID

e.g. curl -XGET xxxx/yyyy/AVQvYyK6aK8LxcWQ324f

and the response:

{
  "_index": "xxxx",
  "_type": "yyyy",
  "_id": "AVQvYyK6aK8LxcWQ324f",
  "_version": 1,
  "found": true,
  "_source": {json_doc}
}
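On the client side, the interesting parts of that response are found and _source. A minimal sketch of handling it, assuming the response shape shown above; the helper name and the sample _source content are made up:

```python
import json


def extract_source(response_text):
    """Return the _source of a get-by-ID response, or None if not found."""
    response = json.loads(response_text)
    if not response.get("found", False):
        return None
    return response["_source"]


# Sample response shaped like the one above; the _source content is made up.
sample = """
{"_index": "xxxx", "_type": "yyyy", "_id": "AVQvYyK6aK8LxcWQ324f",
 "_version": 1, "found": true,
 "_source": {"last_name": "Smith"}}
"""
print(extract_source(sample))
print(extract_source('{"_index": "xxxx", "_type": "yyyy", "_id": "1", "found": false}'))
```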

Fetching via _search (size defaults to 10)

GET /index/type/_search?q=last_name:Smith

GET /index/type/_search
{
  "query": {
    "match": {
      "last_name": "Smith"
    }
  }
}

Fetching large amounts of data via _scroll, similar to a database cursor

Typically used for reindexing.
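The cursor analogy can be sketched as a loop that keeps asking for the next page until an empty one comes back. To keep the example runnable without a cluster, the page fetcher is injected: fetch_page and make_fake_fetch are inventions of this sketch, and the response shape ({"_scroll_id": ..., "hits": {"hits": [...]}}) is an assumption about what a scroll response looks like.

```python
def scroll_all(fetch_page):
    """Yield every hit by following a scroll cursor until a page is empty.

    fetch_page(scroll_id) is a caller-supplied function (an assumption of
    this sketch) returning {"_scroll_id": ..., "hits": {"hits": [...]}}.
    It is called with None to get the first page.
    """
    scroll_id = None
    while True:
        page = fetch_page(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        scroll_id = page["_scroll_id"]


def make_fake_fetch(docs, page_size):
    """In-memory stand-in for the cluster, used only to exercise scroll_all."""
    def fetch_page(scroll_id):
        start = 0 if scroll_id is None else int(scroll_id)
        batch = docs[start:start + page_size]
        return {"_scroll_id": str(start + len(batch)), "hits": {"hits": batch}}
    return fetch_page


docs = [{"_id": str(i)} for i in range(5)]
ids = [hit["_id"] for hit in scroll_all(make_fake_fetch(docs, page_size=2))]
print(ids)  # ['0', '1', '2', '3', '4']
```

For a reindex, each yielded hit would be written to the target index instead of being collected in a list.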

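The three basic ES questions: performance tuning

CPU / memory / SSD disks

Beyond what the official site covers (ES_HEAP_SIZE, turning swap off, bootstrap.mlockall, and so on):

translog.durability: "async" has a huge effect on write performance.

The default is request, which means the translog is fsynced to disk (disk I/O) after every index, update, or delete request.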

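ES field types and indexing

[Screenshot 2016-06-01 4.02.58 PM.png: field type rules, image not included]

Without a mapping, ES guesses field types according to the rules above.

With a mapping, ES matches and converts fields according to the mapping.

e.g. if a field is a number in the mapping and the doc contains "123", it is automatically converted to 123; "cde" causes an error.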
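The "123" versus "cde" behaviour can be mimicked in a few lines of Python. This is a simplified illustration of the idea, not how ES implements it, and coerce_to_number is a made-up name.

```python
def coerce_to_number(value):
    """Convert a doc value for a field mapped as a number.

    Simplified illustration: numeric strings are converted, other text is
    rejected with ValueError, mirroring the "123" vs "cde" example.
    """
    if isinstance(value, bool):
        raise ValueError("not a number: {!r}".format(value))
    if isinstance(value, (int, float)):
        return value
    try:
        return int(value)
    except ValueError:
        return float(value)  # raises ValueError for text such as "cde"


print(coerce_to_number("123"))  # 123
try:
    coerce_to_number("cde")
except ValueError as err:
    print("rejected:", err)
```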

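ES handles fields of different types differently. Non-string fields are indexed as they are; for string fields the index option can be no / not_analyzed / analyzed:

no	not_analyzed	analyzed (default)
Not indexed; the field cannot be searched	Not analyzed; indexed as-is	String fields only; analyzed (tokenized) first, then indexed

analyzer: for string fields with index: analyzed, which analyzer to use (at index time and also at search time). The default is standard, which splits Chinese text into single characters; our search uses IK.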
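To see why the standard analyzer is a poor fit for Chinese, the sketch below approximates its visible effect: Latin words stay whole (lowercased) while every Chinese character becomes its own token. It is a rough imitation written for this article, not the real analyzer, and it only recognises the basic CJK Unified Ideographs block.

```python
import re

# CJK Unified Ideographs basic block; other CJK ranges are ignored here.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def standard_like_tokens(text):
    """One token per Chinese character, one lowercased token per run of
    Latin letters or digits (approximation for illustration only)."""
    return [match.group(0).lower() for match in _TOKEN.finditer(text)]


print(standard_like_tokens("Elasticsearch 搜索引擎"))
# ['elasticsearch', '搜', '索', '引', '擎']
```

With single-character tokens, a query for one word also matches any doc that merely contains the same characters, which is the motivation for a dictionary-based tokenizer such as IK.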

Introduction to the DSL

1) match_all

{ "match_all": {}} matches everything; it is the default when no query is given.

2) match

Runs a full-text search, or an exact-value match on non-string or not_analyzed fields.

3) multi_match

Runs the same match against several fields at once.

{
  "multi_match": {
    "query": "full text search",
    "fields": [ "title", "body" ]
  }
}

4) range

Applies to number or date fields.

{
  "range": {
    "age": {
      "gte": 20,
      "lt": 30
    }
  }
}

5) term

Looks up an exact value in a field, e.g. numbers, dates, booleans, or not_analyzed fields.

{ "term": { "age": 26 }}

{ "term": { "date": "2014-09-01" }}

{ "term": { "public": true }}

{ "term": { "tag": "full_text" }}

6) terms

Same as term, except that several values can be given for exact matching.

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

7) exists/missing

Checks whether a field has a value, similar to IS NOT NULL / IS NULL in SQL.

{
  "exists": {
    "field": "title"
  }
}

8) bool: combines a series of query clauses, namely must/must_not/filter/should

{
  "bool" : {
    "must" : {
      "term" : { "user" : "kimchy" }
    },
    "filter": {
      "term" : { "tag" : "tech" }
    },
    "must_not" : {
      "range" : {
        "age" : { "from" : 10, "to" : 20 }
      }
    },
    "should" : [
      { "term" : { "tag" : "wow" } },
      { "term" : { "tag" : "elasticsearch" } }
    ]
  }
}
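Since a bool query is nested JSON, building it programmatically avoids unbalanced braces. A minimal sketch, assuming the four clause names listed above; bool_query and term are helper names invented here, and filter_ carries a trailing underscore only to avoid shadowing Python's built-in filter.

```python
import json


def term(field, value):
    """Build a single term clause."""
    return {"term": {field: value}}


def bool_query(must=None, filter_=None, must_not=None, should=None):
    """Assemble a bool query, leaving out any clause that was not given."""
    clauses = {
        "must": must,
        "filter": filter_,
        "must_not": must_not,
        "should": should,
    }
    return {"bool": {name: value for name, value in clauses.items() if value}}


query = bool_query(
    must=term("user", "kimchy"),
    filter_=term("tag", "tech"),
    should=[term("tag", "wow"), term("tag", "elasticsearch")],
)
print(json.dumps(query, indent=2))
```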

query vs. filter

Use query only for lookups that need a degree of match (i.e. where _score matters); all other lookups should use filter. (As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.)

Filter results are cached by ES, which improves efficiency.

In addition, because a filter neither computes scores nor sorts by them, it is faster than a query.

GET _search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [
        { "term": { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}

script

ES supports scripts (script fields, script score).

Groovy is the default script language.

Scripting is disabled by default and has to be enabled in config/elasticsearch.yml:

script.inline: true

script.indexed: true

There are four kinds of script: inline/file/indexed/plugin

1) inline

GET /_search
{
  "script_fields": {
    "my_field": {
      "script": {
        "inline": "1 + my_var",
        "params": { "my_var": 2 }
      }
    }
  }
}

2) file

Location: config/scripts/my_script.groovy

GET /_search
{
  "script_fields": {
    "my_field": {
      "script": {
        "file": "my_script",
        "params": { "my_var": 2 }
      }
    }
  }
}

3) indexed

The script is stored inside ES (in the internal .scripts index) and then accessed by ID.

curl -XPOST localhost:9200/_scripts/groovy/indexedCalculateScore -d '{"script":"log(_score * 2) + my_modifier"}'

{ "script_score":{ "script":{ "id": "indexedCalculateScore", "lang" : "groovy", "params":{ "my_modifier":8}}}}

4) plugin

Must be installed into ES.

{"script_score":{"script":{"inline":"my_script","lang":"native"}}}

ssh://git.wandoulabs.com:29418/es-search

MyNativeScript*.java under es-search/es-score is an example and a convenient starting point.

function_score

doc and source:

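source gives access to a doc's original values, but it is slow (not loaded into memory): each access goes through a load -> parse step.

doc gives access to the values of not_analyzed fields, e.g. doc['field_name'].value

Speed optimizations tried:

0) Use doc rather than source. Using source was absurdly slow, whereas doc values are already in memory.

1) Adding more shards and ES nodes: no obvious effect.

2) Using size: only the number of docs displayed changed; no effect on speed.

3) terminate_after: searches only the first N docs on each shard; impractical, since the results cannot be guaranteed.

4) Adding a filter so that far fewer docs go through second-pass scoring, which improves speed.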
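The effect of point 4 can be illustrated with a toy model that counts how often an expensive scorer runs with and without a filter in front of it. Everything here (the docs, the scorer, the filter, the function name) is made up for the sketch; it models the idea only and says nothing about ES internals.

```python
def search_with_rescoring(docs, score, doc_filter=None):
    """Score docs, optionally filtering first; returns (ranked, score_calls).

    docs, score and doc_filter are stand-ins invented for this sketch;
    the point is only to count how often the expensive scorer runs.
    """
    candidates = docs if doc_filter is None else [d for d in docs if doc_filter(d)]
    scored = [(score(d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored], len(candidates)


docs = [{"id": i, "status": "published" if i % 10 == 0 else "draft", "views": i}
        for i in range(1000)]

_, calls_unfiltered = search_with_rescoring(docs, score=lambda d: d["views"])
top, calls_filtered = search_with_rescoring(
    docs,
    score=lambda d: d["views"],
    doc_filter=lambda d: d["status"] == "published",
)
print(calls_unfiltered, calls_filtered)  # 1000 100
print(top[0]["id"])                      # 990
```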

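IK tokenization

For string fields the default analyzer is standard, which splits Chinese into single characters and gives poor results.

For fields such as title, add a sub-field via fields (title.cn) and analyze it with "analyzer": "ik".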

"mappings":{"my_type":{"properties":{"text":{"type":"string","fields"
