
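ElasticSearch Basics: Introductory Study Notes

Preface

These notes record my process of learning ElasticSearch from scratch: working through the official documentation and running some tests of my own.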

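Installation

As a beginner, I simply followed the official installation guide. An earlier post, an introduction to using ELK, mainly covered the installation process, so I won't repeat it here.

Tutorials

I searched for a long time, and most of the material is fairly old. Even the official guide is written for 2.x, while the latest release on the official site has already reached 6. It is still fine for learning the basics, though. The rest of these notes follow the official documentation.

After installing ElasticSearch and Kibana, open localhost:5601 and select Dev Tools.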

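Indexing (storing) employee documents

The test data is a list of a company's employee records. Each employee's record is called a document, and adding a record is called indexing a document.

Enter the following in the console: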

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
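  • megacorp is the index name
  • employee is the type name
  • 1 is the document id, which is also the employee's id

Place the cursor on the first line and click the green button to run the request.

This is a shorthand for storing a document; underneath, it still goes through the REST API that ES provides. The same request can be sent with postman or curl, using the ES address, i.e. localhost:9200, as the host. In postman, GET requests cannot carry a body; for requests that need one (such as the searches later on), POST works as well.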
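To make the link between the console shorthand and the underlying REST call concrete, here is a minimal Python sketch that splits a console snippet into an HTTP method, a full URL, and a JSON body. The helper name console_to_http is made up for illustration, and the base address assumes ES is listening on localhost:9200 as stated above. Nothing is sent over the network.

```python
import json

ES_BASE = "http://localhost:9200"  # ES address from the text; adjust if yours differs


def console_to_http(snippet, base=ES_BASE):
    """Split a Dev Tools console snippet into (method, url, body).

    Hypothetical helper for illustration; it is not part of Kibana or ES.
    """
    first_line, _, rest = snippet.strip().partition("\n")
    method, path = first_line.split(None, 1)
    # Everything after the first line is the JSON body, if there is one.
    body = json.loads(rest) if rest.strip() else None
    return method.upper(), base + path.strip(), body


snippet = """PUT /megacorp/employee/1
{
    "first_name": "John",
    "last_name": "Smith",
    "age": 25
}"""

method, url, body = console_to_http(snippet)
print(method, url)
print(json.dumps(body))
```

The resulting method, URL, and body are exactly what you would enter in postman or pass to curl.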

That stores one document in ES. Next, store a few more:

PUT /megacorp/employee/2
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

PUT /megacorp/employee/3
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}

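Then we can take a look at the data: click Management, Index Patterns, Configure an index pattern, enter megacorp, and confirm.

Click Discover to see the records we stored.

Retrieving a document

Once the data is stored, we want to query it back. Look up the employee with id 1: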

GET /megacorp/employee/1

Response:
{
  "_index": "megacorp",
  "_type": "employee",
  "_id": "1",
  "_version": 5,
  "found": true,
  "_source": {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": [
      "sports",
      "music"
    ]
  }
}

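Compared with storing a record, only the HTTP method differs.

  • PUT adds a document
  • GET retrieves it
  • DELETE deletes it
  • HEAD checks whether it exists
  • To update a document, simply PUT it again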
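The verb-to-operation mapping above can be sketched with Python's standard urllib. The helper document_request is a made-up name, and the index, type, and address are the ones used in these notes. The requests are only constructed here, not sent.

```python
import json
import urllib.request

ES_BASE = "http://localhost:9200"  # assumed ES address, as used in these notes


def document_request(method, doc_id, source=None, index="megacorp", doc_type="employee"):
    """Build (but do not send) a request against a single document.

    Hypothetical helper for illustration only.
    """
    url = f"{ES_BASE}/{index}/{doc_type}/{doc_id}"
    data = json.dumps(source).encode("utf-8") if source is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


requests_by_action = {
    "add or update": document_request("PUT", 1, {"first_name": "John"}),
    "get": document_request("GET", 1),
    "delete": document_request("DELETE", 1),
    "exists": document_request("HEAD", 1),
}

for action, req in requests_by_action.items():
    print(action, req.get_method(), req.full_url)
```

To actually execute one of them, pass it to urllib.request.urlopen; that requires a running ES node.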

Lightweight search

Besides findById, the most common operation is a conditional query.

First, list everything:

GET /megacorp/employee/_search

By the way, you can also get the number of records with count:

GET /megacorp/employee/_count

To find the employees whose last_name is Smith:

GET /megacorp/employee/_search?q=last_name:Smith

Add a parameter q whose value has the form field_name:value.
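When the same search is issued from code rather than the console, the q value should be URL-encoded. A small sketch using only the standard library follows; the function name lite_search_url is invented for this example, and the address is the assumed localhost:9200.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def lite_search_url(field, value, base="http://localhost:9200",
                    index="megacorp", doc_type="employee"):
    """Build a query-string search URL of the form ..._search?q=field:value."""
    # urlencode percent-encodes reserved characters in the q value.
    query = urlencode({"q": f"{field}:{value}"})
    return f"{base}/{index}/{doc_type}/_search?{query}"


url = lite_search_url("last_name", "Smith")
print(url)
# Decoding the query string gives back the original field:value pair.
print(parse_qs(urlsplit(url).query))
```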

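Query DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see Lightweight search). Elasticsearch provides a rich, flexible query language called the query DSL, which supports building more complex and robust queries.

The domain-specific language (DSL) is expressed as a JSON request body. We can rewrite the earlier search for all Smiths like this: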

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

More complex queries

Continue modifying the previous query:

GET /megacorp/employee/_search
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith" 
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            }
        }
    }
}

A range filter has been added, requiring age to be greater than 30.

Result:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}
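To see why only Jane comes back, the following Python sketch emulates the selection logic of the bool query on the three sample employees. It is a deliberate simplification: the match clause is reduced to lowercasing and whitespace splitting, and there is no relevance scoring. In real ES the must clause contributes to _score while the filter clause only includes or excludes documents.

```python
employees = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]


def matches(text, query):
    """Crude stand-in for a match query: lowercase, split, any shared term."""
    return bool(set(text.lower().split()) & set(query.lower().split()))


def bool_search(docs, must_field, must_text, range_field, gt):
    """Keep documents that satisfy the must clause AND the range filter."""
    return [
        d for d in docs
        if matches(d[must_field], must_text) and d[range_field] > gt
    ]


result = bool_search(employees, "last_name", "smith", "age", 30)
print([d["first_name"] for d in result])
```

John is a Smith but is only 25, and Douglas is over 30 but is not a Smith, so both are excluded.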

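Full-text search

The searches so far have been fairly simple: a single name, filtered by age. Now let's try something a little more advanced, full-text search, a task that traditional databases really struggle with.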

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

Result:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.53484553,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.53484553,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.26742277,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}

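The results are sorted by a relevance score, _score. Note that a document matching only one of the two words ("rock") is also returned. If we want an exact match, we need a different kind of query.

match_phrase matches the exact phrase.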

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
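The difference between match and match_phrase can be illustrated with a toy emulation in Python. This is only a sketch: analysis is reduced to lowercasing and whitespace splitting, and scoring is ignored. The about texts are the ones indexed earlier.

```python
abouts = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
    3: "I like to build cabinets",
}


def tokens(text):
    """Very rough analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(text, query):
    """True if at least one query term occurs in the text."""
    words = tokens(text)
    return any(term in words for term in tokens(query))


def match_phrase(text, query):
    """True only if the query terms occur consecutively and in order."""
    words, phrase = tokens(text), tokens(query)
    return any(
        words[i:i + len(phrase)] == phrase
        for i in range(len(words) - len(phrase) + 1)
    )


match_ids = [i for i, about in abouts.items() if match(about, "rock climbing")]
phrase_ids = [i for i, about in abouts.items() if match_phrase(about, "rock climbing")]
print(match_ids)
print(phrase_ids)
```

With match, employee 2 is selected because of the single word "rock"; with match_phrase only employee 1 remains.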

When we search on Baidu, the matched keywords are highlighted; ES can also return the highlighted fragments.

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        },
        "highlight": {
          "about": [
            "I love to go <em>rock</em> <em>climbing</em>"
          ]
        }
      }
    ]
  }
}
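The highlight section of the response wraps each matched term in em tags. A toy Python version of that idea is sketched below; it assumes whitespace tokenization and does not reproduce ES's actual highlighters, which work on analyzed terms and can return multiple fragments.

```python
def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every word of text that appears in the query with pre/post tags."""
    terms = set(query.lower().split())
    return " ".join(
        f"{pre}{word}{post}" if word.lower() in terms else word
        for word in text.split()
    )


highlighted = highlight("I love to go rock climbing", "rock climbing")
print(highlighted)
```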

Aggregations (group by)

SQL queries often involve statistical calculations such as sum, count, and avg. In ES it can be done like this:

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

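aggs stands for aggregations, all_interests is the name under which the result is returned, and terms groups documents by each distinct value of the field and counts them. In other words, this statement counts documents per value of interests. ES may then return: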

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "megacorp",
        "node": "iqHCjOUkSsWM2Hv6jT-xUQ",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    }
  },
  "status": 400
}

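The message says that aggregating on a text field requires enabling the setting fielddata=true (or, alternatively, using a keyword field instead).

This means changing the index's mapping, which is comparable to altering a table schema in a relational database.

Modifying the index mapping

First, look at the current configuration: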

GET /megacorp/employee/_mapping

Result:

{
  "megacorp": {
    "mappings": {
      "employee": {
        "properties": {
          "about": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "age": {
            "type": "long"
          },
          "first_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "interests": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "last_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

It is easy to see that this defines the type of each field. To solve the problem above, one setting needs to be added:

"fielddata": true

Update it as follows:

PUT /megacorp/employee/_mapping
{
  "properties": {
    "about": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "age": {
      "type": "long"
    },
    "first_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "interests": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      },
      "fielddata": true
    },
    "last_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

Response:

{
  "acknowledged": true
}

This means the update succeeded. Now we can continue with the aggregation from before.
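The error message also names an alternative to enabling fielddata: aggregating on a keyword field. The mapping shown above already contains an interests.keyword sub-field, so a request body targeting it can be sketched as below. The helper terms_agg_body is a made-up name for illustration. One difference to keep in mind is that a keyword field is not analyzed, so its bucket keys are the exact stored strings rather than analyzed tokens.

```python
import json


def terms_agg_body(name, field, use_keyword_subfield=True):
    """Build a search body with one terms aggregation and no hits.

    Hypothetical helper; it only assembles the JSON, it does not call ES.
    """
    target = f"{field}.keyword" if use_keyword_subfield else field
    return {"size": 0, "aggs": {name: {"terms": {"field": target}}}}


body = terms_agg_body("all_interests", "interests")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted under GET /megacorp/employee/_search in the console.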

Aggregation: group by with count

The SQL equivalent would be something like:

select interests, count(*) from index_xxx
where last_name = 'smith'
group by interests;

In ES it can be queried like this:

GET /megacorp/employee/_search
{
  "_source": false,
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "size": 0,
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}

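"_source": false suppresses the source fields of each matching hit; the default is true.

"size": 0 means no hits are returned. By default a search returns the top 10 hits; we do not need them here, only the aggregation results.

aggs denotes an aggregation.

all_interests is a user-defined name; any name will do.

terms groups the documents by the distinct values of the given field, here interests, and counts the documents in each group.

Response: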

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "all_interests": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "music",
          "doc_count": 2
        },
        {
          "key": "sports",
          "doc_count": 1
        }
      ]
    }
  }
}
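The buckets above can be reproduced by hand, which helps confirm what the terms aggregation computes. The sketch below applies the same filter (last_name matching smith) and the same grouping to the three sample employees in plain Python, using collections.Counter in place of ES. It only mirrors the logic and says nothing about how ES executes it.

```python
from collections import Counter

employees = [
    {"last_name": "Smith", "interests": ["sports", "music"]},
    {"last_name": "Smith", "interests": ["music"]},
    {"last_name": "Fir", "interests": ["forestry"]},
]


def terms_count(docs, field):
    """Count documents per distinct value of a list-valued field."""
    counts = Counter(
        value
        for doc in docs
        for value in dict.fromkeys(doc[field])  # count each value once per document
    )
    # most_common() orders buckets by descending count, like the response above.
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


smiths = [d for d in employees if d["last_name"].lower() == "smith"]
buckets = terms_count(smiths, "interests")
print(buckets)
```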

This gives the count for each value of the field. sum and avg can be computed in the same way.

GET /megacorp/employee/_search
{
    "_source": false, 
    "size": 0, 
    "aggs" : {
        "avg_age" : {
            "avg" : { "field" : "age" }
        },
        "sum_age" : {
            "sum" : { "field" : "age" }
        }
    }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "avg_age": {
      "value": 30.666666666666668
    },
    "sum_age": {
      "value": 92
    }
  }
}
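These two values are easy to check by hand against the three ages indexed earlier (25, 32, and 35). A short Python check:

```python
ages = [25, 32, 35]  # ages of the three sample employees

sum_age = sum(ages)            # corresponds to the sum_age aggregation
avg_age = sum_age / len(ages)  # corresponds to the avg_age aggregation

print(sum_age, avg_age)
```

The sum is 92 and the average is 92 / 3, in line with the response above.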

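Summary

The above covers the first section of the official documentation, the basic introduction. I have only transcribed it and worked through it once here, without going much further, though I did add my own understanding. It is enough to get a basic idea of how to use ES. More detailed syntax will come later.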

References

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/_search_with_query_dsl.html