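1. Index time analysis. When a document is created or updated, its text fields are analyzed (split into terms).
2. Search time analysis. At query time, the query string is analyzed.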

- Specify the analyzer at query time with the analyzer parameter

GET test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}

- Specify search_analyzer when creating the index mapping

PUT test2
{
  "mappings": { "properties": { "title": { "type": "text", "analyzer": "whitespace", "search_analyzer": "standard" } } }
}
# When no analyzer is specified, the default standard analyzer is used
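
To see why the pairing of index-time and search-time analyzers matters, here is a minimal sketch in plain Python. It does not call Elasticsearch; `whitespace_analyze`, `standard_like_analyze`, and `matches` are hypothetical helpers, and the regex is only a rough stand-in for the real standard tokenizer. The idea it illustrates: a match query finds a document only when a query term equals a term that was stored at index time.

```python
import re

def whitespace_analyze(text):
    """Rough stand-in for the whitespace analyzer: split on whitespace only."""
    return text.split()

def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: word-like runs, lowercased."""
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)
    return [w.lower() for w in words]

def matches(indexed_terms, query_terms):
    """A match query hits when at least one query term equals an indexed term."""
    return bool(set(indexed_terms) & set(query_terms))

title = "Quick Brown-Foxes"
indexed = whitespace_analyze(title)   # terms stored at index time

# Searching with a standard-like analyzer lowercases the query term,
# so it no longer equals the case-preserved indexed term.
print(matches(indexed, standard_like_analyze("Quick")))
# Searching with the same analyzer that was used at index time does match.
print(matches(indexed, whitespace_analyze("Quick")))
```

With a whitespace index analyzer and a standard search analyzer, as in test2, capitalized words in the title would likely be unreachable, which is worth checking with the analyze API before settling on a mapping.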


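  • Decide explicitly whether each field needs to be analyzed. For fields that do not, set type to keyword, which saves space and improves write performance.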

_analyze API

GET _analyze
{
  "analyzer": "standard",
  "text": "this is a test"
}
# Shows the tokens that the standard analyzer produces for text
{
  "tokens" : [
    { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}


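PUT test3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "standard", "stopwords": "_english_" }   # assumed definition, consistent with the my_text.english output
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": { "type": "text", "analyzer": "my_analyzer" }
        }
      }
    }
  }
}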


POST test3/_analyze
{
  "field": "my_text",
  "text": ["The test message."]
}

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "test", "start_offset" : 4, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "message", "start_offset" : 9, "end_offset" : 16, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}

POST test3/_analyze
{
  "field": "my_text.english",
  "text": ["The test message."]
}
{
  "tokens" : [
    { "token" : "test", "start_offset" : 4, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "message", "start_offset" : 9, "end_offset" : 16, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}


  • standard: composed of the following
    • tokenizer: Standard Tokenizer
    • token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter (disabled by default)
      Testing with the analyze API:
      POST _analyze
      {
        "analyzer": "standard",
        "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }


{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}


  • whitespace: splits on whitespace
    POST _analyze
    {
      "analyzer": "whitespace",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    -->  [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
  • simple: splits on non-letter characters and lowercases
    POST _analyze
    {
      "analyzer": "simple",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    --> [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

  • stop: uses the _english_ stop words by default
    POST _analyze
    {
      "analyzer": "stop",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    --> [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
    # stopwords
    # stopwords_path
  • keyword: does not tokenize
    POST _analyze
    {
      "analyzer": "keyword",
      "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
    }
    Returns "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", the complete sentence as a single token
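
The differences between these four analyzers are easy to compare side by side. The following is a rough Python approximation, not the Elasticsearch implementation: the function names are hypothetical, and `ENGLISH_STOP_WORDS` is assumed to mirror the _english_ stop word set.

```python
import re

# Assumed to mirror the _english_ stop word set used by the stop analyzer.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})

def whitespace(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple(text):
    """Split on anything that is not a letter, then lowercase."""
    return re.findall(r"[a-z]+", text.lower())

def stop(text):
    """Like simple, but also drop stop words."""
    return [t for t in simple(text) if t not in ENGLISH_STOP_WORDS]

def keyword(text):
    """No tokenization: the whole input is one token."""
    return [text]

sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
for analyzer in (whitespace, simple, stop, keyword):
    print(analyzer.__name__, analyzer(sentence))
```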


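Elasticsearch ships with many built-in analyzers, but they handle Chinese poorly. For example, the standard analyzer splits a Chinese sentence into individual characters. In that case you can use a third-party analyzer plugin such as ik or pinyin. The examples here use ik.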
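
As a rough illustration of the character-by-character behaviour described above, the sketch below emits every Han ideograph as its own token. `standard_like_cjk` is a hypothetical helper and only covers the basic CJK Unified Ideographs range; it is not how Elasticsearch is implemented.

```python
def standard_like_cjk(text):
    """Emit each Han ideograph (U+4E00..U+9FFF) as its own token; drop everything else."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

tokens = standard_like_cjk("你好嗎?我有一句話要對你說呀。")
print(tokens)
print(len(tokens))
```

Single-character terms carry little meaning for search, which is why a dictionary-based analyzer such as ik is usually preferred for Chinese text.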


# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
# /etc/init.d/elasticsearch restart


GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好嗎?我有一句話要對你說呀。"
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "你好嗎?我有一句話要對你說呀。"
}


You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters.

  • custom: a custom analyzer is made up of the following parts
    • 0 or more character filters
    • 1 tokenizer
    • 0 or more token filters
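
The three stages can be sketched as a plain function pipeline. Everything in this block (`build_analyzer`, `strip_tags`, `word_tokenizer`, `lowercase`) is a hypothetical stand-in written for illustration; the point is only the order of execution: character filters on the raw string, then exactly one tokenizer, then token filters on the token list.

```python
import html
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]

def build_analyzer(char_filters: List[CharFilter],
                   tokenizer: Tokenizer,
                   token_filters: List[TokenFilter]) -> Callable[[str], List[str]]:
    """Compose 0..n char filters, exactly 1 tokenizer, and 0..n token filters."""
    def analyze(text: str) -> List[str]:
        for char_filter in char_filters:      # stage 1: rewrite the raw text
            text = char_filter(text)
        tokens = tokenizer(text)              # stage 2: split into tokens
        for token_filter in token_filters:    # stage 3: add, remove, or change tokens
            tokens = token_filter(tokens)
        return tokens
    return analyze

def strip_tags(text: str) -> str:
    """Very rough HTML stripping: drop tags, decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def word_tokenizer(text: str) -> List[str]:
    """Rough word tokenizer: alphanumeric runs, keeping inner apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]

my_analyzer = build_analyzer([strip_tags], word_tokenizer, [lowercase])
print(my_analyzer("The 2 QUICK Brown-Foxes jumped over the lazy dog's <b> bone.</b>"))
```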


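PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "custom", "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase"] }   # assumed definition, for illustration
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's <b> bone.</b>"]
}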



PUT test_index
{
  "settings": {
    "analysis": {            # analysis settings; these can be customized
      "char_filter": {},     # char_filter: reserved key
      "tokenizer": {},       # tokenizer: reserved key
      "filter": {},          # filter: reserved key
      "analyzer": {}         # analyzer: reserved key
    }
  }
}

A character filter processes the raw text before it reaches the tokenizer, for example by adding, deleting, or replacing characters.


  • html strip: removes HTML tags and decodes HTML entities
    • Parameter: escaped_tags, the tags that are not removed


POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}
      "token": """

I'm so happy!

"""

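PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {   # reserved key
        "my_analyzer": { "tokenizer": "keyword", "char_filter": ["my_char_filter"] }   # custom analyzer
      },
      "char_filter": {   # reserved key
        "my_char_filter": {   # custom char_filter
          "type": "html_strip",
          "escaped_tags": ["b"]   # array of HTML tags that are not stripped from the text
        }
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}
      "token": """

I'm so <b>happy</b>!

"""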
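
Both results can be imitated with a short Python sketch. This is an approximation under stated assumptions, not the real html_strip filter: `html_strip` here is a hypothetical function, and `BLOCK_TAGS` is an assumed small subset of tags that are replaced by a newline instead of being dropped.

```python
import html
import re

BLOCK_TAGS = {"p", "div", "br"}  # assumption: tags that become a newline

def html_strip(text, escaped_tags=()):
    """Remove HTML tags (except escaped_tags) and decode HTML entities."""
    keep = {t.lower() for t in escaped_tags}

    def replace_tag(match):
        name = match.group(1).lower()
        if name in keep:
            return match.group(0)          # leave escaped tags untouched
        return "\n" if name in BLOCK_TAGS else ""

    without_tags = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", replace_tag, text)
    return html.unescape(without_tags)     # e.g. &apos; becomes '

source = "<p>I&apos;m so <b>happy</b>!</p>"
print(repr(html_strip(source)))
print(repr(html_strip(source, escaped_tags=["b"])))
```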


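  • mapping: replaces keys with values; exactly one of the following parameters must be set
    • mappings: an array of mappings, each in the form key => value
    • mappings_path: an absolute path, or a path relative to the config directory, to a file of key => value mappings
    PUT t_index
    {
      "settings": {
        "analysis": {
          "analyzer": {     # reserved key
            "my_analyzer": { "tokenizer": "standard", "char_filter": ["my_char_filter"] }   # custom analyzer; the tokenizer is assumed
          },
          "char_filter": {    # reserved key
            "my_char_filter": { # custom char_filter
              "type": "mapping",
              "mappings": [":) => happy"]       # the mapping rules; this rule is inferred from the result
            }
          }
        }
      }
    }
    POST t_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": ["i am so :)"]
    }
    Returns [i, am, so, happy]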
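
A minimal sketch of what a mapping character filter does, assuming rules in the key => value form described above. `parse_mappings` and `mapping_char_filter` are hypothetical names, and the longest-key-first matching is a design choice of this sketch so that overlapping keys behave predictably.

```python
def parse_mappings(rules):
    """Turn ["key => value", ...] into a dict, skipping rules with an empty key."""
    table = {}
    for rule in rules:
        key, value = rule.split("=>", 1)
        if key.strip():
            table[key.strip()] = value.strip()
    return table

def mapping_char_filter(text, rules):
    """Replace every occurrence of a key with its value, preferring longer keys."""
    table = parse_mappings(rules)
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no key starts here; copy the character
            i += 1
    return "".join(out)

filtered = mapping_char_filter("i am so :)", [":) => happy"])
print(filtered)
print(filtered.split())
```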
  • pattern replace
    • pattern: a regular expression
    • replacement: the replacement string; it can reference capture groups with $1..$9
    • flags: regular expression flags
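
The pattern and replacement parameters behave much like a regular expression substitution. The sketch below uses Python's `re` module as a stand-in; `pattern_replace` is a hypothetical helper, the sample pattern and text are made up for illustration, and the `$1`-to-`\g<1>` conversion exists only because Python writes group references differently.

```python
import re

def pattern_replace(text, pattern, replacement, flags=0):
    """Apply a regex replacement, accepting $1..$9 style group references."""
    # Convert $1..$9 into Python's \g<1>..\g<9> syntax.
    py_replacement = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text, flags=flags)

# Replace the dashes inside a number with underscores so that a later
# tokenizer would keep the number as one term.
print(pattern_replace("My credit card is 123-456-789", r"(\d+)-(?=\d)", "$1_"))
```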

A tokenizer splits the original text into terms according to a set of rules.

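  • standard
    • Parameter: max_token_length, the maximum token length; defaults to 255


PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": { "type": "standard", "max_token_length": 5 }   # my_tokenizer is an assumed name
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
Returns [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
# jumped has 6 characters, so it is split after the 5th character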
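
The effect of max_token_length can be sketched in Python as follows. `standard_like_tokenize` is a hypothetical function whose regex only approximates the standard tokenizer; the part that matters is the slicing step, which cuts any token longer than the limit into chunks of at most that length.

```python
import re

def standard_like_tokenize(text, max_token_length=255):
    """Split into word-like tokens, then cut tokens longer than max_token_length."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text):
        for start in range(0, len(word), max_token_length):
            tokens.append(word[start:start + max_token_length])
    return tokens

sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(standard_like_tokenize(sentence, max_token_length=5))
```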

  • letter: splits the text into terms whenever it meets a character that is not a letter


POST _analyze
{
  "tokenizer": "letter",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
Returns [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]

  • lowercase: works like the letter tokenizer, and also converts the terms to lowercase


POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Returns [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

  • whitespace: splits the text into terms on whitespace characters
    • Parameter: max_token_length
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Returns [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

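  • keyword: a no-op tokenizer that outputs exactly the text it was given
    • Parameter: buffer_size, the number of characters read into the term buffer in a single pass; defaults to 256
POST _analyze
{
  "tokenizer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
Returns "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", the whole text as a single token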

A token filter adds, removes, or modifies the tokens output by the tokenizer.

  • lowercase: converts the output tokens to lowercase
POST _analyze
{
  "filter": ["lowercase"],
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's  bone"]
}
"token": "the 2 quick brown-foxes jumped over the lazy dog's  bone"

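PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase"] }   # assumed definition, for illustration
      }
    }
  }
}
POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's  bone"]
}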

  • stop: removes stop words from the token stream.
    # stopwords   the stop words to use; defaults to _english_
    # stopwords_path
    # ignore_case   when true, all words are lowercased first; defaults to false
    # remove_trailing
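    PUT t_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": { "tokenizer": "standard", "filter": ["my_filter"] }   # assumed definition
          },
          "filter": {
            "my_filter": { "type": "stop", "stopwords": ["and", "not"] }   # my_filter and its stop words are assumed, for illustration
          }
        }
      }
    }
    POST t_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": ["lucky and happy not sad"]
    }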