
ElasticSearch analyzers

analyzer

Analyzers are used in two situations:
1. Index time analysis: when a document is created or updated, its text fields are analyzed.
2. Search time analysis: when a query is executed, the query string is analyzed.

Ways to specify which analyzer is used at search time (an analyzer set in the query itself overrides the field's search_analyzer, which in turn overrides the field's index-time analyzer):

  - Specify an analyzer in the query itself

GET test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}

- Specify search_analyzer when creating the index mapping

PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title":{
          "type": "text",
          "analyzer": "whitespace",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Index-time analysis is configured via the analyzer parameter on each field in the index mapping:

# If no analyzer is specified, the default standard analyzer is used
PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace"   # specify the analyzer; ES ships with several built-in analyzers
        }
      }
    }
  }
}

Note:

  • Decide explicitly whether each field needs analysis; set type to keyword for fields that do not, which saves space and improves write performance.
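
For example, a minimal sketch (the index and field names here are made up) of a field stored as keyword, so it is never analyzed and is only matched as an exact value:

PUT log_index
{
  "mappings": {
    "doc": {
      "properties": {
        "status_code": {
          "type": "keyword"    # hypothetical field; stored as-is, no analysis
        }
      }
    }
  }
}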

The _analyze API

GET _analyze
{
  "analyzer": "standard",
  "text": "this is a test"
}
# shows how the standard analyzer tokenizes the text
{
  "tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Setting the analyzer

PUT test
{
  "settings": {
    "analysis": {                  # custom analysis settings
      "analyzer": {                # the analyzer section
        "my_analyzer": {           # name of the custom analyzer
          "type": "standard",            # based on the standard analyzer
          "stopwords": "_english_"       # standard's stopwords parameter; the default is _none_
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard",        # the my_text field uses the standard analyzer
          "fields": {
            "english": {                 # the my_text.english sub-field uses the custom my_analyzer
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
POST test/_analyze
{
  "field": "my_text",    # my_text欄位使用的是standard分詞器
  "text": ["The test message."]
}
-------------->[the,test,message]

POST test/_analyze
{
  "field": "my_text.english",     #my_text.english使用的是my_analyzer分詞器
  "text": ["The test message."]
}
------------>[test,message]

ES has many built-in analyzers, for example:

  • standard, composed of:
    • tokenizer: Standard Tokenizer
    • token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter (disabled by default)
    • testing it with the _analyze API:
      POST _analyze
      {
        "analyzer": "standard",
        "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }

      The result:

      {
        "tokens": [
          {
            "token": "the",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
          },
          {
            "token": "2",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<NUM>",
            "position": 1
          },
          {
            "token": "quick",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 2
          },
          {
            "token": "brown",
            "start_offset": 12,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
          },
          {
            "token": "foxes",
            "start_offset": 18,
            "end_offset": 23,
            "type": "<ALPHANUM>",
            "position": 4
          },
          {
            "token": "jumped",
            "start_offset": 24,
            "end_offset": 30,
            "type": "<ALPHANUM>",
            "position": 5
          },
          {
            "token": "over",
            "start_offset": 31,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 6
          },
          {
            "token": "the",
            "start_offset": 36,
            "end_offset": 39,
            "type": "<ALPHANUM>",
            "position": 7
          },
          {
            "token": "lazy",
            "start_offset": 40,
            "end_offset": 44,
            "type": "<ALPHANUM>",
            "position": 8
          },
          {
            "token": "dog's",
            "start_offset": 45,
            "end_offset": 50,
            "type": "<ALPHANUM>",
            "position": 9
          },
          {
            "token": "bone",
            "start_offset": 51,
            "end_offset": 55,
            "type": "<ALPHANUM>",
            "position": 10
          }
        ]
      }
  • whitespace: uses whitespace as the delimiter
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
-->  [ The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone. ]

  simple: lowercases and splits on any non-letter character

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---> [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

stop: like simple, plus stopword removal; the default stopwords list is _english_

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
-->[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Optional parameters:
# stopwords
# stopwords_path
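
As a sketch (the index name test_stop and the word list are made up), a stop-type analyzer with a custom stopword list can be configured like this:

PUT test_stop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]   # custom list instead of the default _english_
        }
      }
    }
  }
}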

keyword: performs no tokenization at all

POST _analyze
{
  "analyzer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
Yields a single token containing the complete input: "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."

==================================================================================

Third-party analyzer plugins: Chinese analysis (the ik analyzer)

ES has many built-in analyzers, but none of them handle Chinese well; the standard analyzer, for example, splits a Chinese sentence into individual characters, as the quick check below shows. For Chinese text, use a third-party Analyzer plugin such as ik or pinyin. Here we take ik as the example.
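
To see the problem, run the standard analyzer over a short Chinese phrase; it emits one single-character token per character:

GET _analyze
{
  "analyzer": "standard",
  "text": "你好嗎"
}
# tokens: [你, 好, 嗎] -- one token per character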

1. Install the plugin (the plugin version must match your ES version exactly), then restart ES:

# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
# /etc/init.d/elasticsearch restart

2. Usage example:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好嗎?我有一句話要對你說呀。"
}
{
  "tokens": [
    {
      "token": "你好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "好嗎",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "有",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "一句話",
      "start_offset": 6,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "一句",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "一",
      "start_offset": 6,
      "end_offset": 7,
      "type": "TYPE_CNUM",
      "position": 6
    },
    {
      "token": "句話",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "句",
      "start_offset": 7,
      "end_offset": 8,
      "type": "COUNT",
      "position": 8
    },
    {
      "token": "話",
      "start_offset": 8,
      "end_offset": 9,
      "type": "CN_CHAR",
      "position": 9
    },
    {
      "token": "要對",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 10
    },
    {
      "token": "你",
      "start_offset": 11,
      "end_offset": 12,
      "type": "CN_CHAR",
      "position": 11
    },
    {
      "token": "說呀",
      "start_offset": 12,
      "end_offset": 14,
      "type": "CN_WORD",
      "position": 12
    }
  ]
}
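
ik_max_word, used above, exhaustively emits every word combination it can find, which suits index-time analysis. The plugin also ships an ik_smart analyzer that produces a coarser, non-overlapping segmentation, often a better fit for query strings. A quick comparison on the same input:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "你好嗎?我有一句話要對你說呀。"
}
# yields far fewer, coarser-grained tokens than ik_max_word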


Reference: https://github.com/medcl/elasticsearch-analysis-ik

You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters:

  • custom, a user-defined analyzer, composed of (see the sketch after this list):
    • 0 or more character filters
    • exactly 1 tokenizer
    • 0 or more token filters
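
A minimal sketch of assembling one (the index and analyzer names are made up): an html_strip character filter, the standard tokenizer, and lowercase plus asciifolding token filters:

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",                         # assembled from parts, not derived from a built-in
          "char_filter": ["html_strip"],            # 0+ character filters run first
          "tokenizer": "standard",                  # exactly 1 tokenizer
          "filter": ["lowercase", "asciifolding"]   # 0+ token filters run last
        }
      }
    }
  }
}

POST my_custom_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>Is this déjà vu?</p>"
}
# expected tokens: [is, this, deja, vu]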