es - Elasticsearch custom analyzers - built-in token filters - 6

阿新 • Published: 2021-02-02

No program in the world is perfect, but that does not discourage us, because programming is a continual pursuit of perfection.

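Custom analyzer:

Character filters:
1. Purpose: add, remove, or change characters in the input text
2. Count: zero or more
3. Built-in character filters:
   1. HTML Strip Character filter: strips HTML tags
   2. Mapping Character filter: replaces strings according to a mapping
   3. Pattern Replace Character filter: replaces regular-expression matches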
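Tokenizer:
1. Purpose:
   1. Splits the text into tokens
   2. Records the order and position of each token (used by phrase queries)
   3. Records the start and end character offsets of each token (used by highlighting)
   4. Records the type of each token (classification)
2. Count: exactly one
3. Categories:
   1. Word-oriented tokenizers:
      1. Standard
      2. Letter
      3. Lowercase
      4. Whitespace
      5. UAX URL Email
      6. Classic
      7. Thai
   2. Partial-word tokenizers:
      1. N-Gram
      2. Edge N-Gram
   3. Structured-text tokenizers:
      1. Keyword
      2. Pattern
      3. Simple Pattern
      4. Char Group
      5. Simple Pattern Split
      6. Path Hierarchy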
Token filters:
1. Purpose: add, remove, or change tokens
2. Count: zero or more
3. Types:
   1. apostrophe
   2. asciifolding
   3. cjk bigram
   4. cjk width
   5. classic
   6. common grams
   7. conditional
   8. decimal digit
   9. delimited payload
   10. dictionary decompounder
   11. edge ngram
   12. elision
   13. fingerprint
   14. flatten_graph
   15. hunspell
   16. hyphenation decompounder
   17. keep types
   18. keep words
   19. keyword marker
   20. keyword repeat
   21. kstem
   22. length
   23. limit token count
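
# A minimal sketch of how the three parts above combine into one custom
# analyzer: zero or more character filters, exactly one tokenizer, and zero
# or more token filters.
# Assumptions (not from the original post): the index name my_index and the
# analyzer name my_custom_analyzer are hypothetical.
PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_custom_analyzer" : {
          "type"        : "custom",
          "char_filter" : ["html_strip"],
          "tokenizer"   : "standard",
          "filter"      : ["lowercase", "kstem"]
        }
      }
    }
  }
}

# Run the hypothetical analyzer against sample text
GET /my_index/_analyze
{
  "analyzer" : "my_custom_analyzer",
  "text"     : ["<p>Hello Gooding ME</p>"]
}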

Today's demo covers items 20-23 from the token filter list; all four are straightforward and practical.

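# keyword repeat token filter
# Purpose:
#   1. Emits a copy of every token flagged as a keyword; stemmers skip that copy, so the original form is kept
#   2. Each token is therefore output twice at the same position
# Commonly paired filters:
#   1. stemmer: stems the copy that is not flagged as a keyword
#   2. remove_duplicates: drops the duplicate when stemming leaves the token unchanged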
GET /_analyze
{
  "tokenizer"  : "whitespace",
  "filter"     : ["keyword_repeat", "stemmer", "remove_duplicates"],
  "text"       : ["hello gooding me"],
  "explain"    : false,
  "attributes" : ["keyword"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "gooding",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    }
  ]
}
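
# The keyword flag itself can be inspected by enabling "explain". The
# "attributes" list only filters the explain output, so it has no visible
# effect while "explain" is false. A sketch of the same request with explain
# turned on (its output is omitted because it was not captured in the
# original post):
GET /_analyze
{
  "tokenizer"  : "whitespace",
  "filter"     : ["keyword_repeat", "stemmer", "remove_duplicates"],
  "text"       : ["hello gooding me"],
  "explain"    : true,
  "attributes" : ["keyword"]
}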

# kstem token filter
# English stemmer based on the KStem algorithm
GET /_analyze
{
  "tokenizer": "standard",
  "filter"   : ["kstem"],
  "text"     : ["hello gooding me"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "good",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
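
# For a side-by-side comparison, the same text can be run through
# porter_stem, another built-in English stemming filter. This is only a
# sketch for experimenting; its output is not shown because it was not part
# of the original post.
GET /_analyze
{
  "tokenizer": "standard",
  "filter"   : ["porter_stem"],
  "text"     : ["hello gooding me"]
}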

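# length token filter
# Removes tokens that are shorter or longer than the configured character lengths
# Options:
#   1. min : minimum token length, default 0
#   2. max : maximum token length, default Integer.MAX_VALUE (2^31-1)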
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter"    : [{
    "type" : "length",
    "min"  : 2,
    "max"  : 5
  }],
  "text" : ["hello gooding me"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    }
  ]
}
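
# To reuse the same setting outside of ad hoc _analyze calls, the filter can
# be declared in the index settings and referenced from a custom analyzer.
# A sketch only: the index name length_demo, the filter name length_2_to_5,
# and the analyzer name short_words are hypothetical.
PUT /length_demo
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "length_2_to_5" : {
          "type" : "length",
          "min"  : 2,
          "max"  : 5
        }
      },
      "analyzer" : {
        "short_words" : {
          "type"      : "custom",
          "tokenizer" : "whitespace",
          "filter"    : ["length_2_to_5"]
        }
      }
    }
  }
}

# Run the hypothetical analyzer on the same sample text
GET /length_demo/_analyze
{
  "analyzer" : "short_words",
  "text"     : ["hello gooding me"]
}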

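# limit token count token filter
# Purpose : limits the number of tokens that are output
# Options :
#   1. max_token_count    : maximum number of tokens to keep, default 1
#   2. consume_all_tokens : if true, the filter keeps reading the token stream after the limit is reached (default false);
#                           in my test it made no visible difference to the output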
GET /_analyze
{
  "tokenizer" : "standard",
  "filter"    : [{
    "type"            : "limit",
    "max_token_count" : 2
  }],
  "text" : ["hello gooding me"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "gooding",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
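
# A sketch of the same request with consume_all_tokens enabled, for anyone
# who wants to repeat the test. As far as the filter's documentation
# describes it, this flag only makes the filter exhaust the token stream
# after the limit is reached, so the returned tokens are not expected to
# change.
GET /_analyze
{
  "tokenizer" : "standard",
  "filter"    : [{
    "type"               : "limit",
    "max_token_count"    : 2,
    "consume_all_tokens" : true
  }],
  "text" : ["hello gooding me"]
}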