
es - elasticsearch custom analyzers - built-in token filters - 2

Tags: stack - es

There is no perfect program in this world, but that does not discourage us: writing programs is a continuous pursuit of perfection.

Custom analyzers:

  1. Character filters:
    1. Purpose: add, remove, or transform characters
    2. Count: zero or more
    3. Built-in character filters:
      1. HTML Strip Character Filter: strips HTML tags
      2. Mapping Character Filter: mapping-based replacement
      3. Pattern Replace Character Filter: regex-based replacement
  2. Tokenizer:
    1. Purpose:
      1. splits text into tokens
      2. records token order and position (for phrase queries)
      3. records each token's start and end offsets (for highlighting)
      4. records each token's type (for classification)
    2. Count: exactly one
    3. Categories:
      1. Word-oriented:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
      2. Partial-word:
        1. N-Gram
        2. Edge N-Gram
      3. Structured text:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path Hierarchy
  3. Token filters:
    1. Purpose: add, remove, or transform tokens
    2. Count: zero or more
    3. Types:
      1. apostrophe
      2. asciifolding
      3. cjk bigram
      4. cjk width
      5. classic
      6. common grams
      7. conditional
      8. decimal digit
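The three stages above always run in order: character filters first, then the single tokenizer, then the token filters. A minimal Python sketch of that pipeline (hypothetical helper functions for illustration, not Elasticsearch code):

```python
import re

# Stage 1 - character filter: strips HTML tags (like html_strip)
def html_strip(text):
    return re.sub(r"<[^>]+>", "", text)

# Stage 2 - tokenizer: exactly one per analyzer
def whitespace_tokenize(text):
    return text.split()

# Stage 3 - token filter: zero or more per analyzer
def lowercase(tokens):
    return [t.lower() for t in tokens]

# The analyzer chains the stages in fixed order.
def analyze(text):
    return lowercase(whitespace_tokenize(html_strip(text)))

print(analyze("<b>Hello</b> World"))  # ['hello', 'world']
```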

In today's demo, pay particular attention to:

  1. the common grams token filter
  2. the conditional token filter
# classic token filter
# Purpose:
#   1. removes ' and everything after it
#   2. removes the dots between acronym letters
# Applies to: output of the classic tokenizer
GET /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": ["hello this is hi's good H.J.K.M. Q.U.I.C.K. "]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hi",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<APOSTROPHE>",
      "position" : 3
    },
    {
      "token" : "good",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "HJKM",
      "start_offset" : 24,
      "end_offset" : 32,
      "type" : "<ACRONYM>",
      "position" : 5
    },
    {
      "token" : "QUICK",
      "start_offset" : 33,
      "end_offset" : 43,
      "type" : "<ACRONYM>",
      "position" : 6
    }
  ]
}
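The two rules in the comments can be sketched in Python (an approximation of the behavior seen in the result above, not the Lucene implementation; the real filter works on the classic tokenizer's token types):

```python
import re

def classic_filter(token):
    # Rule 1: drop ' and everything after it (hi's -> hi)
    token = re.sub(r"'.*$", "", token)
    # Rule 2: if the token is an acronym (letters separated by dots),
    # remove the dots (H.J.K.M. -> HJKM)
    if re.fullmatch(r"(?:[A-Za-z]\.)+", token):
        token = token.replace(".", "")
    return token

print(classic_filter("hi's"))      # hi
print(classic_filter("H.J.K.M."))  # HJKM
```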
# common grams token filter
# Purpose:
#   1. fuses the specified common words with the tokens before and after them
#   2. avoids the recall loss caused by removing stop words
# Options:
#   1. common_words      : the words to fuse
#   2. common_words_path : path to a file listing the words to fuse
#   3. ignore_case       : ignore case when matching, defaults to false
#   4. query_mode        : whether to drop the standalone common-word tokens, defaults to false (kept)
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type"         : "common_grams",
    "common_words" : ["是", "的", "Is"],
    "ignore_case"  : true,
    "query_mode"   : true
  }],
  "text": ["我是中國人", "這是我的飯", "this is my food"]
}

# Result
{
  "tokens" : [
    {
      "token" : "我_是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "gram",
      "position" : 0
    },
    {
      "token" : "是_中",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "gram",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "國",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "這_是",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 105
    },
    {
      "token" : "是_我",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 106
    },
    {
      "token" : "我_的",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "gram",
      "position" : 107
    },
    {
      "token" : "的_飯",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "gram",
      "position" : 108
    },
    {
      "token" : "this_is",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "gram",
      "position" : 209
    },
    {
      "token" : "is_my",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 210
    },
    {
      "token" : "my",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 211
    },
    {
      "token" : "food",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 212
    }
  ]
}
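The query_mode=true behavior visible in the result can be sketched as follows (hypothetical `common_grams` helper, simplified: it ignores positions and offsets, and only models query_mode=true with ignore_case=true):

```python
def common_grams(tokens, common_words):
    common = {w.lower() for w in common_words}  # ignore_case=true
    out = []
    for i, tok in enumerate(tokens):
        cur_common = tok.lower() in common
        nxt_common = i + 1 < len(tokens) and tokens[i + 1].lower() in common
        if cur_common or nxt_common:
            # Fuse with the following token; in query_mode the standalone
            # unigram for this position is dropped.
            if i + 1 < len(tokens):
                out.append(tok + "_" + tokens[i + 1])
        else:
            out.append(tok)  # ordinary token, kept as-is
    return out

print(common_grams(["this", "is", "my", "food"], ["is"]))
# ['this_is', 'is_my', 'my', 'food']
```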

# conditional token filter
# Purpose: conditional filtering - applies the wrapped filters only when a script predicate holds
# Options:
#   1. filter : the wrapped filters
#   2. script : the predicate script
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
      "type"   : "condition",
      "filter" : ["lowercase"],
      "script" : {
        "source": "token.getTerm().length() < 5"
      }
  }], 
  "text": ["THE QUICK BROWN FOX"]
}

# Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "QUICK",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "BROWN",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
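The conditional logic above boils down to: run the wrapped filter only on tokens that satisfy the predicate. A minimal sketch (hypothetical `conditional` helper, not the ES scripting engine):

```python
def conditional(tokens, predicate, wrapped):
    # Apply the wrapped filter only where the predicate holds;
    # other tokens pass through unchanged.
    return [wrapped(t) if predicate(t) else t for t in tokens]

result = conditional(
    "THE QUICK BROWN FOX".split(),
    lambda t: len(t) < 5,  # mirrors token.getTerm().length() < 5
    str.lower,             # the wrapped lowercase filter
)
print(result)  # ['the', 'QUICK', 'BROWN', 'fox']
```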
# decimal digit token filter
# Purpose: converts non-ASCII decimal digit characters (e.g. Devanagari १) to Arabic numerals 0-9
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["decimal_digit"],
  "text": ["6.7 १ १-one two-२ ३ "]
}

# Result
{
  "tokens" : [
    {
      "token" : "6.7 1 1-one two-2 3 ",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    }
  ]
}
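The same conversion can be reproduced with Python's standard library, assuming decimal_digit simply maps every Unicode decimal digit (category Nd) to its 0-9 value:

```python
import unicodedata

def decimal_digit(text):
    # Replace each Unicode decimal digit with its ASCII equivalent;
    # everything else passes through unchanged.
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

print(decimal_digit("6.7 १ १-one two-२ ३"))  # 6.7 1 1-one two-2 3
```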