
es - elasticsearch custom analyzers - built-in tokenizers

Tags: stack - es, es

There is no perfect program in this world, but that does not discourage us, because writing programs is a continuous pursuit of perfection.

Custom analyzer (the three components below are chained together in index settings; see the sketch after this list) :

  1. Character filters :
    1. Purpose : add, delete, or transform characters
    2. Count : zero or more allowed
    3. Built-in character filters :
      1. HTML Strip Character Filter : strips HTML tags
      2. Mapping Character Filter : replaces characters via a mapping table
      3. Pattern Replace Character Filter : replaces characters via a regular expression
  2. Tokenizer :
    1. Purpose :
      1. splits the text into tokens
      2. records the order and position of each token (for phrase queries)
      3. records the start and end character offsets of each token (for highlighting)
      4. records the type of each token (for classification)
    2. Count : exactly one required
    3. Categories :
      1. Word-oriented (complete words) :
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
      2. Partial-word (word fragments) :
        1. N-Gram
        2. Edge N-Gram
      3. Structured text :
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path Hierarchy
  3. Token filters :
    1. Purpose : add, delete, or transform tokens
    2. Count : zero or more allowed
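
A minimal sketch of how the three components are wired together in index settings. The index name my_index and analyzer name my_analyzer are placeholders; it assumes the built-in html_strip character filter and lowercase token filter:

# custom analyzer : char filters -> tokenizer -> token filters
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type"        : "custom",
          "char_filter" : ["html_strip"],
          "tokenizer"   : "standard",
          "filter"      : ["lowercase"]
        }
      }
    }
  }
}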

Today we demonstrate the structured-text tokenizers :

# keyword tokenizer
# returns the entire input unchanged, as a single token
GET /_analyze
{
  "tokenizer": "keyword",
  "text": ["hello world", "我是中國人"]
}

# result
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "我是中國人",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 101
    }
  ]
}
# pattern tokenizer
# regex-based tokenization : with group set, each match of the given capture group becomes a token
# config options :
#   1. pattern : the regular expression (default \W+)
#   2. flags   : Java regex flags
#   3. group   : the capture group to extract as tokens, default -1 (split on the pattern instead)
GET /_analyze
{
  "tokenizer": {
    "type" : "pattern",
    "pattern" : "((?:[a-z0-9])+)",
    "group" : 1
  },
  "text": ["hello 23456"]
}

# result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "23456",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    }
  ]
}
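
With the default group of -1 the pattern is used as a separator instead: the matches are discarded and the text between them becomes the tokens. A sketch of the split behaviour:

# pattern tokenizer in split mode (group defaults to -1)
GET /_analyze
{
  "tokenizer": {
    "type"    : "pattern",
    "pattern" : "-"
  },
  "text": ["hello-world-42"]
}
# expected tokens : "hello", "world", "42"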

# simple_pattern tokenizer
# uses Lucene regular expressions (a restricted but faster subset); each match becomes a token
# pattern must be configured : the default is the empty string, which produces no tokens
# config option : pattern
GET /_analyze
{
  "tokenizer": {
    "type" : "simple_pattern",
    "pattern" : "[0-9]{3}" 
  },
  "text": ["3456786544433 fsdfsd"]
}

# result
{
  "tokens" : [
    {
      "token" : "345",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "678",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "654",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "443",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    }
  ]
}
# char_group tokenizer
# splits on the specified characters
# config options :
#   1. tokenize_on_chars : the characters (or classes such as "whitespace") to split on
#   2. max_token_length  : default 255
GET /_analyze
{
  "tokenizer": {
    "type" : "char_group",
    "tokenize_on_chars" : [
        "-", "whitespace", "_"
      ]
  },
  "text": ["sdjflds sdfsd-sdf-7879 fsd_us9098"]
}

# result
{
  "tokens" : [
    {
      "token" : "sdjflds",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "sdfsd",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "sdf",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "7879",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "fsd",
      "start_offset" : 23,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "us9098",
      "start_offset" : 27,
      "end_offset" : 33,
      "type" : "word",
      "position" : 5
    }
  ]
}
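
If a token exceeds max_token_length, it is split at that length. A sketch; the value 5 is an arbitrary choice for illustration:

# char_group tokenizer with max_token_length
GET /_analyze
{
  "tokenizer": {
    "type" : "char_group",
    "tokenize_on_chars" : ["whitespace"],
    "max_token_length" : 5
  },
  "text": ["abcdefghij klm"]
}
# expected tokens : "abcde", "fghij", "klm"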

# simple_pattern_split tokenizer
# splits the text on a Lucene regular expression; the matches themselves are discarded
# config option : pattern
GET /_analyze
{
  "tokenizer": {
    "type" : "simple_pattern_split",
    "pattern" : "[0-9]{3}"
  },
  "text": ["sdfsd23243sdfsd890sdfs"]
}

# result
{
  "tokens" : [
    {
      "token" : "sdfsd",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "43sdfsd",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "sdfs",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 2
    }
  ]
}
# path_hierarchy tokenizer
# splits path-like hierarchical values and emits a token per level
# config options :
#   1. delimiter   : the character to split on, default /
#   2. replacement : the character written in place of the delimiter, defaults to delimiter
#   3. buffer_size : changing it is not recommended (default 1024)
#   4. reverse     : default false; when true
#       1. the emitted paths are reversed (suffixes instead of prefixes)
#       2. skip counts from the opposite end
#   5. skip        : number of initial levels to skip, default 0
GET /_analyze
{
  "tokenizer": {
    "type"        : "path_hierarchy",
    "delimiter"   : "-",
    "replacement" : "/",
    "reverse"     : true,
    "skip"        : 1
  },
  "text": ["hello-good-this-is-me"]
}

# result
{
  "tokens" : [
    {
      "token" : "hello/good/this/is/",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "good/this/is/",
      "start_offset" : 6,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "this/is/",
      "start_offset" : 11,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is/",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
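
For comparison, with the defaults (delimiter /, reverse false, skip 0) the tokenizer emits every prefix of the path, which is what makes it useful for matching a file against all of its parent directories:

# path_hierarchy tokenizer with default settings
GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": ["/one/two/three"]
}
# expected tokens : "/one", "/one/two", "/one/two/three"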