Elasticsearch Analyzers and Tokenizers

阿新 • Published: 2019-01-02

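Every analyzer, whether built-in or custom, is made of three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.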

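Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter could convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, which are applied in order.

(PS: this is similar to filters or interceptors in the Servlet world; picture a filter chain.)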
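To make the "applied in order" idea concrete, here is a minimal Python sketch of a character filter chain. It is not Elasticsearch code: `map_digits`, `strip_dashes`, and `apply_char_filters` are hypothetical names made up for illustration, and the digits are written as Unicode escapes (U+0660 to U+0669) to avoid right-to-left rendering issues.

```python
# Toy character filters; a stand-in for the idea, not the Elasticsearch implementation.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
LATIN_DIGITS = "0123456789"
DIGIT_TABLE = str.maketrans(ARABIC_INDIC_DIGITS, LATIN_DIGITS)


def map_digits(text: str) -> str:
    """Change characters: replace each Arabic-Indic digit with its Latin equivalent."""
    return text.translate(DIGIT_TABLE)


def strip_dashes(text: str) -> str:
    """Remove characters: delete every hyphen."""
    return text.replace("-", "")


def apply_char_filters(text: str, filters) -> str:
    """Run the filters in order; each one receives the previous one's output."""
    for char_filter in filters:
        text = char_filter(text)
    return text


# The input spells 2019-01-02 in Arabic-Indic digits.
raw = "\u0662\u0660\u0661\u0669-\u0660\u0661-\u0660\u0662"
result = apply_char_filters(raw, [map_digits, strip_dashes])
print(result)
```

Both filters take a string and return a string, which is why any number of them can be chained before the tokenizer ever sees the text.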

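Tokenizer

A tokenizer receives a stream of characters, breaks it into individual tokens (usually individual words), and outputs a stream of tokens. For example, a whitespace tokenizer breaks text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: the tokenizer splits the text into individual tokens, where a token is simply a single word. A piece of text is cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, as well as the start and end character offsets of the original word that the term represents. (PS: the output of tokenization is an array of terms.)

An analyzer must have exactly one tokenizer.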
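The bookkeeping described above (position plus start and end offsets) can be sketched in a few lines of Python. This is only an illustration of the idea under the assumption that "whitespace" means runs of non-whitespace characters; `whitespace_tokenize` is a hypothetical helper, not part of any Elasticsearch client.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace and record position and character offsets for each token."""
    tokens = []
    # \S+ matches one run of non-whitespace characters, i.e. one token.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


tokens = whitespace_tokenize("Quick brown fox!")
for token in tokens:
    print(token)
```

Because the offsets index into the original text, slicing the text with `start_offset` and `end_offset` gives back the token itself.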

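Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words such as "the", and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.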
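As a rough sketch of a token filter chain, the Python below runs a lowercase filter and then a stop filter over a small token stream. The function names and the one-word stop list are assumptions for illustration only. Note that the tokens that survive keep the positions they were given by the tokenizer.

```python
def lowercase_filter(tokens):
    """Change tokens: lowercase each term and leave its position untouched."""
    return [{**t, "token": t["token"].lower()} for t in tokens]


def make_stop_filter(stopwords):
    """Build a filter that removes stop words; surviving tokens keep their positions."""
    stopwords = set(stopwords)

    def stop_filter(tokens):
        return [t for t in tokens if t["token"] not in stopwords]

    return stop_filter


def apply_token_filters(tokens, filters):
    """Run the filters in order; each one receives the previous one's output."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


words = ["The", "old", "brown", "cow"]
stream = [{"token": w, "position": i} for i, w in enumerate(words)]
filtered = apply_token_filters(stream, [lowercase_filter, make_stop_filter({"the"})])
print(filtered)
```

The order matters in this sketch: if the stop filter ran first, "The" would not match the lowercase stop word "the" and would slip through.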

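Summary and review

An analyzer is a package made of three parts: character filters, a tokenizer, and token filters.

An analyzer can have zero or more character filters.

An analyzer has exactly one tokenizer.

An analyzer can have zero or more token filters.

A character filter transforms characters: it takes a character stream as input and outputs a character stream.

A tokenizer splits text: it takes a character stream and outputs a token stream (the text is split into individual words, and these words are called tokens).

A token filter filters tokens: it takes a token stream as input and outputs a token stream.

So the job of the whole analyzer is to break text into individual words: text ----> characters ----> tokens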

The whole pipeline works much like a chain of interceptors.


1. Testing an analyzer

The analyze API is a tool for inspecting how a piece of text is analyzed. (PS: it is similar to viewing an execution plan.)

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and character offsets are recorded for each term.

2. Analyzer

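2.1. Configuring built-in analyzers

Built-in analyzers can be used directly without any configuration. Their default settings can be changed, however. For example, the standard analyzer can be configured to support a list of stop words: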

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard",
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english"
            }
          }
        }
      }
    }
  }
}
'

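In this example we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses the standard analyzer and my_text.english uses the std_english analyzer. The two requests below therefore tokenize as follows: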

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text",
  "text": "The old brown cow"
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english",
  "text": "The old brown cow"
}
'

The first request uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second request uses std_english, so the result is: [ old, brown, cow ]

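2.2. Standard Analyzer (default)

standard is the default analyzer, used when none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example: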

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The text in the example above produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

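2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255.
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.2.2. Example configuration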

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

This produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
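Why does "jumped" come out as "jumpe" and "d"? With max_token_length set to 5, a longer token is cut into pieces of at most five characters. The Python sketch below imitates that behaviour. It is an approximation under stated assumptions: the regular expression is only a rough stand-in for the standard tokenizer, and the stop list contains just "the" rather than the full _english_ list.

```python
import re

# Assumption: only the stop word needed for this sentence, not the full _english_ list.
STOPWORDS = {"the"}


def split_long(token, max_len):
    """Cut a token into consecutive pieces of at most max_len characters."""
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]


def analyze(text, max_token_length=5, stopwords=STOPWORDS):
    terms = []
    # Rough stand-in for the standard tokenizer: word characters, with inner apostrophes kept.
    for word in re.findall(r"\w+(?:'\w+)*", text):
        for piece in split_long(word, max_token_length):
            piece = piece.lower()
            if piece not in stopwords:
                terms.append(piece)
    return terms


terms = analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(terms)
```

Every other word in the sentence is five characters or shorter, so "jumped" is the only one that gets split.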

2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
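Stop Token Filter (disabled by default)

You can also rebuild it as a custom analyzer and use that as a starting point for further customization: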

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and it lowercases every term. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
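The rule "split on anything that is not a letter, then lowercase" is simple enough to imitate in Python. The sketch below is only an approximation of the behaviour described here; `simple_analyze` is a hypothetical name.

```python
import re


def simple_analyze(text):
    """Keep runs of letters only and lowercase them; digits and punctuation act as breaks."""
    # [^\W\d_]+ means: word characters, minus digits and underscore, i.e. letters.
    return [run.lower() for run in re.findall(r"[^\W\d_]+", text)]


terms = simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(terms)
```

This also explains two details of the output: the digit "2" disappears because it is not a letter, and "dog's" becomes "dog" and "s" because the apostrophe is a break.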

2.3.1. Rebuilding it as a custom analyzer

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

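The stop analyzer is the same as the simple analyzer, except that it adds support for removing stop words. By default it uses the _english_ stop word list.

(PS: suppose the sentence is "this is an apple", and suppose "this" and "is" are the stop words. The simple analyzer outputs [ this, is, an, apple ], while the stop analyzer outputs [ an, apple ]. That is the difference between the two: the stop analyzer does not emit stop words; in other words, it does not treat a stop word as a term.)

(PS: stop words are not separators. The text is first split into tokens, and the tokens that appear in the stop word list are then dropped.)

2.5.1. Example output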

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

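2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (for example, _english_) or an array containing a list of stop words. Defaults to _english_.
stopwords_path: the path to a file containing stop words, relative to the Elasticsearch config directory.

2.5.3. Example configuration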

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The configuration above defines a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

With this configuration, the request produces the following output:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]

2.6. Pattern Analyzer

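The pattern analyzer uses a Java regular expression to split text into terms. The default regular expression is \W+ (all non-word characters).

2.6.1. Example output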

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Because the default pattern splits on non-word characters, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

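2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, for example CASE_INSENSITIVE or COMMENTS.
lowercase: whether terms should be lowercased. Defaults to true.
stopwords: a predefined stop word list, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.6.3. Example configuration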

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits on non-word characters or underscores and lowercases every term.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With this configuration, the example produces:

[ john, smith, foo, bar, com ]
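The pattern analyzer is essentially "split on a regular expression, then optionally lowercase", which can be sketched with Python's `re.split`. Two assumptions to keep in mind: the input address `John_Smith@foo-bar.com` is assumed here as the sample text, and Python's `\W` is Unicode-aware by default while Java's is not, so the two engines can differ on non-ASCII text. `pattern_analyze` is a hypothetical helper.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text on the pattern, drop empty pieces, and optionally lowercase."""
    terms = [piece for piece in re.split(pattern, text) if piece]
    return [t.lower() for t in terms] if lowercase else terms


# Mirrors my_email_analyzer: split on a non-word character or an underscore.
email_terms = pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_")
print(email_terms)

# Mirrors the default pattern analyzer.
default_terms = pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(default_terms)
```

The underscore has to be listed explicitly in the pattern because `_` counts as a word character, so `\W` alone would leave "John_Smith" in one piece.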

2.7. Language Analyzers

These analyzers support text in specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom analyzers

As mentioned earlier, an analyzer consists of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
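To see how the three stages of my_custom_analyzer fit together, here is a rough Python imitation of the same pipeline: strip HTML, tokenize, lowercase, fold accents to ASCII. Every stage is a simplified stand-in, not the Elasticsearch implementation: the HTML stripping is a naive regular expression, the tokenizer just takes runs of word characters, and the folding relies on Unicode decomposition, which covers accented Latin letters but not everything asciifolding handles.

```python
import html
import re
import unicodedata


def html_strip(text):
    """Character filter stand-in: drop tags and decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def tokenize(text):
    """Tokenizer stand-in: runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(term):
    """Token filter stand-in: decompose accented letters and drop the accent marks."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_custom_analyzer(text):
    text = html_strip(text)                 # character filters run first
    terms = tokenize(text)                  # then exactly one tokenizer
    terms = [t.lower() for t in terms]      # then token filters, in order
    return [ascii_fold(t) for t in terms]


terms = my_custom_analyzer("<p>Is this <b>d\u00e9ja</b> vu?</p>")
print(terms)
```

The call order inside `my_custom_analyzer` is the point of the sketch: character filters work on the raw string, the tokenizer turns the string into terms, and the token filters only ever see terms.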

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

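4. Chinese analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text.

The plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, and neither needs any configuration.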

# Install
bin/elasticsearch-plugin install analysis-smartcn
# Remove
bin/elasticsearch-plugin remove analysis-smartcn

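Now let's test it. Analyzing the sentence "今天天氣真好" with the smartcn analyzer (the request was shown in a screenshot that is not preserved here) gives:

[ 今天, 天氣, 真, 好 ]

With the standard analyzer, every Chinese character becomes its own token:

[ 今, 天, 天, 氣, 真, 好 ]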

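4.2. IK analyzer

Download the release that matches your Elasticsearch version; 6.5.3 is used here.

Then create an ik directory under the Elasticsearch plugins directory and extract the downloaded archive into it.

Finally, restart Elasticsearch.

Next, test it with the same sentence as before.

The output is as follows: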

{
    "tokens": [
        {
            "token": "今天天氣",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天氣",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}
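One thing worth noticing in this output is that several tokens cover the same characters. The short Python check below copies the tokens and offsets from the response above, confirms that each offset pair slices the token back out of the original sentence, and lists which tokens overlap. It assumes the offsets can be used directly as Python string indices, which holds for this sentence.

```python
text = "今天天氣真好"

# (token, start_offset, end_offset), copied from the response above.
ik_tokens = [
    ("今天天氣", 0, 4),
    ("今天", 0, 2),
    ("天天", 1, 3),
    ("天氣", 2, 4),
    ("真好", 4, 6),
]


def offsets_match(text, tokens):
    """True if slicing the text with each offset pair gives back the token."""
    return all(text[start:end] == token for token, start, end in tokens)


def overlapping_pairs(tokens):
    """Return every pair of tokens whose character ranges intersect."""
    pairs = []
    for i, (tok_a, start_a, end_a) in enumerate(tokens):
        for tok_b, start_b, end_b in tokens[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


print(offsets_match(text, ik_tokens))
for pair in overlapping_pairs(ik_tokens):
    print(pair)
```

Only "真好" stands alone; the first four characters are indexed under four different candidate words.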

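For this sentence the result is arguably better than smartcn's: IK keeps 真好 as a single word and also emits the overlapping candidates 今天天氣, 今天, 天天, and 天氣.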

5. References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

https://github.com/medcl/elasticsearch-analysis-ik
