Elasticsearch Analysis 02 - Analyzer
By 阿新 · Published: 2019-12-31
Standard Analyzer
- The default analyzer, suitable for most languages.
- Splits text into terms as defined by the Unicode Text Segmentation algorithm.
- Removes most punctuation, lowercases terms, and supports removing stop words.

The standard analyzer consists of:

- Tokenizer
  - Standard Tokenizer
- Token Filters
  - Standard Token Filter
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)
Standard Analyzer example

```
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
Standard Analyzer configuration

Parameter | Description |
---|---|
max_token_length | The maximum token length. Tokens longer than this are split at max_token_length intervals (not discarded). Defaults to 255. |
stopwords | A pre-defined stop word list (e.g. _english_) or an array of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
Example

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ].
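The chunk-then-filter behavior above can be sketched in Python. This is a rough simulation, not Lucene's actual implementation: the regex word extraction and the single stop word "the" are simplifying assumptions.

```python
import re

def standard_like(text, max_token_length=255, stopwords=()):
    """Rough sketch of the standard analyzer: extract words (keeping
    possessives), lowercase, split over-long tokens, drop stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+(?:'\w+)?", text)]
    chunked = []
    for t in tokens:
        # Tokens longer than max_token_length are split at that
        # interval, not discarded.
        chunked.extend(t[i:i + max_token_length]
                       for i in range(0, len(t), max_token_length))
    return [t for t in chunked if t not in stopwords]

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(standard_like(text))
print(standard_like(text, max_token_length=5, stopwords={"the"}))
```

With max_token_length=5, "jumped" comes out as jumpe and d, mirroring the example above.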
```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard",
          "stopwords": ["the", "2", "quick", "brown", "foxes", "jumped", "over", "dog's", "bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_array_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ lazy ].
Simple Analyzer
- The simple analyzer splits text whenever it encounters a non-letter character, and lowercases all terms.

The simple analyzer consists of:

- Tokenizer
  - Lower Case Tokenizer
Simple Analyzer example

```
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ].
Whitespace Analyzer
- Splits text whenever it encounters a whitespace character; it does not lowercase.

The whitespace analyzer consists of:

- Tokenizer
  - Whitespace Tokenizer
Whitespace Analyzer example

```
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ].
Stop Analyzer
- Like the simple analyzer, but also supports removing stop words. Uses the _english_ stop word list by default.

The stop analyzer consists of:

- Tokenizer
  - Lower Case Tokenizer
- Token Filters
  - Stop Token Filter
Stop Analyzer example

```
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ].
Stop Analyzer configuration

Parameter | Description |
---|---|
stopwords | A pre-defined stop word list (e.g. _english_) or an array of stop words. Defaults to _english_. |
stopwords_path | The path to a file containing stop words, relative to the Elasticsearch config directory. |
Example

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "dog", "s", "bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ quick, brown, foxes, jumped, over, lazy ].
Keyword Analyzer
- Performs no tokenization; it outputs the entire input as a single term.

The keyword analyzer consists of:

- Tokenizer
  - Keyword Tokenizer
Keyword Analyzer example

```
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ].
Pattern Analyzer
- Splits text on a regular expression; the pattern matches the separators and defaults to \W+ (non-word characters).

The pattern analyzer consists of:

- Tokenizer
  - Pattern Tokenizer
- Token Filters
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)
Pattern Analyzer example

```
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ].
Pattern Analyzer configuration

Parameter | Description |
---|---|
pattern | A Java regular expression. Defaults to \W+. |
flags | Java regular expression flags; separate multiple flags with a pipe, e.g. "CASE_INSENSITIVE\|COMMENTS". |
lowercase | Whether to lowercase terms. Defaults to true. |
stopwords | A pre-defined stop word list (e.g. _english_) or an array of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
Example

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_pattern_analyzer",
  "text": "[email protected]"
}
```

Produces [ john, smith, foo, bar, com ].
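The pattern analyzer's split-on-separator behavior can be approximated with Python's re.split. This is a sketch, not the real Lucene tokenizer, and the address jane_doe@example.com is a hypothetical stand-in for the redacted email above.

```python
import re

def pattern_like(text, pattern=r"\W+", lowercase=True, stopwords=()):
    """Sketch of the pattern analyzer: the regex matches the
    separators, empty strings are dropped, terms are lowercased,
    then stop words removed."""
    tokens = [t for t in re.split(pattern, text) if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]

print(pattern_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# Splitting on non-word characters OR underscores, as in the
# "\\W|_" configuration above:
print(pattern_like("jane_doe@example.com", pattern=r"\W|_"))
```

Note that the default \W+ treats the apostrophe as a separator, which is why dog's becomes dog and s here, unlike with the standard analyzer.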
Fingerprint Analyzer
- Lowercases, ASCII-folds (normalizes away accented and extended characters), sorts and de-duplicates terms, then concatenates them into a single token.
- Stop words can also be configured.

The fingerprint analyzer consists of:

- Tokenizer
  - Standard Tokenizer
- Token Filters (in order)
  - Lower Case Token Filter
  - ASCII Folding Token Filter
  - Stop Token Filter (disabled by default)
  - Fingerprint Token Filter
Fingerprint Analyzer example

```
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

Produces [ and consistent godel is said sentence this yes ].
Fingerprint Analyzer configuration

Parameter | Description |
---|---|
separator | The character used to join terms. Defaults to a space. |
max_output_size | The maximum token size. If the concatenated fingerprint exceeds this size, the entire token is discarded (not just the excess). Defaults to 255. |
stopwords | A pre-defined stop word list (e.g. _english_) or an array of stop words. Defaults to _none_. |
stopwords_path | The path to a file containing stop words. |
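The whole fingerprint pipeline is easy to simulate. In this sketch, ASCII folding is approximated with Unicode NFKD decomposition (which handles accented Latin characters but not everything Lucene's ASCIIFoldingFilter covers), and the stop word set is passed explicitly rather than via _english_.

```python
import re
import unicodedata

def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    """Sketch of the fingerprint analyzer: lowercase, ASCII-fold,
    drop stop words, sort, de-duplicate, join into one token."""
    folded = (unicodedata.normalize("NFKD", text.lower())
              .encode("ascii", "ignore").decode())
    terms = sorted({t for t in re.split(r"\W+", folded)
                    if t and t not in stopwords})
    out = separator.join(terms)
    # An over-long fingerprint is discarded entirely, not truncated.
    return [out] if out and len(out) <= max_output_size else []

print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
```

Passing stopwords={"and", "is", "this"} (a tiny stand-in for _english_) reproduces the [ consistent godel said sentence yes ] output of the configured example below; a small max_output_size makes the result vanish entirely.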
Example

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

Produces [ consistent godel said sentence yes ].
```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_",
          "separator": "-"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

Produces [ consistent-godel-said-sentence-yes ].
```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "max_output_size": 30
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
```

Produces nothing: the concatenated fingerprint is longer than 30 characters, so the entire token is discarded.
Additional notes

- Whitespace splits only on whitespace, while Standard extracts words. For Brown-Foxes, Whitespace keeps it intact, while Standard produces brown and foxes.
- Simple splits on every non-letter, while Standard does not always. For dog's, Simple produces dog and s, while Standard keeps dog's.
- In short: Whitespace splits on whitespace, Simple splits on non-letters, and Standard extracts words (including possessive forms).
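The three behaviors can be contrasted in a few lines of Python; the regexes here are approximations of each tokenizer, not the real implementations.

```python
import re

text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."

# Whitespace: split on whitespace only; case and punctuation survive.
whitespace = text.split()
# Simple: split on any non-letter and lowercase.
simple = [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]
# Standard (approximated): extract words, keeping possessives; lowercase.
standard = [t.lower() for t in re.findall(r"\w+(?:'\w+)?", text)]

print(whitespace)  # 'Brown-Foxes' stays intact
print(simple)      # "dog's" becomes 'dog', 's'; '2' disappears
print(standard)    # 'brown' and 'foxes' split apart; "dog's" kept
```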
Custom Analyzer

A custom analyzer is built from:

- 0 or more character filters
- exactly one tokenizer
- 0 or more token filters

Custom analyzer configuration

Parameter | Description |
---|---|
tokenizer | A built-in or custom tokenizer. |
char_filter | Built-in or custom character filters. Optional. |
filter | Built-in or custom token filters. Optional. |
position_increment_gap | For a multi-valued field, a gap added between the token positions of consecutive values to prevent phrase queries from matching across values. Defaults to 100. For example, with [ "John Abraham", "Lincoln Smith" ], the tokens of the second value start 100 positions after the last token of the first, so a phrase query for "Abraham Lincoln" does not match. See the position_increment_gap section of the Mapping article for details. |
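The effect of position_increment_gap can be illustrated with a small position counter; a sketch under the assumption that positions start at 0, as the _analyze API reports them:

```python
def positions(values, gap=100):
    """Assign token positions across a multi-valued field, inserting
    a gap before the first token of every subsequent value."""
    pos, out = -1, []
    for i, value in enumerate(values):
        for j, token in enumerate(value.split()):
            pos += gap if (i > 0 and j == 0) else 1
            out.append((token.lower(), pos))
    return out

print(positions(["John Abraham", "Lincoln Smith"]))
# john=0, abraham=1, lincoln=101, smith=102
```

With the default gap of 100, lincoln lands 100 positions after abraham, so a phrase query for "abraham lincoln" cannot match across the two values.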
Example 1:

- Character Filter
  - HTML Strip Character Filter
- Tokenizer
  - Standard Tokenizer
- Token Filters
  - Lowercase Token Filter
  - ASCII-Folding Token Filter

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
```

Produces [ is, this, deja, vu ].
Example 2:

- Character Filter
  - Mapping Character Filter: replaces :) with _happy_ and :( with _sad_
- Tokenizer
  - Pattern Tokenizer
- Token Filters
  - Lowercase Token Filter
  - Stop Token Filter
```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
```

Produces [ i'm, _happy_, person, you ].
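The whole pipeline of example 2 can be traced in a short Python sketch; the stop word set here is a tiny stand-in for _english_, and the replacements and split pattern mirror the configuration above.

```python
import re

def custom_analyze(text):
    """Sketch of the custom analyzer: mapping character filter,
    punctuation pattern tokenizer, lowercase, stop word removal."""
    # Character filter: rewrite emoticons before tokenization.
    for old, new in ((":)", "_happy_"), (":(", "_sad_")):
        text = text.replace(old, new)
    # Tokenizer: split on the punctuation character class.
    tokens = [t for t in re.split(r"[ .,!?]", text) if t]
    # Token filters: lowercase, then drop stop words.
    stopwords = {"a", "and", "the"}  # tiny stand-in for _english_
    return [t.lower() for t in tokens if t.lower() not in stopwords]

print(custom_analyze("I'm a :) person, and you?"))
```

Because the character filter runs before the tokenizer, :) has already become _happy_ by the time the pattern tokenizer sees the text, so it survives as a single term.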