
999 - Elasticsearch Analysis 02 - Analyzer

Standard Analyzer

  • The default analyzer, suitable for most languages.
  • Splits text into tokens as defined by the Unicode Text Segmentation algorithm.
  • Removes most punctuation, lowercases tokens, and supports removing stop words.

The standard analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Standard Token Filter
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Standard Analyzer Example

POST _analyze
{
  "analyzer"
: "standard","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." } 複製程式碼

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Standard Analyzer Configuration

  • max_token_length: The maximum token length; if a token exceeds it, the token is split at max_token_length intervals. Defaults to 255.
  • stopwords: A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.
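
stopwords_path is not demonstrated in the examples that follow. A minimal sketch, assuming a file analysis/my_stopwords.txt (one stop word per line) under the Elasticsearch config directory; the analyzer name my_file_stop_analyzer is made up for illustration:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "standard",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}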

Examples

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard","max_token_length": 5,"stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
} 複製程式碼

Produces [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ] (jumped exceeds max_token_length and is split into jumpe and d; the is removed as an English stop word)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard","stopwords": ["the","2","quick","brown","foxes","jumped","over","dog's","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_array_analyzer","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
複製程式碼

Produces [ lazy ]

Simple Analyzer

  • The simple analyzer splits text whenever it encounters a non-letter character, and lowercases the tokens.

The simple analyzer consists of:

  • Tokenizer
    • Lower Case Tokenizer

Simple Analyzer Example

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Whitespace Analyzer

  • Splits text whenever it encounters a whitespace character.

The whitespace analyzer consists of:

  • Tokenizer
    • Whitespace Tokenizer

Whitespace Analyzer Example

POST _analyze
{
  "analyzer": "whitespace","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
複製程式碼

Produces [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Stop Analyzer

  • Like the simple analyzer, but supports removing stop words. Uses the _english_ stop word list by default.

The stop analyzer consists of:

  • Tokenizer
    • Lower Case Tokenizer
  • Token Filters
    • Stop Token Filter

Stop Analyzer Example

POST _analyze
{
  "analyzer": "stop","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
複製程式碼

Produces [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Stop Analyzer Configuration

  • stopwords: A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _english_.
  • stopwords_path: The path to a file containing stop words, relative to the Elasticsearch config directory.

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer":{
          "type": "stop","stopwords":  ["the","dog","s","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
複製程式碼

Produces [ quick, brown, foxes, jumped, over, lazy ]; only the, dog, s, and bone are removed by the configured stop word list.

Keyword Analyzer

  • Does not tokenize; outputs the entire input as a single token.

The keyword analyzer consists of:

  • Tokenizer
    • Keyword Tokenizer

Keyword Analyzer Example

POST _analyze
{
  "analyzer": "keyword","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
複製程式碼

Produces [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

Pattern Analyzer

  • Splits text on a regular expression; the default pattern is \W+ (all non-word characters).

The pattern analyzer consists of:

  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Pattern Analyzer Example

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Pattern Analyzer Configuration

  • pattern: A Java regular expression. Defaults to \W+.
  • flags: Java regular expression flags; separate multiple flags with |, e.g. "CASE_INSENSITIVE|COMMENTS".
  • lowercase: Whether to lowercase the tokens. Defaults to true.
  • stopwords: A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.
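
The flags parameter is not demonstrated below. A minimal sketch, assuming we want to split on the literal word "and" regardless of case; the analyzer name my_flags_analyzer and the sample text are made up for illustration:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_flags_analyzer": {
          "type": "pattern",
          "pattern": "\\s*and\\s*",
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_flags_analyzer",
  "text": "fish AND chips and peas"
}

This should produce [ fish, chips, peas ], since CASE_INSENSITIVE lets the pattern match both AND and and.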

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern","pattern": "\\W|_","lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_pattern_analyzer","text": "[email protected]"
}
複製程式碼

Produces [ john, smith, foo, bar, com ]

Fingerprint Analyzer

  • Lowercases the text, normalizes it by removing extended characters (ASCII folding), sorts the tokens, and removes duplicates.
  • Stop words can also be configured.

The fingerprint analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters (in order)
    • Lower Case Token Filter
    • ASCII Folding Token Filter
    • Stop Token Filter (disabled by default)
    • Fingerprint Token Filter

Fingerprint Analyzer Example

POST _analyze
{
  "analyzer": "fingerprint","text": "Yes yes,Gödel said this sentence is consistent and."
}
複製程式碼

Produces a single token: [ and consistent godel is said sentence this yes ]

Fingerprint Analyzer Configuration

  • separator: The character used to join the tokens. Defaults to a space.
  • max_output_size: The maximum size of the output token; if the fingerprint exceeds it, the whole token is discarded, not truncated. Defaults to 255.
  • stopwords: A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Examples

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint","stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",Gödel said this sentence is consistent and."
}

複製程式碼

Produces [ consistent godel said sentence yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint","stopwords": "_english_","separator": "-"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",Gödel said this sentence is consistent and."
}
複製程式碼

Produces [ consistent-godel-said-sentence-yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint","max_output_size": 30
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",Gödel said this sentence is consistent and."
}
複製程式碼

Produces nothing: the fingerprint is longer than max_output_size, so the entire token is discarded.

Additional Notes

  • The whitespace analyzer splits only on whitespace, while the standard analyzer extracts words. For Brown-Foxes, whitespace leaves the token unchanged, whereas standard splits it into brown and foxes.
  • The simple analyzer splits on every non-letter character, while the standard analyzer does not always. For dog's, simple produces dog and s, whereas standard keeps dog's.
  • In short: whitespace splits on whitespace, simple splits on non-letters, and standard extracts words (including possessive forms). The comparison below shows this side by side.
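
A minimal sketch running the same phrase through all three analyzers:

POST _analyze
{
  "analyzer": "whitespace",
  "text": "Brown-Foxes dog's"
}

POST _analyze
{
  "analyzer": "simple",
  "text": "Brown-Foxes dog's"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "Brown-Foxes dog's"
}

These should produce [ Brown-Foxes, dog's ], [ brown, foxes, dog, s ], and [ brown, foxes, dog's ] respectively.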

Custom Analyzer

A custom analyzer is built from:

  • zero or more character filters
  • one tokenizer
  • zero or more token filters

Custom Analyzer Configuration

  • tokenizer: A built-in or custom tokenizer.
  • char_filter: Built-in or custom character filters. Optional.
  • filter: Built-in or custom token filters. Optional.
  • position_increment_gap: When a field value is an array with multiple values, token positions are shifted between values to prevent phrase queries from matching across values. Defaults to 100. For example, [ "John Abraham", "Lincoln Smith" ] is tokenized with positions 0, 1, 101, 102, which prevents cross-value matches; see the position_increment_gap section of the Mapping article, and the sketch below.
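
A minimal sketch of the position_increment_gap effect, assuming a typeless-mapping Elasticsearch version (7.x or later); the field name names and the sample data are made up for illustration:

PUT my_index
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}

PUT my_index/_doc/1
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

The phrase query should find nothing, because Abraham (position 1) and Lincoln (position 101) are separated by the gap.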

Example 1:

  • Character Filter
    • HTML Strip Character Filter
  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Lowercase Token Filter
    • ASCII-Folding Token Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom","tokenizer": "standard","char_filter":[
            "html_strip"
            ],"filter": [
            "lowercase","asciifolding"
            ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer","text": "Is this <b>déjà vu</b>?"
}
複製程式碼

Produces [ is, this, deja, vu ]

Example 2

  • Character Filter
    • Mapping Character Filter: replaces :) with _happy_ and :( with _sad_
  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lowercase Token Filter
    • Stop Token Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom","char_filter": [
              "emoticons"
            ],"tokenizer": "punctuation","english_stop"
            ]
        }
      },"tokenizer": {
        "punctuation": {
          "type": "pattern","pattern": "[ .,!?]"
        }
      },"char_filter": {
        "emoticons": {
          "type": "mapping","mappings": [
            ":) => _happy_",":( => _sad_"
            ]
        }
      },"filter": {
        "english_stop":{
          "type": "stop","stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer","text":     "I'm a :) person,and you?"
}
複製程式碼

Produces [ i'm, _happy_, person, you ]