The Elasticsearch tokenization (analysis) process
阿新 • Published: 2019-01-11
| analyzer | logical name | description |
| ----------------------|:-------------:| :-----------------------------------------|
| standard analyzer | standard | standard tokenizer, standard filter, lower case filter, stop filter |
| simple analyzer | simple | lower case tokenizer |
| stop analyzer | stop | lower case tokenizer, stop filter |
| keyword analyzer | keyword | No tokenization; the whole input is emitted as a single token (like not_analyzed) |
| pattern analyzer | pattern | Splits text with a regular expression; the default pattern is `\W+` |
| language analyzers | [lang](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html) | Analyzers for specific languages |
| snowball analyzer | snowball | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter |
| custom analyzer | custom | One tokenizer, zero or more token filters, zero or more char filters |

tokenizer: the tokenizers built into ES.

| tokenizer | logical name | description |
| ----------------------|:-------------:| :-------------------------------------|
| standard tokenizer | standard | |
| edge ngram tokenizer | edgeNGram | |
| keyword tokenizer | keyword | No tokenization |
| letter tokenizer | letter | Splits into words at non-letter characters |
| lowercase tokenizer | lowercase | letter tokenizer, lower case filter |
| ngram tokenizer | nGram | |
| whitespace tokenizer | whitespace | Splits on whitespace |
| pattern tokenizer | pattern | Splits on a separator defined by a regular expression |
| uax url email tokenizer| uax_url_email | Keeps URLs and email addresses as single tokens |
| path hierarchy tokenizer| path_hierarchy| Handles strings shaped like `/path/to/something` |

token filter: the token filters built into ES.

| token filter | logical name | description |
| ----------------------|:-------------:| :-------------------------------------|
| standard filter | standard | |
| ascii folding filter | asciifolding | |
| length filter | length | Removes tokens that are too long or too short |
| lowercase filter | lowercase | Converts tokens to lower case |
| ngram filter | nGram | |
| edge ngram filter | edgeNGram | |
| porter stem filter | porterStem | Porter stemming algorithm |
| shingle filter | shingle | Combines adjacent tokens into word n-grams (shingles) |
| stop filter | stop | Removes stop words |
| word delimiter filter | word_delimiter| Splits a word further into sub-tokens |
| stemmer token filter | stemmer | |
| stemmer override filter| stemmer_override| |
| keyword marker filter | keyword_marker| |
| keyword repeat filter | keyword_repeat| |
| kstem filter | kstem | |
| snowball filter | snowball | |
| phonetic filter | phonetic | [plugin](https://github.com/elasticsearch/elasticsearch-analysis-phonetic) |
| synonym filter | synonym | Handles synonyms |
| compound word filter | dictionary_decompounder, hyphenation_decompounder | Breaks compound words apart |
| reverse filter | reverse | Reverses each token |
| elision filter | elision | Removes elisions (for example the French `l'` prefix) |
| truncate filter | truncate | Truncates tokens to a fixed length |
| unique filter | unique | |
| pattern capture filter| pattern_capture| |
| pattern replace filter | pattern_replace| Replaces text using a regular expression |
| trim filter | trim | Trims whitespace surrounding a token |
| limit token count filter| limit | Limits the number of tokens |
| hunspell filter | hunspell | Dictionary-based stemming using Hunspell (spell-checker) dictionaries |
| common grams filter | common_grams | |
| normalization filter | arabic_normalization, persian_normalization | |

character filter: the character filters built into ES.

| character filter | logical name | description |
| --------------------------|:-------------:| :-------------------------|
| mapping char filter | mapping | Replaces characters according to a configured mapping |
| html strip char filter | html_strip | Strips HTML elements |
| pattern replace char filter| pattern_replace| Rewrites the string using a regular expression |
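The custom analyzer row describes a pipeline: char filters rewrite the raw string, one tokenizer splits it, and token filters then transform the token list. The sketch below is a toy pure-Python model of that order of operations, not Elasticsearch's implementation. All function names, the `STOP_WORDS` set, and the analyzer name `my_analyzer` are assumptions made for illustration; the `CUSTOM_ANALYZER_SETTINGS` dict only shows the general shape of index settings that reference the built-in logical names from the tables.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]

# Illustrative stop words only; not the list Elasticsearch ships with.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def html_strip(text: str) -> str:
    """Toy stand-in for the html_strip char filter: blank out anything tag-like."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace, like the whitespace tokenizer."""
    return text.split()


def pattern_tokenizer(text: str, pattern: str = r"\W+") -> List[str]:
    """Split on a separator regex (default \\W+), dropping empty pieces."""
    return [tok for tok in re.split(pattern, text) if tok]


def lowercase_filter(tokens: List[str]) -> List[str]:
    return [tok.lower() for tok in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    return [tok for tok in tokens if tok not in STOP_WORDS]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Run char filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Shape of index settings for a comparable custom analyzer.
# "my_analyzer" is a hypothetical name chosen for this example.
CUSTOM_ANALYZER_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    }
}

print(analyze(
    "<p>The Quick brown fox</p>",
    [html_strip],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
))  # ['quick', 'brown', 'fox']
```

Order matters in this model: placing `stop_filter` before `lowercase_filter` would let "The" through, because the stop set holds only lower-case words.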