
ElasticSearch Custom Analyzers: Integrating the jieba Segmentation Plugin

About the jieba segmentation plugin for ElasticSearch:

https://github.com/huaban/elasticsearch-analysis-jieba

The plugin is developed by huaban and supports Elasticsearch versions <= 2.3.5.

The jieba analyzers

The jieba plugin provides three analyzers: jieba_index, jieba_search, and jieba_other.

  1. jieba_index: index-time analysis, with finer-grained segmentation;
  2. jieba_search: query-time analysis, with coarser-grained segmentation;
  3. jieba_other: full-width to half-width conversion, lowercasing, and character-level tokenization.

Using the jieba_index or jieba_search analyzer is enough for basic segmentation.

Here is a minimal configuration example:

{
    "mappings": {
        "test": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "jieba_index",
                    "search_analyzer": "jieba_index"
                }
            }
        }
    }
}

In a production environment, business requirements make two more features necessary:

  1. synonym support;
  2. character-filter support.

The jieba_index and jieba_search analyzers provided by the plugin cannot deliver either feature.

Custom analyzers

When the jieba_index and jieba_search analyzers fall short of production requirements, we can solve the problems above with custom analyzers.

An analyzer is composed of character filters, a tokenizer, and token filters.

A single analyzer may combine multiple character filters + one tokenizer + multiple token filters.

Our business requirement calls for a mapping character filter that replaces certain strings before tokenization, e.g. rewriting the user input c# to csharp and c++ to cplus.

The components of the analyzer are introduced one by one below.

1. The mapping character filter (Mapping Char Filter)

This is Elasticsearch's built-in mapping character filter, defined under settings -> analysis -> char_filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings": [
                      "c# => csharp",
                      "c++ => cplus"
                  ]
                }
            }
        }
    }
}
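Conceptually, the mapping char filter performs literal string substitution on the raw text before it ever reaches the tokenizer. A minimal Python sketch of that behavior (an illustration only, not Elasticsearch's actual implementation):

```python
# Toy sketch of what a mapping char filter does: literal, ordered
# string replacement applied to the raw text before tokenization.
MAPPINGS = {
    "c#": "csharp",
    "c++": "cplus",
}

def mapping_char_filter(text, mappings=MAPPINGS):
    # Apply longer keys first so shorter keys cannot shadow longer ones.
    for src in sorted(mappings, key=len, reverse=True):
        text = text.replace(src, mappings[src])
    return text

print(mapping_char_filter("我喜歡c#和c++"))  # 我喜歡csharp和cplus
```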

The character mapping table can also be loaded from a file:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings_path": "mappings.txt"
                }
            }
        }
    }
}

By default the file is stored in the config directory, i.e. config/mappings.txt.
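The file uses the same `key => value` syntax as the inline form, one mapping per line; a mappings.txt equivalent to the earlier example would look like:

```
c# => csharp
c++ => cplus
```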

2. The jieba token filter (JiebaTokenFilter)

JiebaTokenFilter accepts a SegMode parameter with two possible values: Index and Search.

We pre-define two token filters, jieba_index_filter and jieba_search_filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "jieba_index_filter": {
                    "type": "jieba",
                    "seg_mode": "index"
                },
                "jieba_search_filter": {
                    "type": "jieba",
                    "seg_mode": "search"
                }
            }
        }
    }
}

These two token filters will be used by the index analyzer and the search analyzer respectively.

3. The stop token filter

The JiebaTokenFilter does not handle stop words, so our custom analyzer needs a stop-word token filter of its own.

Elasticsearch ships with a stop token filter, which we can define like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "stop_filter": {
                    "type":       "stop",
                    "stopwords": ["and", "is", "the"]
                }
            }
        }
    }
}
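The stop filter's job is simply to drop any token that appears in the stop-word list. A rough Python sketch (illustrative, not Elasticsearch's code):

```python
# Toy sketch of a stop-word token filter: remove tokens in the list.
STOPWORDS = {"and", "is", "the"}

def stop_filter(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]

print(stop_filter(["this", "is", "the", "test"]))  # ['this', 'test']
```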

The stop-word list can also be loaded from a file:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords_path": "stopwords.txt"
                }
            }
        }
    }
}

By default the file is stored in the config directory, i.e. config/stopwords.txt, with one stop word per line.

4. The synonym token filter

We use Elasticsearch's built-in synonym token filter to provide synonym support:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms": [
                      "中文,漢語,漢字"
                  ]
                }
            }
        }
    }
}
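With the default settings the synonym filter works in "expand" mode: a token that belongs to a synonym group is expanded into every term of the group. A toy Python sketch of that idea (real Elasticsearch emits the synonyms at the same token position, which this sketch ignores):

```python
# Toy sketch of a synonym token filter in "expand" mode: a token that
# belongs to a synonym group is replaced by every term in the group.
SYNONYM_GROUPS = [["中文", "漢語", "漢字"]]

def synonym_filter(tokens, groups=SYNONYM_GROUPS):
    index = {term: group for group in groups for term in group}
    out = []
    for t in tokens:
        out.extend(index.get(t, [t]))
    return out

print(synonym_filter(["學習", "中文"]))
# ['學習', '中文', '漢語', '漢字']
```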

When the synonym list is large, loading it from a file is recommended:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt"
                }
            }
        }
    }
}

5. Redefining the jieba_index and jieba_search analyzers

Elasticsearch supports multi-stage analysis. We use the whitespace tokenizer as the tokenizer, and add the jieba token filters defined above, jieba_index_filter and jieba_search_filter, to the filter chain:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "jieba_index": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_index_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                },
                "jieba_search": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_search_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                }
            }
        }
    }
}

Note that the analyzers above are still named jieba_index and jieba_search, so that they override the analyzers provided by the jieba plugin.

When several analyzers share the same name, Elasticsearch gives priority to the one defined in the index settings.

This way, nothing needs to change at the calling-code level.
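The whole custom analyzer is then just a pipeline: char filters first, then the tokenizer, then the token filters in the order listed. A simplified end-to-end Python sketch (the jieba segmentation step is omitted, since that logic lives in the plugin):

```python
# Simplified sketch of the custom analyzer pipeline:
# char filter -> whitespace tokenizer -> token filters (in order).
MAPPINGS = {"c#": "csharp", "c++": "cplus"}
STOPWORDS = {"and", "is", "the"}
SYNONYM_GROUPS = [["中文", "漢語", "漢字"]]

def analyze(text):
    # 1. mapping char filter: literal replacements on the raw text
    for src in sorted(MAPPINGS, key=len, reverse=True):
        text = text.replace(src, MAPPINGS[src])
    # 2. whitespace tokenizer
    tokens = text.split()
    # 3. token filters, in the order they are listed in the analyzer
    #    (the jieba segmentation filter would run first in the real chain)
    tokens = [t for t in tokens if t not in STOPWORDS]   # stop_filter
    index = {t: g for g in SYNONYM_GROUPS for t in g}    # synonym_filter
    out = []
    for t in tokens:
        out.extend(index.get(t, [t]))
    return out

print(analyze("the c# and 中文"))  # ['csharp', '中文', '漢語', '漢字']
```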

The complete configuration follows:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings_path": "mappings.txt"
                }
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt"
                },
                "stop_filter": {
                    "type": "stop",
                    "stopwords_path": "stopwords.txt"
                },
                "jieba_index_filter": {
                    "type": "jieba",
                    "seg_mode": "index"
                },
                "jieba_search_filter": {
                    "type": "jieba",
                    "seg_mode": "search"
                }
            },
            "analyzer": {
                "jieba_index": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_index_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                },
                "jieba_search": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_search_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                }
            }
        }
    }
}
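Once the index is created, the analysis chain can be spot-checked with the _analyze API (the sample text here is just an example):

```
GET /my_index/_analyze
{
    "analyzer": "jieba_index",
    "text": "中文分詞"
}
```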

References:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index.html

http://www.tuicool.com/articles/eUJJ3qF