
ElasticSearch Custom Analyzers: Integrating the jieba Segmentation Plugin

About the jieba segmentation plugin for ElasticSearch:

https://github.com/huaban/elasticsearch-analysis-jieba

The plugin is developed by huaban and supports Elasticsearch versions <= 2.3.5.

The jieba analyzers

The jieba plugin provides three analyzers: jieba_index, jieba_search, and jieba_other.

  1. jieba_index: index-time analysis, with finer-grained segmentation;
  2. jieba_search: query-time analysis, with coarser-grained segmentation;
  3. jieba_other: full-width to half-width conversion, lowercasing, and character-level tokenization.

Using the jieba_index or jieba_search analyzer is enough for basic segmentation.

Here is a minimal configuration example:

{
    "mappings": {
        "test": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "jieba_index",
                    "search_analyzer": "jieba_index"
                }
            }
        }
    }
}

In a production environment, business requirements make two more features necessary:

  1. synonym support;
  2. character-filter support.

The jieba_index and jieba_search analyzers provided by the plugin cannot deliver either feature.

Custom analyzers

When the jieba_index and jieba_search analyzers fall short of production requirements, we can solve the problems above with custom analyzers.

An analyzer is composed of character filters, a tokenizer, and token filters.

A single analyzer may combine multiple character filters + one tokenizer + multiple token filters.

Our business requirement calls for a mapping character filter that replaces certain strings before tokenization, e.g. rewriting the user input c# to csharp and c++ to cplus.

The components of the analyzer are introduced one by one below.

1. The mapping character filter (Mapping Char Filter)

This is Elasticsearch's built-in mapping character filter, defined under settings -> analysis -> char_filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings": [
                      "c# => csharp",
                      "c++ => cplus"
                  ]
                }
            }
        }
    }
}
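Conceptually, the mapping char filter performs literal string substitution on the raw text before it ever reaches the tokenizer. A minimal Python sketch of that behavior (an illustration only, not Elasticsearch's actual implementation):

```python
# Toy sketch of what a mapping char filter does: literal, ordered
# string replacement applied to the raw text before tokenization.
MAPPINGS = {
    "c#": "csharp",
    "c++": "cplus",
}

def mapping_char_filter(text, mappings=MAPPINGS):
    # Apply longer keys first so shorter keys cannot shadow longer ones.
    for src in sorted(mappings, key=len, reverse=True):
        text = text.replace(src, mappings[src])
    return text

print(mapping_char_filter("我喜歡c#和c++"))  # 我喜歡csharp和cplus
```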

The character mapping table can also be loaded from a file:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings_path": "mappings.txt"
                }
            }
        }
    }
}

By default the file is stored in the config directory, i.e. config/mappings.txt.
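The file uses the same `key => value` syntax as the inline form, one mapping per line; a mappings.txt equivalent to the earlier example would look like:

```
c# => csharp
c++ => cplus
```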

2. The jieba token filter (JiebaTokenFilter)

JiebaTokenFilter accepts a SegMode parameter with two possible values: Index and Search.

We pre-define two token filters, jieba_index_filter and jieba_search_filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "jieba_index_filter": {
                    "type": "jieba",
                    "seg_mode": "index"
                },
                "jieba_search_filter": {
                    "type": "jieba",
                    "seg_mode": "search"
                }
            }
        }
    }
}

These two token filters will be used by the index analyzer and the search analyzer respectively.

3. The stop token filter

The JiebaTokenFilter does not handle stop words, so our custom analyzer needs a stop-word token filter of its own.

Elasticsearch ships with a stop token filter, which we can define like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "stop_filter": {
                    "type":       "stop",
                    "stopwords": ["and", "is", "the"]
                }
            }
        }
    }
}
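The stop filter's job is simply to drop any token that appears in the stop-word list. A rough Python sketch (illustrative, not Elasticsearch's code):

```python
# Toy sketch of a stop-word token filter: remove tokens in the list.
STOPWORDS = {"and", "is", "the"}

def stop_filter(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]

print(stop_filter(["this", "is", "the", "test"]))  # ['this', 'test']
```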

The stop-word list can also be loaded from a file:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords_path": "stopwords.txt"
                }
            }
        }
    }
}

By default the file is stored in the config directory, i.e. config/stopwords.txt, with one stop word per line.

4. The synonym token filter

We use Elasticsearch's built-in synonym token filter to provide synonym support:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms": [
                      "中文,漢語,漢字"
                  ]
                }
            }
        }
    }
}
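With the default settings the synonym filter works in "expand" mode: a token that belongs to a synonym group is expanded into every term of the group. A toy Python sketch of that idea (real Elasticsearch emits the synonyms at the same token position, which this sketch ignores):

```python
# Toy sketch of a synonym token filter in "expand" mode: a token that
# belongs to a synonym group is replaced by every term in the group.
SYNONYM_GROUPS = [["中文", "漢語", "漢字"]]

def synonym_filter(tokens, groups=SYNONYM_GROUPS):
    index = {term: group for group in groups for term in group}
    out = []
    for t in tokens:
        out.extend(index.get(t, [t]))
    return out

print(synonym_filter(["學習", "中文"]))
# ['學習', '中文', '漢語', '漢字']
```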

When the synonym list is large, loading it from a file is recommended:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt"
                }
            }
        }
    }
}

5. Redefining the jieba_index and jieba_search analyzers

Elasticsearch supports multi-stage analysis. We use the whitespace tokenizer as the tokenizer, and add the jieba token filters defined above, jieba_index_filter and jieba_search_filter, to the filter chain:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "jieba_index": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_index_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                },
                "jieba_search": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_search_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                }
            }
        }
    }
}

Note that the analyzers above are still named jieba_index and jieba_search, so that they override the analyzers provided by the jieba plugin.

When several analyzers share the same name, Elasticsearch gives priority to the one defined in the index settings.

This way, nothing needs to change at the calling-code level.
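The whole custom analyzer is then just a pipeline: char filters first, then the tokenizer, then the token filters in the order listed. A simplified end-to-end Python sketch (the jieba segmentation step is omitted, since that logic lives in the plugin):

```python
# Simplified sketch of the custom analyzer pipeline:
# char filter -> whitespace tokenizer -> token filters (in order).
MAPPINGS = {"c#": "csharp", "c++": "cplus"}
STOPWORDS = {"and", "is", "the"}
SYNONYM_GROUPS = [["中文", "漢語", "漢字"]]

def analyze(text):
    # 1. mapping char filter: literal replacements on the raw text
    for src in sorted(MAPPINGS, key=len, reverse=True):
        text = text.replace(src, MAPPINGS[src])
    # 2. whitespace tokenizer
    tokens = text.split()
    # 3. token filters, in the order they are listed in the analyzer
    #    (the jieba segmentation filter would run first in the real chain)
    tokens = [t for t in tokens if t not in STOPWORDS]   # stop_filter
    index = {t: g for g in SYNONYM_GROUPS for t in g}    # synonym_filter
    out = []
    for t in tokens:
        out.extend(index.get(t, [t]))
    return out

print(analyze("the c# and 中文"))  # ['csharp', '中文', '漢語', '漢字']
```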

The complete configuration follows:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings_path": "mappings.txt"
                }
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt"
                },
                "stop_filter": {
                    "type": "stop",
                    "stopwords_path": "stopwords.txt"
                },
                "jieba_index_filter": {
                    "type": "jieba",
                    "seg_mode": "index"
                },
                "jieba_search_filter": {
                    "type": "jieba",
                    "seg_mode": "search"
                }
            },
            "analyzer": {
                "jieba_index": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_index_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                },
                "jieba_search": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_search_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                }
            }
        }
    }
}
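Once the index is created, the analysis chain can be spot-checked with the _analyze API (the sample text here is just an example):

```
GET /my_index/_analyze
{
    "analyzer": "jieba_index",
    "text": "中文分詞"
}
```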

References:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index.html

http://www.tuicool.com/articles/eUJJ3qF