
Adding Chinese Word Segmentation to Elasticsearch

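Adding Chinese word segmentation

You can integrate a Chinese word segmentation component into ES yourself. medcl has written three Chinese analysis plugins for ES: one based on ik, one based on mmseg, and one based on pinyin4j. The sections below describe how to integrate each of the three with ES.

1. Integrating ik with ES

1.1 Download

1.2 Build

Unzip the downloaded elasticsearch-analysis-ik-1.2.6.zip, then build it from a command prompt (Windows Start menu -> Run -> cmd -> Enter):

    e:
    cd E:\j2ee\search\中文分詞器\for_es\elasticsearch-analysis-ik-1.2.6
    E:\j2ee\maven\apache-maven-3.1.1-bin\apache-maven-3.1.1\bin\mvn package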

1.3 Configuration

1.3.1 Create the directory plugins/analysis-ik under %ES_HOME%

    mkdir -p /usr/local/search/elasticsearch-1.3.1/plugins/analysis-ik

1.3.2 Copy elasticsearch-analysis-ik-1.2.6.jar into /usr/local/search/elasticsearch-1.3.1/plugins/analysis-ik

1.3.3 Copy the config/ik directory from the unzipped elasticsearch-analysis-ik-1.2.6.zip into /usr/local/search/elasticsearch-1.3.1/config/

1.3.4 Edit elasticsearch.yml

    vi /usr/local/search/elasticsearch-1.3.1/config/elasticsearch.yml

Add the following settings. The last line makes ik the default analyzer:

    index:
      analysis:
        analyzer:
          ik:
              alias: [news_analyzer_ik,ik_analyzer]
              type: org.elasticsearch.index.analysis.IkAnalyzerProvider
    index.analysis.analyzer.default.type : "ik"
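The elasticsearch.yml fragment above mixes a nested block with one flat dotted key. The two styles can coexist because ES reads settings as flat dotted keys. The following is a minimal sketch of that flattening using only the JDK. The class SettingsFlattener and its methods are made up for illustration and are not part of Elasticsearch, and the numeric suffix used for list items is an assumption about how arrays are represented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SettingsFlattener {

    /** Recursively turns nested maps into "a.b.c" -> value entries. */
    static Map<String, String> flatten(Map<String, Object> nested) {
        Map<String, String> flat = new LinkedHashMap<String, String>();
        flattenInto("", nested, flat);
        return flat;
    }

    private static void flattenInto(String prefix, Object node, Map<String, String> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
                flattenInto(key, e.getValue(), out);
            }
        } else if (node instanceof List) {
            // Assumption: list items get a numeric suffix (alias.0, alias.1, ...).
            List<?> items = (List<?>) node;
            for (int i = 0; i < items.size(); i++) {
                flattenInto(prefix + "." + i, items.get(i), out);
            }
        } else {
            out.put(prefix, String.valueOf(node));
        }
    }

    /** Builds the nested form of the ik analyzer block from elasticsearch.yml. */
    static Map<String, Object> ikSettings() {
        Map<String, Object> ik = new LinkedHashMap<String, Object>();
        ik.put("alias", Arrays.asList("news_analyzer_ik", "ik_analyzer"));
        ik.put("type", "org.elasticsearch.index.analysis.IkAnalyzerProvider");
        Map<String, Object> analyzer = new LinkedHashMap<String, Object>();
        analyzer.put("ik", ik);
        Map<String, Object> analysis = new LinkedHashMap<String, Object>();
        analysis.put("analyzer", analyzer);
        Map<String, Object> index = new LinkedHashMap<String, Object>();
        index.put("analysis", analysis);
        Map<String, Object> root = new LinkedHashMap<String, Object>();
        root.put("index", index);
        return root;
    }

    public static void main(String[] args) {
        // Prints one "dotted.key = value" line per setting.
        for (Map.Entry<String, String> e : flatten(ikSettings()).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Seen this way, the nested ik block produces keys such as index.analysis.analyzer.ik.type, which sit in the same namespace as the flat index.analysis.analyzer.default.type line.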
1.3.5 IKAnalyzer.cfg.xml

Extension dictionaries and stopword dictionaries can be configured in /usr/local/search/elasticsearch-1.3.1/config/ik/IKAnalyzer.cfg.xml:

    vi /usr/local/search/elasticsearch-1.3.1/config/ik/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Configure your own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!-- Configure your own extension stopword dictionaries here -->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>

1.3.6 Restart ES

    /usr/local/search/elasticsearch-1.3.1/bin/service/elasticsearch stop
    /usr/local/search/elasticsearch-1.3.1/bin/service/elasticsearch start

1.4 Testing

The methods below use the ES Java client and assume the following imports (ES 1.x package names):

    import java.io.IOException;
    import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
    import org.elasticsearch.action.admin.indices.mapping.put.PutMappingRequestBuilder;
    import org.elasticsearch.action.admin.indices.mapping.put.PutMappingResponse;
    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequestBuilder;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.xcontent.XContentBuilder;
    import org.elasticsearch.common.xcontent.XContentFactory;
    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;
    import org.elasticsearch.search.SearchHits;

1.4.1 Create a mapping that uses the Chinese analyzer

    /**
     * Creates the type mapping, using the Chinese analyzer.
     * Note: the index must be created before the mapping is defined.
     * @param client
     * @throws IOException
     */
    public static void mapping4CN(Client client) throws IOException {
        XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("fulltext")
            .startObject("_all")
                .field("indexAnalyzer", "ik").field("searchAnalyzer", "ik")
                .field("term_vector", "no").field("store", "false")
            .endObject()
            .startObject("properties")
                .startObject("content")
                    .field("type", "string").field("store", "no")
                    .field("term_vector", "with_positions_offsets")
                    .field("indexAnalyzer", "ik").field("searchAnalyzer", "ik")
                    .field("include_in_all", "true").field("boost", 8)
                .endObject()
            .endObject()
            .endObject().endObject();
        System.out.println(mapping.string());
        // The index must exist before the mapping is defined, so create it if needed.
        // indexExist is a helper method that is not shown in this article.
        if (!indexExist(client, "cnindex")) {
            CreateIndexResponse ciresponse = client.admin().indices().prepareCreate("cnindex").execute().actionGet();
            System.out.println("CreateIndexResponse---->" + ciresponse.isAcknowledged());
        }
        // Create the mapping (the index name must be specified).
        PutMappingRequestBuilder pmrbuilder = client.admin().indices().preparePutMapping("cnindex").setType("fulltext").setSource(mapping);
        PutMappingResponse pmResponse = pmrbuilder.execute().actionGet();
        System.out.println("PutMappingResponse---->" + pmResponse.isAcknowledged());
    }

1.4.2 Index Chinese documents

    /**
     * Indexes Chinese content.
     * @param client
     * @throws IOException
     */
    public static void createIndex4CN(Client client) throws IOException {
        XContentBuilder doc1 = XContentFactory.jsonBuilder().startObject()
            .field("content", "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船")
            .endObject();
        XContentBuilder doc2 = XContentFactory.jsonBuilder().startObject()
            .field("content", "美國留給伊拉克的是個爛攤子嗎")
            .endObject();
        XContentBuilder doc3 = XContentFactory.jsonBuilder().startObject()
            .field("content", "公安部:各地校車將享最高路權")
            .endObject();
        XContentBuilder doc4 = XContentFactory.jsonBuilder().startObject()
            .field("content", "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首")
            .endObject();
        /*
         * prepareIndex arguments:
         * 1. productIndex: the index name; one ES cluster can hold many indices.
         * 2. productIndexType: the index type, which separates different kinds of
         *    data within one index; an index can hold many types.
         * 3. productIndexId: the document id.
         *
         * NOTE: mapping4CN registers its mapping for type "fulltext", while the documents
         * below are indexed under type "cnindextype". The mapping only applies to these
         * documents if both places use the same type name.
         */
        IndexRequestBuilder irbuilder1 = client.prepareIndex("cnindex", "cnindextype", "cnindexid1").setRefresh(true).setSource(doc1);
        IndexRequestBuilder irbuilder2 = client.prepareIndex("cnindex", "cnindextype", "cnindexid2").setRefresh(true).setSource(doc2);
        IndexRequestBuilder irbuilder3 = client.prepareIndex("cnindex", "cnindextype", "cnindexid3").setRefresh(true).setSource(doc3);
        IndexRequestBuilder irbuilder4 = client.prepareIndex("cnindex", "cnindextype", "cnindexid4").setRefresh(true).setSource(doc4);
        BulkRequestBuilder brbuilder = client.prepareBulk();
        brbuilder.add(irbuilder1);
        brbuilder.add(irbuilder2);
        brbuilder.add(irbuilder3);
        brbuilder.add(irbuilder4);
        BulkResponse response = brbuilder.execute().actionGet();
        System.out.println(response);
    }

1.4.3 Run a Chinese search

    /**
     * Runs a Chinese search.
     * @param client
     */
    public static void search4CN(Client client) {
        // Build the query: a TermQuery.
        QueryBuilder qb1 = QueryBuilders.termQuery("content", "伊拉克");
        /*
        QueryBuilder qb2 = QueryBuilders.boolQuery().must(QueryBuilders.termQuery("content", "中國"))
            .must(QueryBuilders.termQuery("content", "中國"))
            .mustNot(QueryBuilders.termQuery("onSale", false))
            .should(QueryBuilders.termQuery("type", 1));
        QueryBuilder db3 = QueryBuilders.filteredQuery(QueryBuilders.termQuery("content", "中國"),
            FilterBuilders.rangeFilter("price").from(30.0).to(500.0).includeLower(true).includeUpper(false));
        */
        SearchResponse response = client.prepareSearch("cnindex").setTypes("cnindextype").setQuery(qb1)
            .setFrom(0).setSize(15)
            .addHighlightedField("content")
            .setHighlighterPreTags("<span style=\"color:red\">").setHighlighterPostTags("</span>")
            .setExplain(true).execute().actionGet();
        SearchHits shits = response.getHits();
        SearchHit[] shs = shits.hits();
        for (SearchHit sh : shs) {
            String content = (String) sh.getSource().get("content");
            System.out.println("content=" + content);
        }
    }

2. Integrating mmseg with ES

2.1 Download

2.2 Build

Unzip the downloaded elasticsearch-analysis-mmseg-1.2.0.zip, then build it from a command prompt (Windows Start menu -> Run -> cmd -> Enter):

    e:
    cd E:\j2ee\search\中文分詞器\for_es\elasticsearch-analysis-mmseg-1.2.0
    E:\j2ee\maven\apache-maven-3.1.1-bin\apache-maven-3.1.1\bin\mvn package



2.3 Configuration

2.3.1 Create the directory plugins/analysis-mmseg under %ES_HOME%

    mkdir -p /usr/local/search/elasticsearch-1.3.1/plugins/analysis-mmseg

2.3.2 Copy elasticsearch-analysis-mmseg-1.2.0.jar into /usr/local/search/elasticsearch-1.3.1/plugins/analysis-mmseg

2.3.3 Copy the config/mmseg directory from the unzipped elasticsearch-analysis-mmseg-1.2.0.zip into /usr/local/search/elasticsearch-1.3.1/config/
2.3.4 Edit elasticsearch.yml

    vi /usr/local/search/elasticsearch-1.3.1/config/elasticsearch.yml

Add the mmseg analyzer next to the ik one:

    index:
      analysis:
        analyzer:
          ik:
              alias: [news_analyzer_ik,ik_analyzer]
              type: org.elasticsearch.index.analysis.IkAnalyzerProvider
          mmseg:
              alias: [news_analyzer, mmseg_analyzer]
              type: org.elasticsearch.index.analysis.MMsegAnalyzerProvider
2.3.5 Restart ES

    /usr/local/search/elasticsearch-1.3.1/bin/service/elasticsearch stop
    /usr/local/search/elasticsearch-1.3.1/bin/service/elasticsearch start
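Before running the tests, it may help to have some intuition for what a dictionary-based segmenter does with Chinese text. The mmseg name refers to the MMSEG algorithm, which builds on maximum matching against a word dictionary. The sketch below is a deliberately simplified forward maximum matching pass. It is not the plugin's implementation, and the three-word dictionary and the class name ForwardMaxMatch are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ForwardMaxMatch {

    /**
     * At each position, takes the longest dictionary word of at most maxWordLen
     * characters; if no word matches, emits a single character.
     */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real segmenter loads a large word list.
        Set<String> dict = new HashSet<String>(Arrays.asList("公安部", "校車", "最高"));
        System.out.println(segment("公安部:各地校車將享最高路權", dict, 3));
    }
}
```

With this toy dictionary, 校車 comes out as a single token, which is what allows a term query for 校車 to match; characters that are not covered by any dictionary word fall back to single-character tokens.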
2.4 Testing

2.4.1 Create a mapping that uses the Chinese analyzer

    /**
     * Creates the type mapping, using the mmseg Chinese analyzer.
     * Note: the index must be created before the mapping is defined.
     * @param client
     * @throws IOException
     */
    public static void mapping4CN_MMSEG(Client client) throws IOException {
        XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("fulltext_mmseg")
            .startObject("_all")
                .field("indexAnalyzer", "mmseg").field("searchAnalyzer", "mmseg")
                .field("term_vector", "no").field("store", "true")
            .endObject()
            .startObject("properties")
                .startObject("content")
                    .field("type", "string").field("store", "yes")
                    .field("term_vector", "with_positions_offsets")
                    .field("indexAnalyzer", "mmseg").field("searchAnalyzer", "mmseg")
                    .field("include_in_all", "true").field("boost", 8)
                .endObject()
            .endObject()
            .endObject().endObject();
        System.out.println(mapping.string());
        // The index must exist before the mapping is defined, so create it if needed.
        // indexExist is a helper method that is not shown in this article.
        if (!indexExist(client, "cnindex_mmseg")) {
            CreateIndexResponse ciresponse = client.admin().indices().prepareCreate("cnindex_mmseg").execute().actionGet();
            System.out.println("CreateIndexResponse---->" + ciresponse.isAcknowledged());
        }
        // Create the mapping (the index name must be specified).
        PutMappingRequestBuilder pmrbuilder = client.admin().indices().preparePutMapping("cnindex_mmseg").setType("fulltext_mmseg").setSource(mapping);
        PutMappingResponse pmResponse = pmrbuilder.execute().actionGet();
        System.out.println("PutMappingResponse---->" + pmResponse.isAcknowledged());
    }

2.4.2 Index Chinese documents

    /**
     * Indexes Chinese content.
     * @param client
     * @throws IOException
     */
    public static void createIndex4CN_MMSEG(Client client) throws IOException {
        XContentBuilder doc1 = XContentFactory.jsonBuilder().startObject()
            .field("content", "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船")
            .endObject();
        XContentBuilder doc2 = XContentFactory.jsonBuilder().startObject()
            .field("content", "美國留給伊拉克的是個爛攤子嗎")
            .endObject();
        XContentBuilder doc3 = XContentFactory.jsonBuilder().startObject()
            .field("content", "公安部:各地校車將享最高路權")
            .endObject();
        XContentBuilder doc4 = XContentFactory.jsonBuilder().startObject()
            .field("content", "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首")
            .endObject();
        /*
         * prepareIndex arguments:
         * 1. productIndex: the index name; one ES cluster can hold many indices.
         * 2. productIndexType: the index type, which separates different kinds of
         *    data within one index; an index can hold many types.
         * 3. productIndexId: the document id.
         *
         * NOTE: mapping4CN_MMSEG registers its mapping for type "fulltext_mmseg", while the
         * documents below are indexed under type "cnindextype_mmseg". The mmseg mapping only
         * applies to these documents if both places use the same type name.
         */
        IndexRequestBuilder irbuilder1 = client.prepareIndex("cnindex_mmseg", "cnindextype_mmseg", "cnindexid_mmseg1").setRefresh(true).setSource(doc1);
        IndexRequestBuilder irbuilder2 = client.prepareIndex("cnindex_mmseg", "cnindextype_mmseg", "cnindexid_mmseg2").setRefresh(true).setSource(doc2);
        IndexRequestBuilder irbuilder3 = client.prepareIndex("cnindex_mmseg", "cnindextype_mmseg", "cnindexid_mmseg3").setRefresh(true).setSource(doc3);
        IndexRequestBuilder irbuilder4 = client.prepareIndex("cnindex_mmseg", "cnindextype_mmseg", "cnindexid_mmseg4").setRefresh(true).setSource(doc4);
        BulkRequestBuilder brbuilder = client.prepareBulk();
        brbuilder.add(irbuilder1);
        brbuilder.add(irbuilder2);
        brbuilder.add(irbuilder3);
        brbuilder.add(irbuilder4);
        BulkResponse response = brbuilder.execute().actionGet();
        System.out.println(response);
    }

2.4.3 Run a Chinese search

    /**
     * Runs a Chinese search.
     * @param client
     */
    public static void search4CN_MMSEG(Client client) {
        // Build the query: a TermQuery.
        QueryBuilder qb1 = QueryBuilders.termQuery("content", "校車");
        /*
        QueryBuilder qb2 = QueryBuilders.boolQuery().must(QueryBuilders.termQuery("content", "中國"))
            .must(QueryBuilders.termQuery("content", "中國"))
            .mustNot(QueryBuilders.termQuery("onSale", false))
            .should(QueryBuilders.termQuery("type", 1));
        QueryBuilder db3 = QueryBuilders.filteredQuery(QueryBuilders.termQuery("content", "中國"),
            FilterBuilders.rangeFilter("price").from(30.0).to(500.0).includeLower(true).includeUpper(false));
        */
        SearchResponse response = client.prepareSearch("cnindex_mmseg").setTypes("cnindextype_mmseg").setQuery(qb1)
            .setFrom(0).setSize(15)
            .addHighlightedField("content")
            .setHighlighterPreTags("<span style=\"color:red\">").setHighlighterPostTags("</span>")
            .setExplain(true).execute().actionGet();
        SearchHits shits = response.getHits();
        SearchHit[] shs = shits.hits();
        for (SearchHit sh : shs) {
            String content = (String) sh.getSource().get("content");
            System.out.println("content=" + content);
        }
    }

3. Integrating pinyin4j with ES

3.1 Download

3.2 Build
Unzip the downloaded archive, then build it from a command prompt (Windows Start menu -> Run -> cmd -> Enter):

    e:
    cd E:\j2ee\search\中文分詞器\for_es\elasticsearch-analysis-pinyin-1.2.2
    E:\j2ee\maven\apache-maven-3.1.1-bin\apache-maven-3.1.1\bin\mvn package


3.3 Configuration

3.3.1 Create the directory plugins/analysis-pinyin under %ES_HOME%

    mkdir -p /usr/local/search/elasticsearch-1.3.1/plugins/analysis-pinyin

3.3.2 Copy lib/pinyin4j-2.5.0.jar and target/elasticsearch-analysis-pinyin-1.2.2.jar into /usr/local/search/elasticsearch-1.3.1/plugins/analysis-pinyin


3.3.4 Edit elasticsearch.yml

    vi /usr/local/search/elasticsearch-1.3.1/config/elasticsearch.yml

With all three analyzers registered and ik as the default, the settings look like this:

    index:
      analysis:
        analyzer:
          ik:
              alias: [news_analyzer_ik,ik_analyzer]
              type: org.elasticsearch.index.analysis.IkAnalyzerProvider
          mmseg:
              alias: [news_analyzer_mmseg, mmseg_analyzer]
              type: org.elasticsearch.index.analysis.MMsegAnalyzerProvider
          pinyin:
              alias: [news_analyzer_pinyin, pinyin_analyzer]
              type: org.elasticsearch.index.analysis.PinyinAnalyzerProvider
    index.analysis.analyzer.default.type : "ik"
3.3.5 Restart ES

    /usr/local/search/elasticsearch-1.3.1/bin/service/elasticsearch stop
    /usr/local/search/elasticsearch-1.3.1/bin/service/elasticsearch start
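The pinyin analyzer converts Chinese characters to their pinyin romanization at analysis time. The sketch below only illustrates that idea with a hand-written lookup table. It does not call pinyin4j or the plugin, the class ToyPinyin and its four-entry table are made up for illustration, and the tokens that the real analyzer emits may be shaped differently.

```java
import java.util.HashMap;
import java.util.Map;

final class ToyPinyin {

    // Hypothetical hand-written table covering only the characters used below.
    private static final Map<Character, String> TABLE = new HashMap<Character, String>();
    static {
        TABLE.put('中', "zhong");
        TABLE.put('國', "guo");
        TABLE.put('漁', "yu");
        TABLE.put('船', "chuan");
    }

    /**
     * Replaces each character found in the table with its pinyin and joins the
     * results with single spaces; characters not in the table are kept as-is.
     */
    static String toPinyin(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String py = TABLE.get(c);
            sb.append(py != null ? py : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPinyin("中國漁船"));
    }
}
```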
3.4 Testing

3.4.1 Create a mapping that uses the Chinese analyzer

3.4.2 Index Chinese documents

3.4.3 Run a Chinese search
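One point applies to the search steps of all three plugins. The examples in 1.4.3 and 2.4.3 use termQuery, which looks up the given string as a single token without analyzing it, so it matches only if the analyzer emitted exactly that token at index time. The sketch below mimics this with a tiny in-memory inverted index. The token lists are hand-written assumptions about what an analyzer might produce, not actual ik or mmseg output, and the class TinyInvertedIndex is made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TinyInvertedIndex {

    // token -> ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    /** Records that each token occurs in the given document. */
    void add(String docId, List<String> tokens) {
        for (String token : tokens) {
            Set<String> ids = postings.get(token);
            if (ids == null) {
                ids = new TreeSet<String>();
                postings.put(token, ids);
            }
            ids.add(docId);
        }
    }

    /** Exact-token lookup: the query string is not segmented or otherwise analyzed. */
    Set<String> term(String token) {
        Set<String> ids = postings.get(token);
        return ids == null ? Collections.<String>emptySet() : ids;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        // Assumed tokenizations, written by hand for this example.
        index.add("cnindexid2", Arrays.asList("美國", "留給", "伊拉克", "的", "是", "個", "爛攤子", "嗎"));
        index.add("cnindexid3", Arrays.asList("公安部", "各地", "校車", "將", "享", "最高", "路權"));

        // "伊拉克" was indexed as a token, so the lookup finds its document.
        System.out.println(index.term("伊拉克"));
        // "美國留給" spans two tokens and is not a token itself, so nothing is found.
        System.out.println(index.term("美國留給"));
    }
}
```

If a term query unexpectedly returns nothing, it is worth checking which tokens the configured analyzer actually produces for the text, for example with the ES _analyze API.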