ES Basics (22): Multilingual Analysis, Chinese Tokenization, and Search
阿新 • Published: 2020-12-27
Course demo
The sentences below are classic Chinese segmentation-ambiguity puns built from repeated characters; they make good test input for a tokenizer:

- 來到楊過曾經生活過的地方,小龍女動情地說:“我也想過過過兒過過的生活。”
- 你也想犯範范瑋琪犯過的錯嗎
- 校長說衣服上除了校徽別別別的
- 這幾天天天天氣不好
- 我背有點駝,麻麻說“你的背得背背背背佳”
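To see why Chinese text needs a dedicated analyzer, compare tokenizations with the `_analyze` API. A minimal sketch on one of the sentences above, assuming a node with the IK plugin installed (`ik_smart` is used here for illustration; any Chinese-aware analyzer shows the same contrast):

```
# standard analyzer: emits one token per CJK character
POST _analyze
{
  "analyzer": "standard",
  "text": ["這幾天天天天氣不好"]
}

# ik_smart: groups characters into dictionary words instead
POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["這幾天天天天氣不好"]
}
```

The character-by-character output of `standard` is what makes phrase and relevance queries on Chinese text behave poorly without a plugin.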
```
# stop-word demo
DELETE my_index

PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }

POST my_index/_search
{
  "query": {
    "match": { "title": "not happy fox" }
  }
}
```

Although the `english` analyzer loosens the matching rules and thereby improves recall, it weakens our ability to match documents precisely. To get the best of both worlds, we can use multi-fields to index the `title` field twice: once with the `english` analyzer and once with the `standard` analyzer:

```
DELETE my_index

# single-analyzer version: title is analyzed only with english
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" }
    }
  }
}

DELETE my_index

# multi-field version: title keeps the standard analyzer,
# while title.english is analyzed with english
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}

PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
```

Installing the Chinese analysis plugins:

```
# install the IK analysis plugin
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

# install the HanLP analysis plugin
bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip
```

Analyzers provided by IK:

- `ik_max_word`: finest-grained segmentation (exhausts the possible word combinations)
- `ik_smart`: coarsest-grained segmentation

Analyzers provided by HanLP:

- `hanlp`: HanLP default segmentation
- `hanlp_standard`: standard segmentation
- `hanlp_index`: index-oriented segmentation
- `hanlp_nlp`: NLP segmentation
- `hanlp_n_short`: N-shortest-path segmentation
- `hanlp_dijkstra`: shortest-path (Dijkstra) segmentation
- `hanlp_crf`: CRF segmentation (deprecated since HanLP 1.6.6)
- `hanlp_speed`: extreme-speed dictionary-based segmentation

```
POST _analyze
{
  "analyzer": "hanlp_standard",
  "text": ["劍橋分析公司多位高管對臥底記者說,他們確保了唐納德·特朗普在總統大選中獲勝"]
}
```

Pinyin analysis:

```
PUT /artists/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "user_name_analyzer": {
          "tokenizer": "whitespace",
          "filter": "pinyin_first_letter_and_full_pinyin_filter"
        }
      },
      "filter": {
        "pinyin_first_letter_and_full_pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": false,
          "keep_none_chinese": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "trim_whitespace": true,
          "keep_none_chinese_in_first_letter": true
        }
      }
    }
  }
}

GET /artists/_analyze
{
  "text": ["劉德華 張學友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}
```
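Because the filter above keeps only first-letter abbreviations (`keep_first_letter: true`, `keep_full_pinyin: false`), a name like 劉德華 should be indexed under its pinyin initials (`ldh`). A minimal end-to-end sketch, assuming the `artists` index from above; the `user_name` field name is hypothetical:

```
# map a field with the pinyin analyzer (field name is an assumption)
PUT /artists/_mapping
{
  "properties": {
    "user_name": {
      "type": "text",
      "analyzer": "user_name_analyzer"
    }
  }
}

PUT /artists/_doc/1
{ "user_name": "劉德華" }

# search by pinyin initials
POST /artists/_search
{
  "query": {
    "match": { "user_name": "ldh" }
  }
}
```

This initials-style lookup is the typical use case for the pinyin filter: users type a few Latin letters and still match names stored in Chinese.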
Related resources

- Elasticsearch IK analysis plugin: https://github.com/medcl/elasticsearch-analysis-ik/releases
- Elasticsearch HanLP analysis plugin: https://github.com/KennFalcon/elasticsearch-analysis-hanlp

Some Chinese word-segmentation tools, for reference:

- NLPIR (Institute of Computing Technology, Chinese Academy of Sciences): http://ictclas.nlpir.org/nlpir/
- ansj segmenter: https://github.com/NLPchina/ansj_seg
- LTP (Harbin Institute of Technology): https://github.com/HIT-SCIR/ltp
- THULAC (Tsinghua University): https://github.com/thunlp/THULAC
- Stanford Word Segmenter: https://nlp.stanford.edu/software/segmenter.shtml
- HanLP: https://github.com/hankcs/HanLP
- jieba segmenter (cppjieba): https://github.com/yanyiwu/cppjieba
- KCWS segmenter (character embeddings + Bi-LSTM + CRF): https://github.com/koth/kcws
- ZPar: https://github.com/frcchang/zpar/releases
- IKAnalyzer: https://github.com/wks/ik-analyzer