ES Basics (22): Multilingual Analysis, Chinese Tokenization, and Search
阿新 • Published: 2020-12-27
Course demo
The sentences below are classic Chinese segmentation-ambiguity puns built from repeated characters; they make good test input for a tokenizer:

- 來到楊過曾經生活過的地方,小龍女動情地說:“我也想過過過兒過過的生活。”
- 你也想犯範范瑋琪犯過的錯嗎
- 校長說衣服上除了校徽別別別的
- 這幾天天天天氣不好
- 我背有點駝,麻麻說“你的背得背背背背佳”
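To see why Chinese text needs a dedicated analyzer, compare tokenizations with the `_analyze` API. A minimal sketch on one of the sentences above, assuming a node with the IK plugin installed (`ik_smart` is used here for illustration; any Chinese-aware analyzer shows the same contrast):

```
# standard analyzer: emits one token per CJK character
POST _analyze
{
  "analyzer": "standard",
  "text": ["這幾天天天天氣不好"]
}

# ik_smart: groups characters into dictionary words instead
POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["這幾天天天天氣不好"]
}
```

The character-by-character output of `standard` is what makes phrase and relevance queries on Chinese text behave poorly without a plugin.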
```
# stop-word demo
DELETE my_index

PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }

POST my_index/_search
{
  "query": {
    "match": { "title": "not happy fox" }
  }
}
```

Although the `english` analyzer loosens the matching rules and thereby improves recall, it weakens our ability to match documents precisely. To get the best of both worlds, we can use multi-fields to index the `title` field twice: once with the `english` analyzer and once with the `standard` analyzer:

```
DELETE my_index

# single-analyzer version: title is analyzed only with english
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" }
    }
  }
}

DELETE my_index

# multi-field version: title keeps the standard analyzer,
# while title.english is analyzed with english
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}

PUT /my_index/_doc/1
{ "title": "I'm happy for this fox" }

PUT /my_index/_doc/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
```

Installing the Chinese analysis plugins:

```
# install the IK analysis plugin
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

# install the HanLP analysis plugin
bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip
```

Analyzers provided by IK:

- `ik_max_word`: finest-grained segmentation (exhausts the possible word combinations)
- `ik_smart`: coarsest-grained segmentation

Analyzers provided by HanLP:

- `hanlp`: HanLP default segmentation
- `hanlp_standard`: standard segmentation
- `hanlp_index`: index-oriented segmentation
- `hanlp_nlp`: NLP segmentation
- `hanlp_n_short`: N-shortest-path segmentation
- `hanlp_dijkstra`: shortest-path (Dijkstra) segmentation
- `hanlp_crf`: CRF segmentation (deprecated since HanLP 1.6.6)
- `hanlp_speed`: extreme-speed dictionary-based segmentation

```
POST _analyze
{
  "analyzer": "hanlp_standard",
  "text": ["劍橋分析公司多位高管對臥底記者說,他們確保了唐納德·特朗普在總統大選中獲勝"]
}
```

Pinyin analysis:

```
PUT /artists/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "user_name_analyzer": {
          "tokenizer": "whitespace",
          "filter": "pinyin_first_letter_and_full_pinyin_filter"
        }
      },
      "filter": {
        "pinyin_first_letter_and_full_pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": false,
          "keep_none_chinese": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "trim_whitespace": true,
          "keep_none_chinese_in_first_letter": true
        }
      }
    }
  }
}

GET /artists/_analyze
{
  "text": ["劉德華 張學友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}
```
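Because the filter above keeps only first-letter abbreviations (`keep_first_letter: true`, `keep_full_pinyin: false`), a name like 劉德華 should be indexed under its pinyin initials (`ldh`). A minimal end-to-end sketch, assuming the `artists` index from above; the `user_name` field name is hypothetical:

```
# map a field with the pinyin analyzer (field name is an assumption)
PUT /artists/_mapping
{
  "properties": {
    "user_name": {
      "type": "text",
      "analyzer": "user_name_analyzer"
    }
  }
}

PUT /artists/_doc/1
{ "user_name": "劉德華" }

# search by pinyin initials
POST /artists/_search
{
  "query": {
    "match": { "user_name": "ldh" }
  }
}
```

This initials-style lookup is the typical use case for the pinyin filter: users type a few Latin letters and still match names stored in Chinese.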
Related resources

- Elasticsearch IK analysis plugin: https://github.com/medcl/elasticsearch-analysis-ik/releases
- Elasticsearch HanLP analysis plugin: https://github.com/KennFalcon/elasticsearch-analysis-hanlp

Some Chinese word-segmentation tools, for reference:

- NLPIR (Institute of Computing Technology, Chinese Academy of Sciences): http://ictclas.nlpir.org/nlpir/
- ansj segmenter: https://github.com/NLPchina/ansj_seg
- LTP (Harbin Institute of Technology): https://github.com/HIT-SCIR/ltp
- THULAC (Tsinghua University): https://github.com/thunlp/THULAC
- Stanford Word Segmenter: https://nlp.stanford.edu/software/segmenter.shtml
- HanLP: https://github.com/hankcs/HanLP
- jieba segmenter (cppjieba): https://github.com/yanyiwu/cppjieba
- KCWS segmenter (character embeddings + Bi-LSTM + CRF): https://github.com/koth/kcws
- ZPar: https://github.com/frcchang/zpar/releases
- IKAnalyzer: https://github.com/wks/ik-analyzer