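HM-SpringCloud Microservices Series 7.2 [Autocomplete]
阿新 • • Published: 2022-03-30
Autocomplete requirements
When a user types characters into the search box, we should suggest search terms related to those characters, as shown in the figure:
Suggesting complete terms based on the letters the user has typed is what autocomplete means.
Because the suggestions have to be inferred from pinyin letters, we need pinyin analysis.
1 Pinyin analyzer
1.1 Introduction to the pinyin analyzer
To complete terms from letters, documents must be analyzed by pinyin. There happens to be a pinyin analysis plugin for elasticsearch on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin
It is installed the same way as the IK analyzer, in four steps:
① Unzip the plugin
② Upload it to the virtual machine, into elasticsearch's plugins directory
③ Restart elasticsearch
④ Test
For detailed installation steps, see the IK analyzer installation walkthrough: https://www.cnblogs.com/yppah/p/15936823.html
1.2 Installing the pinyin analyzer
1.3 Testing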
{
"tokens" : [
{ "token" : "ru", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
{ "token" : "rjjdhbcm", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
{ "token" : "jia", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 },
{ "token" : "jiu", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 },
{ "token" : "dian", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 3 },
{ "token" : "hai", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 4 },
{ "token" : "bu", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 5 },
{ "token" : "cuo", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 6 },
{ "token" : "ma", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 7 }
]
}
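2 Custom analyzers
2.1 The default pinyin analyzer
- The default pinyin analyzer converts each Chinese character to pinyin separately, whereas we want one group of pinyin per term. The pinyin analyzer therefore has to be customized into a custom analyzer.
- An analyzer in elasticsearch is made up of three parts:
- character filters: process the text before the tokenizer, for example by deleting or replacing characters
- tokenizer: splits the text into terms according to some rule. Examples are keyword, which does not split at all, and ik_smart
- token filters: further process the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion
- When a document is analyzed, the three parts process the text in that order: character filters first, then the tokenizer, then the token filters.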
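To make the order of the three stages concrete, here is a minimal plain-Java sketch. It does not use elasticsearch at all: the class name `AnalyzerPipelineSketch`, the hyphen-replacing character filter, the whitespace tokenizer, and the lowercase token filter are all stand-ins chosen for illustration, not the behavior of any real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AnalyzerPipelineSketch {

    // Stage 1, character filter (illustrative): replace '-' with a space before tokenizing.
    static String charFilter(String text) {
        return text.replace('-', ' ');
    }

    // Stage 2, tokenizer (illustrative): split the filtered text on whitespace.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Stage 3, token filter (illustrative): lowercase every term.
    static List<String> tokenFilter(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // The analyzer applies the three stages strictly in this order.
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Sony WH-1000XM3")); // prints [sony, wh, 1000xm3]
    }
}
```

In a real custom analyzer, the pinyin conversion sits in the third stage, which is why the tokenizer decides which groups of characters are turned into pinyin together.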
2.2 Defining a custom analyzer
2.2.1
# Custom pinyin analyzer
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "ik_max_word",
"filter": "py"
}
},
"filter": {
"py": {
"type": "pinyin",
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
POST /test/_analyze
{
"text": ["如家酒店還不錯"],
"analyzer": "my_analyzer"
}
{
"tokens" : [
{
"token" : "如家",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "rujia",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "rj",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "酒店",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "jiudian",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "jd",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "還不",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "haibu",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "hb",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "不錯",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "bucuo",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "bc",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
}
]
}
2.2.2
POST /test/_doc/1
{
"id": 1,
"name": "獅子"
}
POST /test/_doc/2
{
"id": 2,
"name": "蝨子"
}
GET /test/_search
{
"query": {
"match": {
"name": "掉入獅子籠咋辦"
}
}
}
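Problem: the query is about 獅子 (lion), yet the document 蝨子 (louse) also matches, because both words have the same pinyin, shizi.
Cause analysis: my_analyzer is applied to the search text as well as to the indexed documents, so the query is converted to pinyin and then matches every homophone in the inverted index.
Solution: use the pinyin analyzer only at index time, and give the field a search_analyzer without the pinyin filter, for example ik_smart.
After recreating the index with that search_analyzer and re-inserting the two documents, the same query should no longer match 蝨子.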
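2.3 Summary
- How do you use the pinyin analyzer?
- ① Download the pinyin analyzer plugin
- ② Unzip it into elasticsearch's plugins directory
- ③ Restart elasticsearch
- How do you define a custom analyzer?
- Configure it in settings when creating the index. It can contain three parts:
- ① character filter
- ② tokenizer
- ③ filter
- What should you watch out for with the pinyin analyzer?
- To avoid matching homophones, do not use the pinyin analyzer at search time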
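The homophone caveat above can be illustrated with a small simulation. This is a sketch under stated assumptions, not how elasticsearch works internally: the class `HomophoneSketch`, its two-entry pinyin table, and the "shares at least one term" matching rule are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HomophoneSketch {

    // Tiny hard-coded pinyin table, for illustration only.
    static final Map<String, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put("獅子", "shizi");
        PINYIN.put("蝨子", "shizi");
    }

    // Index-time analysis: keep the original term and add its pinyin.
    static Set<String> indexTerms(String word) {
        Set<String> terms = new HashSet<>();
        terms.add(word);
        String py = PINYIN.get(word);
        if (py != null) {
            terms.add(py);
        }
        return terms;
    }

    // Search-time analysis: add pinyin only when the pinyin analyzer is also used for searching.
    static Set<String> searchTerms(String word, boolean pinyinAtSearch) {
        return pinyinAtSearch ? indexTerms(word) : Collections.singleton(word);
    }

    // A document matches when it shares at least one term with the analyzed query.
    static List<String> search(List<String> docs, String query, boolean pinyinAtSearch) {
        Set<String> queryTerms = searchTerms(query, pinyinAtSearch);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (!Collections.disjoint(indexTerms(doc), queryTerms)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("獅子");
        docs.add("蝨子");
        System.out.println(search(docs, "獅子", true));  // both documents match through "shizi"
        System.out.println(search(docs, "獅子", false)); // only the exact word matches
    }
}
```

With pinyin applied on both sides, the shared term shizi links the query to both documents. Dropping pinyin at search time leaves only the original characters in the query, which is the effect a separate search_analyzer is meant to achieve.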
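3 Autocomplete queries
- elasticsearch provides the Completion Suggester query to implement autocomplete. It matches and returns terms that start with the user's input. To keep completion queries efficient, there are some constraints on the field types in the document:
- A field used in completion queries must be of type completion.
- The field's content is usually an array of the terms to be offered as completions.
- For example, an index like this:
# Create the index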
PUT test2
{
"mappings": {
"properties": {
"title":{
"type": "completion"
}
}
}
}
- Then insert the following data:
# Sample data
POST test2/_doc
{
"title": ["Sony", "WH-1000XM3"]
}
POST test2/_doc
{
"title": ["SK-II", "PITERA"]
}
POST test2/_doc
{
"title": ["Nintendo", "switch"]
}
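- The query DSL looks like this:
# Autocomplete query
GET /test2/_search
{
"suggest": {
"title_suggest": {
"text": "s", # the keyword (prefix) typed by the user
"completion": {
"field": "title", # the field to run the completion query on
"skip_duplicates": true, # skip duplicate suggestions
"size": 10 # return the top 10 results
}
}
}
}
- Test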
{
"took" : 565,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"title_suggest" : [
{
"text" : "s",
"offset" : 0,
"length" : 1,
"options" : [
{
"text" : "SK-II",
"_index" : "test2",
"_type" : "_doc",
"_id" : "xuqv2n8BUtPonQDctNZG",
"_score" : 1.0,
"_source" : {
"title" : [
"SK-II",
"PITERA"
]
}
},
{
"text" : "Sony",
"_index" : "test2",
"_type" : "_doc",
"_id" : "xeqv2n8BUtPonQDcS9aX",
"_score" : 1.0,
"_source" : {
"title" : [
"Sony",
"WH-1000XM3"
]
}
},
{
"text" : "switch",
"_index" : "test2",
"_type" : "_doc",
"_id" : "x-qv2n8BUtPonQDcyNYX",
"_score" : 1.0,
"_source" : {
"title" : [
"Nintendo",
"switch"
]
}
}
]
}
]
}
}
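The behavior seen in this response (prefix matching, duplicates skipped, a size limit) can be mimicked in a few lines of plain Java. This is only a sketch: the class `PrefixSuggestSketch` is hypothetical, it scans every input instead of using the specialized in-memory structure a real completion field relies on, and it compares case-insensitively because the response above shows the lowercase prefix s matching Sony and SK-II.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class PrefixSuggestSketch {

    // Returns up to `size` distinct inputs that start with `prefix`, ignoring case.
    static List<String> suggest(List<List<String>> docs, String prefix, int size) {
        String lowerPrefix = prefix.toLowerCase(Locale.ROOT);
        // A case-insensitive TreeSet both sorts the matches and drops duplicates.
        Set<String> matches = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (List<String> inputs : docs) {
            for (String input : inputs) {
                if (input.toLowerCase(Locale.ROOT).startsWith(lowerPrefix)) {
                    matches.add(input);
                }
            }
        }
        List<String> result = new ArrayList<>(matches);
        return new ArrayList<>(result.subList(0, Math.min(size, result.size())));
    }

    public static void main(String[] args) {
        // The same three documents that were inserted into test2.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Sony", "WH-1000XM3"),
                Arrays.asList("SK-II", "PITERA"),
                Arrays.asList("Nintendo", "switch"));
        System.out.println(suggest(docs, "s", 10)); // prints [SK-II, Sony, switch]
    }
}
```

Note that each array element is matched independently, which is why switch is suggested even though its document starts with Nintendo.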
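4 Implementing autocomplete for the hotel search box
- Our hotel index does not have a pinyin analyzer configured yet, so its configuration has to change. However, the analyzers and existing field mappings of an index cannot be modified in place, so the index has to be deleted and recreated.
- We also need to add a field for autocomplete and put brand, business, city, and similar values into it as the completion suggestions.
4.1 Modifying the hotel mapping
4.1.1 Viewing the original hotel index
4.1.2 Modifying the hotel index
# Hotel data index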
PUT /hotel
{
"settings": {
"analysis": {
"analyzer": {
"text_anlyzer": {
"tokenizer": "ik_max_word",
"filter": "py"
},
"completion_analyzer": {
"tokenizer": "keyword",
"filter": "py"
}
},
"filter": {
"py": {
"type": "pinyin",
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"name":{
"type": "text",
"analyzer": "text_anlyzer",
"search_analyzer": "ik_smart",
"copy_to": "all"
},
"address":{
"type": "keyword",
"index": false
},
"price":{
"type": "integer"
},
"score":{
"type": "integer"
},
"brand":{
"type": "keyword",
"copy_to": "all"
},
"city":{
"type": "keyword"
},
"starName":{
"type": "keyword"
},
"business":{
"type": "keyword",
"copy_to": "all"
},
"location":{
"type": "geo_point"
},
"pic":{
"type": "keyword",
"index": false
},
"all":{
"type": "text",
"analyzer": "text_anlyzer",
"search_analyzer": "ik_smart"
},
"suggestion":{
"type": "completion",
"analyzer": "completion_analyzer"
}
}
}
}
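text_anlyzer is used for full-text search.
completion_analyzer is used for autocomplete. It uses the keyword tokenizer (no splitting) and then converts the whole value to pinyin.
4.2 Modifying the HotelDoc entity
- HotelDoc needs a new field for autocomplete. Its content can be the hotel brand, city, business district, and so on. As the completion field requires, it is best stored as an array of these values.
- We therefore add a suggestion field of type List<String> to HotelDoc and put brand, city, business, and similar values into it (the code below uses brand and business).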
package com.yppah.hoteldemo.pojo;

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    // private String isAD;
    private List<String> suggestion; // values offered to the user as completions

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // Assemble suggestion
        // this.suggestion = Arrays.asList(this.brand, this.business);
        if (this.business.contains("、")) {
            // business holds several values and has to be split
            String[] arr = this.business.split("、");
            // Add the elements
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr); // Collections.addAll adds all array elements at once
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
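The suggestion-assembly step can also be factored into a small helper that tolerates missing values. The sketch below is an assumption-laden variant, not the article's code: the class `SuggestionBuilderSketch` and the method `buildSuggestions` are hypothetical, the null handling is added for safety, and splitting on "/" as well as "、" is a guess motivated by business values such as 皇崗口岸/福田口岸 in the sample data.

```java
import java.util.ArrayList;
import java.util.List;

public class SuggestionBuilderSketch {

    // Builds the completion inputs from a brand and a (possibly multi-valued) business string.
    static List<String> buildSuggestions(String brand, String business) {
        List<String> suggestions = new ArrayList<>();
        if (brand != null && !brand.isEmpty()) {
            suggestions.add(brand);
        }
        if (business != null) {
            // Assumption: both "、" and "/" separate multiple business districts.
            for (String part : business.split("[、/]")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty() && !suggestions.contains(trimmed)) {
                    suggestions.add(trimmed);
                }
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        System.out.println(buildSuggestions("漢庭", "皇崗口岸/福田口岸")); // brand plus two separate districts
        System.out.println(buildSuggestions("和頤", null));              // brand only, no exception
    }
}
```

Splitting multi-valued fields matters for completion because each array element is matched by prefix on its own. An unsplit value can only be reached by typing its first district.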
4.3 Re-importing the hotel data in bulk
Test
GET /hotel/_search
{
"suggest": {
"suggestions": {
"text": "h",
"completion": {
"field": "suggestion",
"skip_duplicates": true,
"size": 5
}
}
}
}
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"suggestions" : [
{
"text" : "h",
"offset" : 0,
"length" : 1,
"options" : [
{
"text" : "和頤",
"_index" : "hotel",
"_type" : "_doc",
"_id" : "416268",
"_score" : 1.0,
"_source" : {
"address" : "朝陽路高井176號",
"brand" : "和頤",
"business" : "國貿地區",
"city" : "北京",
"id" : 416268,
"location" : "39.918277, 116.53015",
"name" : "和頤酒店(北京傳媒大學財滿街店)",
"pic" : "https://m.tuniucdn.com/fb2/t1/G6/M00/52/13/Cii-TF3eP5GIJIOLAAUwsIVCxdAAAGKXgK5a0IABTDI239_w200_h200_c1_t0.jpg",
"price" : 524,
"score" : 46,
"starName" : "三鑽",
"suggestion" : [
"和頤",
"國貿地區"
]
}
},
{
"text" : "漢庭",
"_index" : "hotel",
"_type" : "_doc",
"_id" : "607915",
"_score" : 1.0,
"_source" : {
"address" : "濱河大道6033號海濱廣場國皇大廈3樓",
"brand" : "漢庭",
"business" : "皇崗口岸/福田口岸",
"city" : "深圳",
"id" : 607915,
"location" : "22.528101, 114.064221",
"name" : "漢庭酒店(深圳皇崗店)",
"pic" : "https://m.tuniucdn.com/fb3/s1/2n9c/qMyCJVYuW21nsCeEBt8CMfmEhra_w200_h200_c1_t0.jpg",
"price" : 313,
"score" : 42,
"starName" : "二鑽",
"suggestion" : [
"漢庭",
"皇崗口岸/福田口岸"
]
}
},
{
"text" : "海岸城/後海",
"_index" : "hotel",
"_type" : "_doc",
"_id" : "1406627919",
"_score" : 1.0,
"_source" : {
"address" : "海德一道88號中洲控股中心A座",
"brand" : "萬豪",
"business" : "海岸城/後海",
"city" : "深圳",
"id" : 1406627919,
"location" : "22.517293, 113.933785",
"name" : "深圳中洲萬豪酒店",
"pic" : "https://m.tuniucdn.com/fb3/s1/2n9c/3wsinQAcuWtCdmv1yxauVG2PSYpC_w200_h200_c1_t0.jpg",
"price" : 204,
"score" : 47,
"starName" : "五鑽",
"suggestion" : [
"萬豪",
"海岸城/後海"
]
}
},
{
"text" : "皇冠假日",
"_index" : "hotel",
"_type" : "_doc",
"_id" : "56392",
"_score" : 1.0,
"_source" : {
"address" : "番禺路400號",
"brand" : "皇冠假日",
"business" : "徐家彙地區",
"city" : "上海",
"id" : 56392,
"location" : "31.202768, 121.429524",
"name" : "上海銀星皇冠假日酒店",
"pic" : "https://m.tuniucdn.com/fb3/s1/2n9c/37ucQ38K3UFdcRqntJ8M5dt884HR_w200_h200_c1_t0.jpg",
"price" : 809,
"score" : 47,
"starName" : "五星級",
"suggestion" : [
"皇冠假日",
"徐家彙地區"
]
}
},
{
"text" : "豪生",
"_index" : "hotel",
"_type" : "_doc",
"_id" : "45870",
"_score" : 1.0,
"_source" : {
"address" : "新元南路555號",
"brand" : "豪生",
"business" : "滴水湖臨港地區",
"city" : "上海",
"id" : 45870,
"location" : "30.871729, 121.81959",
"name" : "上海臨港豪生大酒店",
"pic" : "https://m.tuniucdn.com/fb3/s1/2n9c/2F5HoQvBgypoDUE46752ppnQaTqs_w200_h200_c1_t0.jpg",
"price" : 896,
"score" : 45,
"starName" : "四星級",
"suggestion" : [
"豪生",
"滴水湖臨港地區"
]
}
}
]
}
]
}
}
At this point, DSL-based autocomplete over the hotel data works (by pinyin).
4.4 The Java API for completion queries
4.4.1
- So far we have covered the DSL for completion queries but not the corresponding Java API. Here is an example:
- Test
@Test
void testSuggestion() throws IOException {
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    request.source().suggest(new SuggestBuilder().addSuggestion(
            "suggestions",
            SuggestBuilders.completionSuggestion("suggestion")
                    .prefix("h")
                    .skipDuplicates(true)
                    .size(5)
    ));
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    System.out.println(response);
}
4.4.2
- The completion response has its own structure, so it needs dedicated parsing code:
- Test
@Test
void testSuggestion() throws IOException {
    // 1. Prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare the DSL
    request.source().suggest(new SuggestBuilder().addSuggestion(
            "suggestions",
            SuggestBuilders.completionSuggestion("suggestion")
                    .prefix("h")
                    .skipDuplicates(true)
                    .size(5)
    ));
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    // System.out.println(response);
    Suggest suggest = response.getSuggest();
    // 4.1 Get the completion result by the name given to the suggestion
    CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
    // 4.2 Get the options
    List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
    // 4.3 Iterate over the options
    for (CompletionSuggestion.Entry.Option option : options) {
        String text = option.getText().toString();
        System.out.println(text);
    }
}
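4.5 Implementing autocomplete in the search box
- Looking at the front-end page, we can see that typing in the input box triggers an ajax request that carries the typed text in the key parameter.
- The return value is the collection of completion terms, of type List<String>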
4.5.1 HotelController
@GetMapping("suggestion")
public List<String> getSuggestion(@RequestParam("key") String prefix) {
return hotelService.getSuggestion(prefix);
}
4.5.2 IHotelService
List<String> getSuggestion(String prefix);
4.5.3 HotelService
// See testAggregation() in HotelSearchTest for reference
@Override
public List<String> getSuggestion(String prefix) {
    // Ctrl+Alt+T (IDE shortcut) wraps the selected code in try/catch or another construct
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "suggestions",
                SuggestBuilders.completionSuggestion("suggestion")
                        .prefix(prefix)
                        .skipDuplicates(true)
                        .size(5)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the response
        // System.out.println(response);
        Suggest suggest = response.getSuggest();
        // 4.1 Get the completion result by the name given to the suggestion
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2 Get the options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3 Collect the text of each option
        List<String> resList = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            resList.add(text);
        }
        return resList;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
4.5.4 Restart the service and test
At this point, both autocomplete and pinyin search are implemented.