Bucketed Retrieval Settings for Knowledge Q&A Search
阿新 • Published: 2020-08-27
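1 Why bucketed retrieval is needed
In the matching-principle overview of the earlier write-up on the index-based QA pair matching flow, we preprocessed each QA's similar questions at indexing time and generated feature vectors for them. Indexing is done per question phrasing. In real business scenarios, however, each category contains many knowledge entries, and each knowledge entry has many phrasings. If we only match similar phrasings and return their scores, the phrasings of a single knowledge entry can occupy the whole top N. To address this, we want to merge the retrieved phrasings so that each knowledge entry contributes only its highest-scoring phrasing, while the number of results returned stays configurable.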
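The desired result can be sketched in plain Java, independent of Elasticsearch. This is only an illustration of the semantics, not the implementation used here: the `Hit` class, the `bestPerKnowledge` method, and the sample scores are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BestPerKnowledge {

    /** One matched phrasing; a hypothetical stand-in for a search hit. */
    static final class Hit {
        final String kId;
        final String question;
        final double score;

        Hit(String kId, String question, double score) {
            this.kId = kId;
            this.question = question;
            this.score = score;
        }
    }

    /** Keeps the highest-scoring hit per kId and returns at most maxNum of them, best first. */
    static List<Hit> bestPerKnowledge(List<Hit> hits, int maxNum) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            Hit current = best.get(h.kId);
            if (current == null || h.score > current.score) {
                best.put(h.kId, h);
            }
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(result.subList(0, Math.min(maxNum, result.size())));
    }

    public static void main(String[] args) {
        // Three phrasings of knowledge k1 would otherwise fill a top-3 list on their own.
        List<Hit> hits = Arrays.asList(
                new Hit("k1", "phrasing 1a", 25.2),
                new Hit("k1", "phrasing 1b", 11.0),
                new Hit("k1", "phrasing 1c", 10.5),
                new Hit("k2", "phrasing 2a", 3.9));
        for (Hit h : bestPerKnowledge(hits, 2)) {
            System.out.println(h.kId + " " + h.question + " " + h.score);
        }
    }
}
```

With the sample data, the two returned entries are the best phrasing of k1 followed by the best phrasing of k2, rather than three phrasings of k1.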
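2 Design and implementation
In the ES mapping we add a knowledge field, kId, that stores the ID of the knowledge entry each phrasing belongs to (one knowledge entry to many phrasings). At query time we group on the kId field, return the single highest-scoring document from each group, and cap the number of buckets returned.
With this design in place we implemented it and ran tests (a bug was later found in this code). The grouped query code is shown below: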
// Container for the search request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Build the more_like_this query with its filters
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termsQuery("online", "1"))
        .filter(QueryBuilders.termsQuery("userId", userId))
        .filter(QueryBuilders.termsQuery("category", category.split(",")))
        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null)
                .minTermFreq(0).minDocFreq(0).minWordLength(2));
// Sort key: highest relevance score in each bucket (named "scoreTop", as in the printed query below;
// Script is org.elasticsearch.script.Script)
AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
// Fetch the single top hit of each group (top_hits sorts by score by default)
AggregationBuilder top = AggregationBuilders.topHits("result")
        .fetchSource(new String[]{"id", "title", "kId"}, null)
        .size(1);
// Group by knowledge ID
TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
        .field("kId").executionHint("map");
// Number of groups to return
groupTermsBuilder.size(maxNum);
groupTermsBuilder.subAggregation(top);
groupTermsBuilder.subAggregation(maxScore);
searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);
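During verification we found that each group did return its own highest score, but documents with higher scores existed outside the returned groups, as the following result shows (the result has been edited and is for illustration only):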
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 135,
    "max_score": 25.215496,
    "hits": [
      {
        "_index": "qaknowwledge",
        "_type": "doc",
        "_id": "11657847994935215",
        "_score": 25.215496,
        "_source": {
          "category": "11656187146936040",
          "id": "11657847994935215",
          "kId": "11657847993624508",
          "online": "1",
          "qStr": "儲存的問法1",
          "title": "知識1",
          "userId": "10869305621348777"
        }
      },
      {
        "_index": "qaknowwledge",
        "_type": "doc",
        "_id": "11657847994935216",
        "_score": 10.988454,
        "_source": {
          "category": "11656187146936040",
          "id": "11657847994935216",
          "kId": "11657847993624508",
          "online": "1",
          "questionStr": "問法2",
          "title": "知識2",
          "userId": "10869305621348777"
        }
      }
    ]
  },
  "aggregations": {
    "groupkId": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 72,
      "buckets": [
        {
          "key": "11657847993624494",
          "doc_count": 5,
          "result": {
            "hits": {
              "total": 5,
              "max_score": 3.8905885,
              "hits": [
                {
                  "_index": "qaknowwledge",
                  "_type": "doc",
                  "_id": "11657847994935160",
                  "_score": 3.8905885,
                  "_source": {
                    "kId": "11657847993624494",
                    "id": "11657847994935160",
                    "title": "知識"
                  }
                }
              ]
            }
          },
          "scoreTop": { "value": 3.8905885219573975 }
        }
      ]
    }
  }
}
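The highest-scoring record (score 25.215496, kId 11657847993624508) does not appear in the grouped results. Printing the generated query gives: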
{
  "size": 20,
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "more_like_this": {
            "fields": ["questionStr"],
            "like": ["問法"],
            "max_query_terms": 25,
            "min_term_freq": 0,
            "min_doc_freq": 0,
            "max_doc_freq": 2147483647,
            "min_word_length": 2,
            "max_word_length": 0,
            "minimum_should_match": "30%",
            "boost_terms": 0,
            "include": false,
            "fail_on_unsupported_field": true,
            "boost": 1
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "online": ["1"],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "groupkId": {
      "terms": {
        "field": "kId",
        "size": 20,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "execution_hint": "map",
        "order": [
          { "_count": "desc" },
          { "_key": "asc" }
        ]
      },
      "aggregations": {
        "result": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "_source": {
              "includes": ["id", "title", "kId"],
              "excludes": []
            }
          }
        },
        "scoreTop": {
          "max": {
            "script": {
              "source": "_score",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}
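The printed query shows that our intended sort strategy never took effect: the buckets are still ordered by the number of matching documents in each group (the default), that is:
"terms": {
  "field": "kId",
  "size": 20,
  "min_doc_count": 1,
  "shard_min_doc_count": 0,
  "show_term_doc_count_error": false,
  "execution_hint": "map",
  "order": [
    { "_count": "desc" },
    { "_key": "asc" }
  ]
}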
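To see why this ordering drops the best match, the following plain-Java sketch compares the two orderings on two buckets whose numbers are loosely modelled on the response above. It is a simplified illustration only: the `Bucket` class, the comparators, and the short keys are hypothetical, and per-shard collection in Elasticsearch is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BucketOrderDemo {

    /** Hypothetical summary of one terms bucket: key, matching doc count, best score. */
    static final class Bucket {
        final String key;
        final long docCount;
        final double maxScore;

        Bucket(String key, long docCount, double maxScore) {
            this.key = key;
            this.docCount = docCount;
            this.maxScore = maxScore;
        }
    }

    /** Mirrors the default order in the printed query: _count desc, then _key asc. */
    static final Comparator<Bucket> BY_COUNT =
            Comparator.comparingLong((Bucket b) -> b.docCount).reversed()
                    .thenComparing((Bucket b) -> b.key);

    /** The order we actually want: best score in the bucket desc, then key asc. */
    static final Comparator<Bucket> BY_MAX_SCORE =
            Comparator.comparingDouble((Bucket b) -> b.maxScore).reversed()
                    .thenComparing((Bucket b) -> b.key);

    /** Sorts a copy of the buckets and returns the keys of the first `size` of them. */
    static List<String> topKeys(List<Bucket> buckets, Comparator<Bucket> order, int size) {
        List<Bucket> sorted = new ArrayList<>(buckets);
        sorted.sort(order);
        List<String> keys = new ArrayList<>();
        for (Bucket b : sorted.subList(0, Math.min(size, sorted.size()))) {
            keys.add(b.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Bucket> buckets = Arrays.asList(
                new Bucket("k494", 5, 3.8905885),   // many weak matches
                new Bucket("k508", 2, 25.215496));  // few matches, but the best one
        System.out.println("by count:     " + topKeys(buckets, BY_COUNT, 1));
        System.out.println("by max score: " + topKeys(buckets, BY_MAX_SCORE, 1));
    }
}
```

When only one bucket is kept, ordering by count returns k494 and loses the bucket holding the top-scoring document, while ordering by the per-bucket maximum score returns k508.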
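Reviewing the query code, we found that we had only defined the max-score sub-aggregation and never applied it as the order of the terms aggregation. Adding the order fixes it; the code is as follows: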
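import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.script.Script;
import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.BucketOrder;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Container for the search request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Build the more_like_this query with its filters
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termsQuery("online", "1"))
        .filter(QueryBuilders.termsQuery("userId", userId))
        .filter(QueryBuilders.termsQuery("category", category.split(",")))
        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null)
                .minTermFreq(0).minDocFreq(0).minWordLength(2));
// Sort key: highest relevance score in each bucket; the name must match the one used in order() below
AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
// Fetch the single top hit of each group (top_hits sorts by score by default)
AggregationBuilder top = AggregationBuilders.topHits("result")
        .fetchSource(new String[]{"id", "title", "kId", "qSimhas"}, null)
        .size(1);
// Group by knowledge ID and order the buckets by their max score, descending
TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
        .field("kId").executionHint("map").order(BucketOrder.aggregation("scoreTop", false));
// Number of groups to return
groupTermsBuilder.size(maxNum);
groupTermsBuilder.subAggregation(top);
groupTermsBuilder.subAggregation(maxScore);
searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);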
In other words, apply "scoreTop" as the order of "groupTermsBuilder". The printed query then shows the buckets ordered by the highest score in each group.