Bucketed Retrieval Settings in Knowledge Q&A Search

1 Requirements for Bucketed Retrieval

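In the matching-principle section of the earlier walkthrough of the index-based QA pair matching process, we preprocessed the similar phrasings of each QA pair at ingestion time and generated their feature vectors. Ingestion is done per question phrasing. In real business scenarios, however, each category contains many knowledge items, and each knowledge item has many phrasings. If we simply match similar phrasings and return their scores, the phrasings of a single knowledge item can occupy all of the top-N slots. To address this, we want to merge the retrieved phrasings so that each knowledge item returns only its highest-scoring phrasing, and so that the number of returned items can be controlled.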
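The requirement above can be sketched in plain Java, independent of Elasticsearch. This is only an illustration of the intended result, not the actual implementation: the `Hit` class, the `bestPerKnowledge` method, and the sample values are hypothetical names and data introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopPerKnowledge {
    // Hypothetical shape of one matched phrasing; not an Elasticsearch type.
    static final class Hit {
        final String kId;       // knowledge item the phrasing belongs to
        final String question;  // the matched phrasing
        final double score;     // relevance score of the match

        Hit(String kId, String question, double score) {
            this.kId = kId;
            this.question = question;
            this.score = score;
        }
    }

    // Keep the best-scoring hit per kId, then return at most maxNum hits, best first.
    static List<Hit> bestPerKnowledge(List<Hit> hits, int maxNum) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.kId, h, (a, b) -> a.score >= b.score ? a : b);
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(maxNum, result.size()));
        return new ArrayList<>(result.subList(0, limit));
    }

    public static void main(String[] args) {
        // Illustrative data: two phrasings of k1 would otherwise fill the top 2.
        List<Hit> hits = Arrays.asList(
                new Hit("k1", "phrasing 1a", 25.2),
                new Hit("k1", "phrasing 1b", 11.0),
                new Hit("k2", "phrasing 2a", 3.9),
                new Hit("k3", "phrasing 3a", 7.5));
        for (Hit h : bestPerKnowledge(hits, 2)) {
            System.out.println(h.kId + " " + h.question + " " + h.score);
        }
    }
}
```

Doing this on the client would require fetching every matching phrasing first, which is why the design below pushes the grouping into the search engine instead.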

2 Design and Implementation

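In the ES field design we add a kId (knowledge ID) field that stores the ID of the knowledge item each phrasing belongs to; one knowledge item maps to many phrasings. At search time we group on the kId field, return the single highest-scoring document per group, and set the number of buckets to return.
With this design in place, we implemented it and ran verification tests (this code was later found to have a bug). The grouped-query code is as follows: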

// Build the search request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Build the more_like_this query, restricted by the filters
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                    .filter(QueryBuilders.termsQuery("online", "1"))
                    .filter(QueryBuilders.termsQuery("userId", userId))
                    .filter(QueryBuilders.termsQuery("category", category.split(",")))
                    .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null).minTermFreq(0).minDocFreq(0).minWordLength(2));
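// Per-group maximum score, intended as the sort key.
// Named "scoreTop" and computed from a "_score" script, matching the printed query below
// (Script is org.elasticsearch.script.Script).
AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
// Take the highest-scoring record in each group (top_hits sorts by _score by default)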
AggregationBuilder top = AggregationBuilders.topHits("result")
                .fetchSource(new String[]{"id", "title", "kId"}, null)
                .size(1);
// Group by kId (no order is set here, which is the bug discussed below)
TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
                .field("kId").executionHint("map");
// Number of groups (buckets) to return
groupTermsBuilder.size(maxNum);
groupTermsBuilder.subAggregation(top);
groupTermsBuilder.subAggregation(maxScore);

searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);

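During verification we found that each group did return its own highest score, but groups with even higher scores were missing from the buckets, as the following query result shows (the result has been trimmed and is for illustration only):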

{
	"took": 2,
	"timed_out": false,
	"_shards": {
		"total": 3,
		"successful": 3,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 135,
		"max_score": 25.215496,
		"hits": [{
				"_index": "qaknowwledge",
				"_type": "doc",
				"_id": "11657847994935215",
				"_score": 25.215496,
				"_source": {
					"category": "11656187146936040",
					"id": "11657847994935215",
					"kId": "11657847993624508",
					"online": "1",
					"qStr": "儲存的問法1",
					"title": "知識1",
					"userId": "10869305621348777"
				}
			},
			{
				"_index": "qaknowwledge",
				"_type": "doc",
				"_id": "11657847994935216",
				"_score": 10.988454,
				"_source": {
					"category": "11656187146936040",
					"id": "11657847994935216",
					"kId": "11657847993624508",
					"online": "1",
					"questionStr": "問法2",
					"title": "知識2",
					"userId": "10869305621348777"
				}
			}
		]
	},
	"aggregations": {
		"groupkId": {
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 72,
			"buckets": [{
					"key": "11657847993624494",
					"doc_count": 5,
					"result": {
						"hits": {
							"total": 5,
							"max_score": 3.8905885,
							"hits": [{
								"_index": "qaknowwledge",
								"_type": "doc",
								"_id": "11657847994935160",
								"_score": 3.8905885,
								"_source": {
									"kId": "11657847993624494",
									"id": "11657847994935160",
									"title": "知識"
								}
							}]
						}
					},
					"scoreTop": {
						"value": 3.8905885219573975
					}
				}
			]
		}
	}
}

The highest-scoring record (the first hit) does not appear in the grouped result. Printing the query gives the following:

{
  "size": 20,
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "more_like_this": {
            "fields": [
              "questionStr"
            ],
            "like": [
              "問法"
            ],
            "max_query_terms": 25,
            "min_term_freq": 0,
            "min_doc_freq": 0,
            "max_doc_freq": 2147483647,
            "min_word_length": 2,
            "max_word_length": 0,
            "minimum_should_match": "30%",
            "boost_terms": 0,
            "include": false,
            "fail_on_unsupported_field": true,
            "boost": 1
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "online": [
              "1"
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "groupkId": {
      "terms": {
        "field": "kId",
        "size": 20,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "execution_hint": "map",
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "result": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "_source": {
              "includes": [
                "id",
                "title",
                "kId"
              ],
              "excludes": []
            }
          }
        },
        "scoreTop": {
          "max": {
            "script": {
              "source": "_score",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}

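Analysis shows that the ordering we intended did not take effect. As the query above shows, buckets are still ordered by the number of matching documents in each group, that is: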

"terms": {
	"field": "kId",
	"size": 20,
	"min_doc_count": 1,
	"shard_min_doc_count": 0,
	"show_term_doc_count_error": false,
	"execution_hint": "map",
	"order": [{
			"_count": "desc"
		},
		{
			"_key": "asc"
		}
	]
}
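The effect of that default order can be reproduced with a small plain-Java simulation. It is a sketch under stated assumptions: the `Bucket` class, the two comparators, and the short keys and doc counts are hypothetical (the two scores echo the result above, and the doc count of the higher-scoring group is invented). It only mimics how a size-limited bucket list is chosen and does not call Elasticsearch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class BucketOrderDemo {
    // Hypothetical summary of one kId group: how many docs matched and the best score.
    static final class Bucket {
        final String key;
        final long docCount;
        final double maxScore;

        Bucket(String key, long docCount, double maxScore) {
            this.key = key;
            this.docCount = docCount;
            this.maxScore = maxScore;
        }
    }

    // Mirrors the default order in the printed query: _count desc, then _key asc.
    static final Comparator<Bucket> BY_COUNT =
            Comparator.comparingLong((Bucket b) -> b.docCount).reversed()
                    .thenComparing((Bucket b) -> b.key);

    // Mirrors ordering by the per-group max score, descending.
    static final Comparator<Bucket> BY_MAX_SCORE =
            Comparator.comparingDouble((Bucket b) -> b.maxScore).reversed();

    // Sort the buckets with the given order and keep only the first `size` keys.
    static List<String> topKeys(List<Bucket> buckets, Comparator<Bucket> order, int size) {
        return buckets.stream()
                .sorted(order)
                .limit(size)
                .map(b -> b.key)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Bucket> buckets = Arrays.asList(
                new Bucket("k494", 5, 3.8905885),   // many matches, low best score
                new Bucket("k508", 2, 25.215496));  // few matches, highest score
        System.out.println("by count:     " + topKeys(buckets, BY_COUNT, 1));
        System.out.println("by max score: " + topKeys(buckets, BY_MAX_SCORE, 1));
    }
}
```

With a bucket limit of 1, ordering by count keeps the group with more matching phrasings and drops the one holding the best match, which is the symptom seen in the query result.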

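Reviewing the query code, we found that we had only defined the max-score sub-aggregation and never applied it as the order of the grouping. Applying it fixes the problem; the code is as follows: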

// Build the search request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Build the more_like_this query, restricted by the filters
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                    .filter(QueryBuilders.termsQuery("online", "1"))
                    .filter(QueryBuilders.termsQuery("userId", userId))
                    .filter(QueryBuilders.termsQuery("category", category.split(",")))
                    .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null).minTermFreq(0).minDocFreq(0).minWordLength(2));
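// Per-group maximum score, used as the sort key.
// The name must match the "scoreTop" path passed to BucketOrder.aggregation below
// (Script is org.elasticsearch.script.Script).
AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
// Take the highest-scoring record in each group (top_hits sorts by _score by default)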
AggregationBuilder top = AggregationBuilders.topHits("result")
                .fetchSource(new String[]{"id", "title", "kId", "qSimhas"}, null)
                .size(1);
// Group by kId and order the buckets by each group's max score, descending
TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
                .field("kId").executionHint("map").order(BucketOrder.aggregation("scoreTop", false));
// Number of groups (buckets) to return
groupTermsBuilder.size(maxNum);
groupTermsBuilder.subAggregation(top);
groupTermsBuilder.subAggregation(maxScore);

searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);

In other words, apply "scoreTop" as the order on "groupTermsBuilder". The printed query then shows that the buckets are ordered by each group's highest score.
