Lucene source code analysis---8
Lucene source code analysis: the search process
This chapter begins our look at Lucene's search process, i.e. IndexSearcher's search function.
IndexSearcher::search
public TopDocs search(Query query, int n)
throws IOException {
return searchAfter(null, query, n);
}
public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOException {
final int limit = Math.max(1, reader.maxDoc());
final int cappedNumHits = Math.min(numHits, limit);
final CollectorManager<TopScoreDocCollector, TopDocs> manager = new CollectorManager<TopScoreDocCollector, TopDocs>() {
@Override
public TopScoreDocCollector newCollector() throws IOException {
...
}
@Override
public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
...
}
};
return search(query, manager);
}
The query parameter wraps the query expression, and n asks for the top n results. The computation at the start of searchAfter ensures the number of hits never exceeds the total number of documents. A CollectorManager is then created, and the overloaded search function is called to continue.
IndexSearcher::search->searchAfter->search
public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
if (executor == null) {
final C collector = collectorManager.newCollector();
search(query, collector);
return collectorManager.reduce(Collections.singletonList(collector));
} else {
...
}
}
Assume a single-threaded search, so executor is null. CollectorManager's newCollector first creates a TopScoreDocCollector; each TopScoreDocCollector holds one set of final hits, and in a multi-threaded search the several TopScoreDocCollectors must be merged at the end.
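The newCollector/reduce split is exactly what makes the search parallelizable: each worker fills its own collector, and reduce merges the per-collector hits. A minimal sketch of the same pattern, with hypothetical types (plain lists of {doc, score} pairs, not Lucene's API):

```java
import java.util.*;

// Sketch of the CollectorManager pattern (hypothetical types, not Lucene's
// API): each worker fills its own collector, and reduce() merges the
// per-collector hits into one globally sorted top-n result.
public class ManagerSketch {
    // A "collector" here is just a list of {doc, score} pairs.
    static List<double[]> newCollector() { return new ArrayList<>(); }

    // Merge all collectors and keep the n best hits by score.
    static List<double[]> reduce(Collection<List<double[]>> collectors, int n) {
        List<double[]> all = new ArrayList<>();
        for (List<double[]> c : collectors) all.addAll(c);
        all.sort((a, b) -> Double.compare(b[1], a[1])); // best score first
        return all.subList(0, Math.min(n, all.size()));
    }

    public static void main(String[] args) {
        List<double[]> c1 = newCollector();
        c1.add(new double[]{0, 1.5});
        List<double[]> c2 = newCollector();
        c2.add(new double[]{7, 3.0});
        c2.add(new double[]{3, 0.5});
        List<double[]> top = reduce(Arrays.asList(c1, c2), 2);
        System.out.println((int) top.get(0)[0]); // best doc across collectors
    }
}
```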
IndexSearcher::search->searchAfter->search->CollectorManager::newCollector
public TopScoreDocCollector newCollector() throws IOException {
return TopScoreDocCollector.create(cappedNumHits, after);
}
public static TopScoreDocCollector create(int numHits, ScoreDoc after) {
if (after == null) {
return new SimpleTopScoreDocCollector(numHits);
} else {
return new PagingTopScoreDocCollector(numHits, after);
}
}
The after parameter implements paging-like behavior; assume it is null here, so newCollector finally returns a SimpleTopScoreDocCollector. With the TopScoreDocCollector created, the overloaded search function is called next.
IndexSearcher::search->searchAfter->search->search
public void search(Query query, Collector results)
throws IOException {
search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
}
leafContexts is the leaves member of CompositeReaderContext: a list of LeafReaderContexts, each wrapping one segment's SegmentReader, which can read all of that segment's information and data. Next, createNormalizedWeight performs the query matching and computes some base weights for the later scoring phase.
public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
query = rewrite(query);
Weight weight = createWeight(query, needsScores);
float v = weight.getValueForNormalization();
float norm = getSimilarity(needsScores).queryNorm(v);
if (Float.isInfinite(norm) || Float.isNaN(norm)) {
norm = 1.0f;
}
weight.normalize(norm, 1.0f);
return weight;
}
First, the rewrite function rewrites the Query, for example removing unnecessary clauses and converting non-atomic queries into atomic ones.
rewrite
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->rewrite
public Query rewrite(Query original) throws IOException {
Query query = original;
for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query;
rewrittenQuery = query.rewrite(reader)) {
query = rewrittenQuery;
}
return query;
}
Each Query's rewrite function is called in a loop; the loop is needed because a rewrite that changes the Query's structure may expose further rewritable parts. Assume the query here is a BooleanQuery. A BooleanQuery contains no actual query term itself; it holds several sub-queries, each of which is either an indivisible query such as a TermQuery or another BooleanQuery.
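The fixpoint loop can be sketched independently of Lucene with a toy Query type (hypothetical, for illustration only): rewrite repeatedly until the query stops changing.

```java
// Sketch of the rewrite fixpoint loop with a toy Query type (not Lucene's):
// a single-clause wrapper rewrites to its inner query, which may itself be
// rewritable, so we loop until rewrite() returns the same object.
public class RewriteSketch {
    interface Query { default Query rewrite() { return this; } }
    record Term(String text) implements Query {}
    record SingleClause(Query inner) implements Query {
        public Query rewrite() { return inner; }
    }

    static Query fullyRewrite(Query q) {
        for (Query r = q.rewrite(); r != q; r = q.rewrite()) {
            q = r; // one rewrite may expose further rewritable structure
        }
        return q;
    }

    public static void main(String[] args) {
        // Two nested single-clause wrappers need two rewrite rounds.
        Query q = new SingleClause(new SingleClause(new Term("lucene")));
        System.out.println(fullyRewrite(q)); // Term[text=lucene]
    }
}
```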
Since BooleanQuery's rewrite function is long, we go through it in parts.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 1
public Query rewrite(IndexReader reader) throws IOException {
if (clauses.size() == 1) {
BooleanClause c = clauses.get(0);
Query query = c.getQuery();
if (minimumNumberShouldMatch == 1 && c.getOccur() == Occur.SHOULD) {
return query;
} else if (minimumNumberShouldMatch == 0) {
switch (c.getOccur()) {
case SHOULD:
case MUST:
return query;
case FILTER:
return new BoostQuery(new ConstantScoreQuery(query), 0);
case MUST_NOT:
return new MatchNoDocsQuery();
default:
throw new AssertionError();
}
}
}
...
}
If the BooleanQuery holds only one sub-query there is no need to wrap it; the inner Query is returned directly.
The minimumNumberShouldMatch member is the minimum number of clauses that must match. If the only clause is a SHOULD clause and one match suffices, its Query is returned directly. With minimumNumberShouldMatch == 0, a MUST or SHOULD clause likewise returns its Query directly; a FILTER clause is wrapped in a ConstantScoreQuery inside a zero-boost BoostQuery and returned; a MUST_NOT clause means the sole sub-query must match no documents at all, so a MatchNoDocsQuery is created.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 2
public Query rewrite(IndexReader reader) throws IOException {
...
{
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.setDisableCoord(isCoordDisabled());
builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
boolean actuallyRewritten = false;
for (BooleanClause clause : this) {
Query query = clause.getQuery();
Query rewritten = query.rewrite(reader);
if (rewritten != query) {
actuallyRewritten = true;
}
builder.add(rewritten, clause.getOccur());
}
if (actuallyRewritten) {
return builder.build();
}
}
...
}
This part of rewrite iterates over all of the BooleanQuery's sub-queries and calls rewrite on each recursively. If any call returns a Query different from the one passed in, some sub-query was rewritten, and a new BooleanQuery is produced through BooleanQuery.Builder's build function.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 3
public Query rewrite(IndexReader reader) throws IOException {
...
{
int clauseCount = 0;
for (Collection<Query> queries : clauseSets.values()) {
clauseCount += queries.size();
}
if (clauseCount != clauses.size()) {
BooleanQuery.Builder rewritten = new BooleanQuery.Builder();
rewritten.setDisableCoord(disableCoord);
rewritten.setMinimumNumberShouldMatch(minimumNumberShouldMatch);
for (Map.Entry<Occur, Collection<Query>> entry : clauseSets.entrySet()) {
final Occur occur = entry.getKey();
for (Query query : entry.getValue()) {
rewritten.add(query, occur);
}
}
return rewritten.build();
}
}
...
}
clauseSets stores the clauses whose occur is MUST_NOT or FILTER, backed by HashSets; the HashSet structure deduplicates repeated MUST_NOT and FILTER sub-queries, and if anything was dropped (clauseCount differs from clauses.size()) the BooleanQuery is rebuilt from the sets.
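The deduplication relies only on the sub-queries' equals/hashCode; a sketch of the same idea, with strings standing in for sub-queries (hypothetical types, not Lucene's):

```java
import java.util.*;

// Sketch of the clauseSets idea: FILTER and MUST_NOT clauses live in
// HashSets, so duplicate (equal) sub-queries collapse automatically,
// while MUST and SHOULD clauses are kept as-is. Strings stand in for
// sub-queries here.
public class DedupSketch {
    enum Occur { MUST, SHOULD, FILTER, MUST_NOT }

    static Map<Occur, Collection<String>> buildClauseSets(List<Map.Entry<Occur, String>> clauses) {
        Map<Occur, Collection<String>> sets = new EnumMap<>(Occur.class);
        sets.put(Occur.MUST, new ArrayList<>());     // duplicates kept
        sets.put(Occur.SHOULD, new ArrayList<>());   // duplicates kept
        sets.put(Occur.FILTER, new HashSet<>());     // duplicates dropped
        sets.put(Occur.MUST_NOT, new HashSet<>());   // duplicates dropped
        for (Map.Entry<Occur, String> c : clauses) {
            sets.get(c.getKey()).add(c.getValue());
        }
        return sets;
    }

    public static void main(String[] args) {
        List<Map.Entry<Occur, String>> clauses = List.of(
            Map.entry(Occur.FILTER, "color:red"),
            Map.entry(Occur.FILTER, "color:red"),  // duplicate FILTER clause
            Map.entry(Occur.MUST, "title:lucene"));
        Map<Occur, Collection<String>> sets = buildClauseSets(clauses);
        int clauseCount = sets.values().stream().mapToInt(Collection::size).sum();
        // clauseCount (2) != clauses.size() (3), so rewrite would rebuild the query
        System.out.println(clauseCount); // 2
    }
}
```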
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 4
public Query rewrite(IndexReader reader) throws IOException {
...
if (clauseSets.get(Occur.MUST).size() > 0 && clauseSets.get(Occur.FILTER).size() > 0) {
final Set<Query> filters = new HashSet<Query>(clauseSets.get(Occur.FILTER));
boolean modified = filters.remove(new MatchAllDocsQuery());
modified |= filters.removeAll(clauseSets.get(Occur.MUST));
if (modified) {
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.setDisableCoord(isCoordDisabled());
builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
for (BooleanClause clause : clauses) {
if (clause.getOccur() != Occur.FILTER) {
builder.add(clause);
}
}
for (Query filter : filters) {
builder.add(filter, Occur.FILTER);
}
return builder.build();
}
}
...
}
This part removes sub-queries that appear both as FILTER and as MUST, and removes a match-all-documents FILTER (safe because the clause count here is certainly greater than 1): the result set of a match-all query contains every other query's result set, so it adds nothing as a filter.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 5
{
final Collection<Query> musts = clauseSets.get(Occur.MUST);
final Collection<Query> filters = clauseSets.get(Occur.FILTER);
if (musts.size() == 1
&& filters.size() > 0) {
Query must = musts.iterator().next();
float boost = 1f;
if (must instanceof BoostQuery) {
BoostQuery boostQuery = (BoostQuery) must;
must = boostQuery.getQuery();
boost = boostQuery.getBoost();
}
if (must.getClass() == MatchAllDocsQuery.class) {
BooleanQuery.Builder builder = new BooleanQuery.Builder();
for (BooleanClause clause : clauses) {
switch (clause.getOccur()) {
case FILTER:
case MUST_NOT:
builder.add(clause);
break;
default:
break;
}
}
Query rewritten = builder.build();
rewritten = new ConstantScoreQuery(rewritten);
builder = new BooleanQuery.Builder()
.setDisableCoord(isCoordDisabled())
.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch())
.add(rewritten, Occur.MUST);
for (Query query : clauseSets.get(Occur.SHOULD)) {
builder.add(query, Occur.SHOULD);
}
rewritten = builder.build();
return rewritten;
}
}
}
return super.rewrite(reader);
If a MatchAllDocsQuery is the only MUST clause, the query is rewritten as shown. Finally, if nothing was rewritten, the parent Query's rewrite is called, which simply returns the query itself.
Having covered BooleanQuery's rewrite, here is a brief look at the rewrite functions of other Query types.
TermQuery's rewrite returns the query itself. SynonymQuery's rewrite checks whether it holds just one Query and, if so, converts it into a TermQuery. WildcardQuery, PrefixQuery, RegexpQuery and FuzzyQuery all extend MultiTermQuery: WildcardQuery's rewrite returns a MultiTermQueryConstantScoreWrapper around the original Query, PrefixQuery's rewrite likewise returns a MultiTermQueryConstantScoreWrapper, RegexpQuery behaves like PrefixQuery, and FuzzyQuery ultimately returns a BlendedTermQuery depending on the situation.
Back in createNormalizedWeight: once the Query has been rewritten, createWeight performs the matching and computes the weights.
createWeight
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight
public Weight createWeight(Query query, boolean needsScores) throws IOException {
final QueryCache queryCache = this.queryCache;
Weight weight = query.createWeight(this, needsScores);
if (needsScores == false && queryCache != null) {
weight = queryCache.doCache(weight, queryCachingPolicy);
}
return weight;
}
IndexSearcher's queryCache member is initialized to an LRUQueryCache. createWeight delegates to each Query's own createWeight function; assume a BooleanQuery.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
BooleanQuery query = this;
if (needsScores == false) {
query = rewriteNoScoring();
}
return new BooleanWeight(query, searcher, needsScores, disableCoord);
}
needsScores is true by default for SimpleTopScoreDocCollector. createWeight creates and returns a BooleanWeight.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight->BooleanWeight::BooleanWeight
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, boolean disableCoord) throws IOException {
super(query);
this.query = query;
this.needsScores = needsScores;
this.similarity = searcher.getSimilarity(needsScores);
weights = new ArrayList<>();
int i = 0;
int maxCoord = 0;
for (BooleanClause c : query) {
Weight w = searcher.createWeight(c.getQuery(), needsScores && c.isScoring());
weights.add(w);
if (c.isScoring()) {
maxCoord++;
}
i += 1;
}
this.maxCoord = maxCoord;
coords = new float[maxCoord+1];
Arrays.fill(coords, 1F);
coords[0] = 0f;
if (maxCoord > 0 && needsScores && disableCoord == false) {
boolean seenActualCoord = false;
for (i = 1; i < coords.length; i++) {
coords[i] = coord(i, maxCoord);
seenActualCoord |= (coords[i] != 1F);
}
this.disableCoord = seenActualCoord == false;
} else {
this.disableCoord = true;
}
}
getSimilarity returns IndexSearcher's default BM25Similarity. The BooleanWeight constructor calls createWeight recursively to obtain each sub-query's Weight; assume the sub-query is a TermQuery, whose createWeight is examined below. maxCoord counts the scoring sub-queries, and the trailing coords array scales a matching document's score, computed as coord(q, maxCoord) = q / maxCoord, where q is the number of scoring clauses the document matched.
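Under that formula the coords table the constructor builds is straightforward to precompute; a minimal sketch:

```java
// Sketch of the coords table built by the BooleanWeight constructor:
// coord(q, maxCoord) = q / maxCoord rewards documents that match more of
// the scoring clauses. coords[0] stays 0 so a document matching no
// scoring clause contributes nothing.
public class CoordSketch {
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    static float[] buildCoords(int maxCoord) {
        float[] coords = new float[maxCoord + 1];
        coords[0] = 0f;
        for (int i = 1; i <= maxCoord; i++) {
            coords[i] = coord(i, maxCoord);
        }
        return coords;
    }

    public static void main(String[] args) {
        float[] coords = buildCoords(4);
        System.out.println(coords[2]); // 0.5: half of the clauses matched
    }
}
```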
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
final IndexReaderContext context = searcher.getTopReaderContext();
final TermContext termState;
if (perReaderTermState == null
|| perReaderTermState.topReaderContext != context) {
termState = TermContext.build(context, term);
} else {
termState = this.perReaderTermState;
}
return new TermWeight(searcher, needsScores, termState);
}
getTopReaderContext returns the CompositeReaderContext wrapping the SegmentReaders.
perReaderTermState defaults to null, so TermContext's build function next performs the lookup and gathers the Term's metadata from the index; a TermWeight is then created from the resulting TermContext and returned.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build
public static TermContext build(IndexReaderContext context, Term term)
throws IOException {
final String field = term.field();
final BytesRef bytes = term.bytes();
final TermContext perReaderTermState = new TermContext(context);
for (final LeafReaderContext ctx : context.leaves()) {
final Terms terms = ctx.reader().terms(field);
if (terms != null) {
final TermsEnum termsEnum = terms.iterator();
if (termsEnum.seekExact(bytes)) {
final TermState termState = termsEnum.termState();
perReaderTermState.register(termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
}
}
}
return perReaderTermState;
}
Term's bytes function returns the query bytes, UTF-8 encoded by default. LeafReaderContext's reader function returns the SegmentReader, and its terms function returns the FieldReader used to read the on-disk data.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->SegmentReader::terms
public final Terms terms(String field) throws IOException {
return fields().terms(field);
}
public final Fields fields() {
return getPostingsReader();
}
public FieldsProducer getPostingsReader() {
ensureOpen();
return core.fields;
}
public Terms terms(String field) throws IOException {
FieldsProducer fieldsProducer = fields.get(field);
return fieldsProducer == null ? null : fieldsProducer.terms(field);
}
core is the SegmentCoreReaders created in the SegmentReader constructor, and its fields member is the PerFieldPostingsFormat reader. fields.get finally returns the BlockTreeTermsReader configured when the index was built.
BlockTreeTermsReader's terms function finally returns the FieldReader for the given field.
Back in TermContext's build function: iterator returns a SegmentTermsEnum, and seekExact performs the lookup. On a match, SegmentTermsEnum's termState returns an IntBlockTermState wrapping the Term's metadata (seekExact is analyzed in the next chapter). build finally stores the computed IntBlockTermState through TermContext's register function.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->register
public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
register(state, ord);
accumulateStatistics(docFreq, totalTermFreq);
}
public void register(TermState state, final int ord) {
states[ord] = state;
}
public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
this.docFreq += docFreq;
if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
this.totalTermFreq += totalTermFreq;
else
this.totalTermFreq = -1;
}
The ord parameter identifies a unique IndexReaderContext, i.e. one segment. register stores the TermState (in fact an IntBlockTermState) into the states array, and accumulateStatistics then updates the aggregate statistics.
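The accumulation adds each segment's docFreq and totalTermFreq into index-wide totals, with -1 acting as a sentinel for "unavailable". A self-contained sketch of that arithmetic:

```java
// Sketch of TermContext-style statistics accumulation: per-segment docFreq
// and totalTermFreq are summed into index-wide totals; a -1 from any
// segment poisons totalTermFreq, marking it unavailable from then on.
public class StatsSketch {
    int docFreq = 0;
    long totalTermFreq = 0;

    void accumulate(int segDocFreq, long segTotalTermFreq) {
        this.docFreq += segDocFreq;
        if (this.totalTermFreq >= 0 && segTotalTermFreq >= 0) {
            this.totalTermFreq += segTotalTermFreq;
        } else {
            this.totalTermFreq = -1;
        }
    }

    public static void main(String[] args) {
        StatsSketch s = new StatsSketch();
        s.accumulate(3, 10);  // segment 0
        s.accumulate(2, 7);   // segment 1
        System.out.println(s.docFreq + " " + s.totalTermFreq); // 5 17
        s.accumulate(1, -1);  // a segment that did not record frequencies
        System.out.println(s.totalTermFreq); // -1
    }
}
```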
Back in TermQuery's createWeight, a TermWeight is finally created and returned.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight
public TermWeight(IndexSearcher searcher, boolean needsScores, TermContext termStates)
throws IOException {
super(TermQuery.this);
this.needsScores = needsScores;
this.termStates = termStates;
this.similarity = searcher.getSimilarity(needsScores);
final CollectionStatistics collectionStats;
final TermStatistics termStats;
if (needsScores) {
collectionStats = searcher.collectionStatistics(term.field());
termStats = searcher.termStatistics(term, termStates);
} else {
...
}
this.stats = similarity.computeWeight(collectionStats, termStats);
}
Broadly speaking, collectionStatistics gathers statistics about a field, while termStatistics gathers statistics about a single term.
These two results are then passed to computeWeight to compute the weight. We look at each in turn.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::collectionStatistics
public CollectionStatistics collectionStatistics(String field) throws IOException {
final int docCount;
final long sumTotalTermFreq;
final long sumDocFreq;
Terms terms = MultiFields.getTerms(reader, field);
if (terms == null) {
docCount = 0;
sumTotalTermFreq = 0;
sumDocFreq = 0;
} else {
docCount = terms.getDocCount();
sumTotalTermFreq = terms.getSumTotalTermFreq();
sumDocFreq = terms.getSumDocFreq();
}
return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
}
getTerms works as analyzed earlier and finally returns a FieldReader, from which we read docCount (the number of documents), sumTotalTermFreq (the sum of all termFreq values, i.e. the total number of term occurrences across documents) and sumDocFreq (the sum of all docFreq values, i.e. how many documents contain each term). A CollectionStatistics wrapping these values is created and returned.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::termStatistics
public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq());
}
docFreq is the number of documents containing the term and totalTermFreq the total number of occurrences of the term; a TermStatistics is created and returned, with a trivial constructor.
Back in the TermWeight constructor, similarity defaults to BM25Similarity, whose computeWeight function follows.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight
public final SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
float avgdl = avgFieldLength(collectionStats);
float cache[] = new float[256];
for (int i = 0; i < cache.length; i++) {
cache[i] = k1 * ((1 - b) + b * decodeNormValue((byte)i) / avgdl);
}
return new BM25Stats(collectionStats.field(), idf, avgdl, cache);
}
The idfExplain function computes the idf, i.e. the inverse document frequency.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->idfExplain
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
final long df = termStats.docFreq();
final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
final float idf = idf(df, docCount);
return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
}
df is the number of documents containing the term, and docCount is the total number of documents. BM25's idf function computes log(1 + (docCount - df + 0.5) / (df + 0.5)): the more documents a term occurs in, the lower its idf, so very common terms contribute little to a document's score.
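That dependence on df is easy to check numerically; a sketch of the idf computation (assuming the standard BM25 formula stated above):

```java
// Sketch of BM25's idf: log(1 + (docCount - df + 0.5) / (df + 0.5)).
// The more documents contain the term, the smaller the idf, so very
// common terms are down-weighted relative to rare ones.
public class IdfSketch {
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // A rare term outweighs a common one.
        System.out.println(idf(1, 1000));   // high
        System.out.println(idf(900, 1000)); // near zero, but still positive
    }
}
```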
Back in computeWeight, avgFieldLength computes the average number of terms per document.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->avgFieldLength
protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
return 1f;
} else {
final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
return (float) (sumTotalTermFreq / (double) docCount);
}
}
avgFieldLength divides the total term frequency by the number of documents, giving the average number of terms per document. Back in computeWeight, the BM25 coefficients are then precomputed (BM25 is the algorithm Lucene uses for ranking), and finally a BM25Stats is created and returned.
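avgFieldLength and the 256-entry cache together precompute the length-normalization part of BM25, so scoring later needs only a table lookup. A sketch of both steps; note that decodeNorm below is a simplifying assumption that treats the encoded byte directly as the field length, whereas Lucene actually uses a lossy SmallFloat encoding:

```java
// Sketch of computeWeight's precomputation: for every possible encoded norm
// byte, cache[i] = k1 * ((1 - b) + b * fieldLength / avgdl). decodeNorm is a
// stand-in (byte taken as the field length directly); Lucene really decodes
// a lossy SmallFloat-encoded norm here.
public class Bm25CacheSketch {
    static final float K1 = 1.2f, B = 0.75f;

    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        if (sumTotalTermFreq <= 0) return 1f;
        return (float) (sumTotalTermFreq / (double) docCount);
    }

    static float decodeNorm(int encoded) { return encoded; } // simplifying assumption

    static float[] buildCache(float avgdl) {
        float[] cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
            cache[i] = K1 * ((1 - B) + B * decodeNorm(i) / avgdl);
        }
        return cache;
    }

    public static void main(String[] args) {
        float avgdl = avgFieldLength(5000, 100); // 50 terms per doc on average
        float[] cache = buildCache(avgdl);
        // The cache value ends up in the score's denominator, so longer
        // fields (larger norm) are penalized relative to shorter ones.
        System.out.println(cache[10] < cache[100]);
    }
}
```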
Back in createNormalizedWeight, getValueForNormalization next computes the normalization value.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization
public float getValueForNormalization() throws IOException {
float sum = 0.0f;
int i = 0;
for (BooleanClause clause : query) {
float s = weights.get(i).getValueForNormalization();
if (clause.isScoring()) {
sum += s;
}
i += 1;
}
return sum ;
}
BooleanWeight's getValueForNormalization accumulates the values returned by its scoring sub-queries' getValueForNormalization functions. Assuming a TermQuery sub-query, the corresponding TermWeight's version follows.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization->TermWeight::getValueForNormalization
public float getValueForNormalization() {
return stats.getValueForNormalization();
}
public float getValueForNormalization() {
return weight * weight;
}
public void normalize(float queryNorm, float boost) {
this.boost = boost;
this.weight = idf.getValue() * boost;
}
stats is the BM25Stats; its getValueForNormalization finally returns the square of the idf value multiplied by the boost.
Back in createNormalizedWeight, queryNorm simply returns 1, and normalize recomputes the weights from that norm. First, BooleanWeight's normalize:
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::normalize
public void normalize(float norm, float boost) {
for (Weight w : weights) {
w.normalize(norm, boost);
}
}
Assume the sub-query's Weight is a TermWeight.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->TermWeight::normalize
public void normalize(float queryNorm, float boost) {
stats.normalize(queryNorm, boost);
}
public void normalize(float queryNorm, float boost) {
this.boost = boost;
this.weight = idf.getValue() * boost;
}
Back in IndexSearcher's search function: after createNormalizedWeight returns the Weight, the overloaded search defined below is called.
IndexSearcher::search->searchAfter->search->search->search
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
throws IOException {
for (LeafReaderContext ctx : leaves) {
final LeafCollector leafCollector;
try {
leafCollector = collector.getLeafCollector(ctx);
} catch (CollectionTerminatedException e) {
}
BulkScorer scorer = weight.bulkScorer(ctx);
if (scorer != null) {
try {
scorer.score(leafCollector, ctx.reader().getLiveDocs());
} catch (CollectionTerminatedException e) {
}
}
}
}
As covered in "Lucene source code analysis 6", leaves is the list of LeafReaderContexts wrapping the SegmentReaders, and collector is the SimpleTopScoreDocCollector.
IndexSearcher::search->searchAfter->search->search->search->SimpleTopScoreDocCollector::getLeafCollector
public LeafCollector getLeafCollector(LeafReaderContext context)
throws IOException {
final int docBase = context.docBase;
return new ScorerLeafCollector() {
@Override
public void collect(int doc) throws IOException {
float score = scorer.score();
totalHits++;
if (score <= pqTop.score) {
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
}
};
}
getLeafCollector creates and returns a ScorerLeafCollector. Its collect callback compares each hit's score against pqTop, the weakest hit currently kept in the priority queue pq, and only a better score displaces it.
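The collect callback above is classic top-N selection with a bounded min-heap: the head is the weakest kept hit, and a competitive newcomer evicts it. A self-contained sketch of the same technique:

```java
import java.util.*;

// Sketch of SimpleTopScoreDocCollector's collect loop: keep the numHits
// best {doc, score} pairs in a min-heap ordered by score, so peek() is the
// weakest kept hit. A hit that cannot beat the head is dropped immediately.
public class TopNSketch {
    static List<double[]> topN(List<double[]> hits, int numHits) {
        PriorityQueue<double[]> pq =
            new PriorityQueue<>(Comparator.comparingDouble((double[] h) -> h[1]));
        for (double[] hit : hits) {
            if (pq.size() < numHits) {
                pq.offer(hit);
            } else if (hit[1] > pq.peek()[1]) {
                pq.poll();      // evict the weakest kept hit
                pq.offer(hit);  // keep the stronger newcomer
            }
        }
        List<double[]> result = new ArrayList<>(pq);
        result.sort((a, b) -> Double.compare(b[1], a[1])); // best first
        return result;
    }

    public static void main(String[] args) {
        List<double[]> hits = List.of(
            new double[]{0, 1.0}, new double[]{1, 4.0},
            new double[]{2, 2.5}, new double[]{3, 0.3});
        for (double[] h : topN(hits, 2)) {
            System.out.println((int) h[0] + " " + h[1]); // doc 1 then doc 2
        }
    }
}
```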
Back in search, the Weight's bulkScorer function is called next to obtain a BulkScorer, which drives the scoring.
bulkScorer
Assume the Weight created by createNormalizedWeight is a BooleanWeight; its bulkScorer function follows.
IndexSearcher::search->searchAfter->search->search->search->BooleanWeight::bulkScorer
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
final BulkScorer bulkScorer = booleanScorer(context);
if (bulkScorer != null) {
return bulkScorer;
} else {
return super.bulkScorer(context);
}
}
bulkScorer first tries to build a BooleanScorer; assume that returns null, so the parent Weight's bulkScorer is called and its result returned.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
Scorer scorer = scorer(context);
if (scorer == null) {
return null;
}
return new DefaultBulkScorer(scorer);
}
The scorer function is overridden in BooleanWeight.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer
public Scorer scorer(LeafReaderContext context) throws IOException {
int minShouldMatch = query.getMinimumNumberShouldMatch();
List<Scorer> required = new ArrayList<>();
List<Scorer> requiredScoring = new ArrayList<>();
List<Scorer> prohibited = new ArrayList<>();
List<Scorer> optional = new ArrayList<>();
Iterator<BooleanClause> cIter = query.iterator();
for (Weight w : weights) {
BooleanClause c = cIter.next();
Scorer subScorer = w.scorer(context);
if (subScorer == null) {
if (c.isRequired()) {
return null;
}
} else if (c.isRequired()) {
required.add(subScorer);
if (c.isScoring()) {
requiredScoring.add(subScorer);
}
} else if (c.isProhibited()) {
prohibited.add(subScorer);
} else {
optional.add(subScorer);
}
}
if (optional.size() == minShouldMatch) {
required.addAll(optional);
requiredScoring.addAll(optional);
optional.clear();
minShouldMatch = 0;
}
if (required.isEmpty() && optional.isEmpty()) {
return null;
} else if (optional.size() < minShouldMatch) {
return null;
}
if (!needsScores && minShouldMatch == 0 && required.size() > 0) {
optional.clear();
}
if (optional.isEmpty()) {
return excl(req(required, requiredScoring, disableCoord), prohibited);
}
if (required.isEmpty()) {
return excl(opt(optional, minShouldMatch, disableCoord), prohibited);
}
Scorer req = excl(req(required, requiredScoring, true), prohibited);
Scorer opt = opt(optional, minShouldMatch, true);
if (disableCoord) {
if (minShouldMatch > 0) {
return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), 1F);
} else {
return new ReqOptSumScorer(req, opt);
}
} else if (optional.size() == 1) {
if (minShouldMatch > 0) {
return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), coord(requiredScoring.size()+1, maxCoord));
} else {
float coordReq = coord(requiredScoring.size(), maxCoord);
float coordBoth = coord(requiredScoring.size() + 1, maxCoord);
return new BooleanTopLevelScorers.ReqSingleOptScorer(req, opt, coordReq, coordBoth);
}
} else {
if (minShouldMatch > 0) {
return new BooleanTopLevelScorers.CoordinatingConjunctionScorer(this, coords, req, requiredScoring.size(), opt);
} else {
return new BooleanTopLevelScorers.ReqMultiOptScorer(req, opt, requiredScoring.size(), coords);
}
}
}
BooleanWeight's scorer calls each sub-query Weight's scorer in turn; assume a TermWeight.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer
public Scorer scorer(LeafReaderContext context) throws IOException {
final TermsEnum termsEnum = getTermsEnum(context);
PostingsEnum docs = termsEnum.postings(null, needsScores ? PostingsEnum.FREQS : PostingsEnum.NONE);
return new TermScorer(this, docs, similarity.simScorer(stats, context));
}
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->getTermsEnum
private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
final TermState state = termStates.get(context.ord);
final TermsEnum termsEnum = context.reader().terms(term.field())
.iterator();
termsEnum.seekExact(term.bytes(), state);
return termsEnum;
}
First the TermState computed earlier is fetched; iterator returns a SegmentTermsEnum. SegmentTermsEnum's seekExact here mainly installs that previously computed TermState; the details are deferred to the next chapter.
Back in TermWeight's scorer, SegmentTermsEnum's postings function is called next, which ends up in Lucene50PostingsReader's postings.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->Lucene50PostingsReader::postings
public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, PostingsEnum reuse, int flags) throws IOException {
boolean indexHasPositions = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
boolean indexHasOffsets = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
boolean indexHasPayloads = fieldInfo.hasPayloads();
if (indexHasPositions == false || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false) {
BlockDocsEnum docsEnum;
if (reuse instanceof BlockDocsEnum) {
...
} else {
docsEnum = new BlockDocsEnum(fieldInfo);
}
return docsEnum.reset((IntBlockTermState) termState, flags);
} else if ((indexHasOffsets == false || PostingsEnum.featureRequested(flags, PostingsEnum.OFFSETS) == false) &&
(indexHasPayloads == false || PostingsEnum.featureRequested(flags, PostingsEnum.PAYLOADS) == false)) {
...
} else {
...
}
}
First the index's storage options are read. Assume execution enters the first if branch; the reuse parameter defaults to null, so a BlockDocsEnum is created and initialized through its reset function.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->BM25Similarity::simScorer
public final SimScorer simScorer(SimWeight stats, LeafReaderContext context) throws IOException {
BM25Stats bm25stats = (BM25Stats) stats;
return new BM25DocScorer(bm25stats, context.reader().getNormValues(bm25stats.field));
}
public final NumericDocValues getNormValues(String field) throws IOException {
ensureOpen();
Map<String,NumericDocValues> normFields = normsLocal.get();
NumericDocValues norms = normFields.get(field);
if (norms != null) {
return norms;
} else {
FieldInfo fi = getFieldInfos().fieldInfo(field);
if (fi == null || !fi.hasNorms()) {
return null;
}
norms = getNormsReader().getNorms(fi);
normFields.put(field, norms);
return norms;
}
}
getNormsReader returns the Lucene53NormsProducer, whose getNorms function reads the field's norms from the .nvd and .nvm files and builds a NumericDocValues. simScorer finally creates and returns a BM25DocScorer.
Back in TermWeight's scorer, a TermScorer is finally created and returned.
Back in BooleanWeight's scorer: if the number of SHOULD scorers equals minShouldMatch, then even though those clauses are SHOULD they must all match, so they are moved into the MUST lists. Next, if both the required and optional lists are empty there are no query conditions at all and null is returned; likewise, if there are fewer SHOULD scorers than minShouldMatch, too few optional clauses can match and null is returned. Then, if optional is empty there are no SHOULD scorers, so req wraps the MUST scorers and excl excludes the MUST_NOT scorers; conversely, if required is empty there are no MUST scorers, and opt wraps the SHOULD scorers.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->req
private Scorer req(List<Scorer> required, List<Scorer> requiredScoring, boolean disableCoord) {
if (required.size() == 1) {
Scorer req = required.get(0);
if (needsScores == false) {
return req;
}
if (requiredScoring.isEmpty()) {
return new FilterScorer(req) {
@Override
public float score() throws IOException {
return 0f;
}
@Override
public int freq() throws IOException {
return 0;
}
};
}
float boost = 1f;
if (disableCoord == false) {
boost = coord(1, maxCoord);
}
if (boost == 1f) {
return req;
}
return new BooleanTopLevelScorers.BoostedScorer(req, boost);
} else {
return new ConjunctionScorer(this, required, requiredScoring,
disableCoord ? 1.0F : coord(requiredScoring.size(), maxCoord));
}
}
If there is more than one MUST scorer, a ConjunctionScorer is created directly. Otherwise, with a single required scorer: if requiredScoring is empty, the sole MUST scorer needs no scoring and a FilterScorer returning 0 is used; otherwise a BoostedScorer is returned when the computed coord boost differs from 1, and in the remaining cases the scorer itself is returned.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->excl
private Scorer excl(Scorer main, List<Scorer> prohibited) throws IOException {
if (prohibited.isEmpty()) {
return main;
} else if (prohibited.size() == 1) {
return new ReqExclScorer(main, prohibited.get(0));
} else {
float coords[] = new float[prohibited.size()+1];
Arrays.fill(coords, 1F);
return new ReqExclScorer(main, new DisjunctionSumScorer(this, prohibited, coords, false));
}
}
excl either returns the scorer unchanged or, when MUST_NOT scorers exist, wraps it in a ReqExclScorer, which excludes documents matched by the prohibited scorers.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->opt
private Scorer opt(List<Scorer> optional, int minShouldMatc