
Lucene source code analysis - 8

Lucene source code analysis: the query process

This chapter begins the walkthrough of Lucene's query process, namely the IndexSearcher search function.
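Before diving into the internals, here is a minimal, hypothetical caller-side sketch of the entry point analyzed in this chapter; the index directory "indexDir", the field "body" and the term "lucene" are assumptions for illustration only.

  import java.nio.file.Paths;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.FSDirectory;

  public class SearchDemo {
    public static void main(String[] args) throws Exception {
      try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("indexDir")))) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // the entry point analyzed below: search(Query, int)
        TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("totalHits=" + top.totalHits);
      }
    }
  }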

IndexSearcher::search

  public TopDocs search(Query query, int n)
    throws IOException {
    return searchAfter(null, query, n);
  }
  public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOException {
    final int limit = Math.max(1, reader.maxDoc());
    numHits = Math.min(numHits, limit);
    final int cappedNumHits = Math.min(numHits, limit);
    final CollectorManager<TopScoreDocCollector, TopDocs> manager = new CollectorManager<TopScoreDocCollector, TopDocs>() {
      @Override
      public TopScoreDocCollector newCollector() throws IOException {
        ...
      }
      @Override
      public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
        ...
      }
    };
    return search(query, manager);
  }

The query parameter encapsulates the query statement, and n is the number of top results to return. The calculation at the start of searchAfter guarantees that the final number of hits does not exceed the total number of documents. A CollectorManager is then created, and the overloaded search function is called to continue.

IndexSearcher::search->searchAfter->search

  public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
    if (executor == null) {
      final C collector = collectorManager.newCollector();
      search(query, collector);
      return collectorManager.reduce(Collections.singletonList(collector));
    } else {

      ...

    }
  }

Assume the query runs single-threaded, so executor is null. CollectorManager's newCollector first creates a TopScoreDocCollector; each TopScoreDocCollector holds the final results of the search, and with a multi-threaded search the individual TopScoreDocCollectors are merged at the end.
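The two elided method bodies roughly amount to the following sketch, based on the Lucene 6.x sources; cappedNumHits and after are the local variables of searchAfter above, so treat this as illustrative rather than a verbatim copy.

  final CollectorManager<TopScoreDocCollector, TopDocs> manager =
      new CollectorManager<TopScoreDocCollector, TopDocs>() {
        @Override
        public TopScoreDocCollector newCollector() throws IOException {
          // one collector per thread/slice
          return TopScoreDocCollector.create(cappedNumHits, after);
        }

        @Override
        public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
          // merge the per-collector top-N lists into one global top-N
          final TopDocs[] topDocs = new TopDocs[collectors.size()];
          int i = 0;
          for (TopScoreDocCollector collector : collectors) {
            topDocs[i++] = collector.topDocs();
          }
          return TopDocs.merge(0, cappedNumHits, topDocs);
        }
      };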

IndexSearcher::search->searchAfter->search->CollectorManager::newCollector

   public TopScoreDocCollector newCollector() throws IOException {
     return TopScoreDocCollector.create(cappedNumHits, after);
   }

   public static TopScoreDocCollector create(int numHits, ScoreDoc after) {
    if (after == null) {
      return new SimpleTopScoreDocCollector(numHits);
    } else {
      return new PagingTopScoreDocCollector(numHits, after);
    }
  }

The after parameter implements pagination-like behavior; assume it is null here, so newCollector ends up returning a SimpleTopScoreDocCollector. Once the TopScoreDocCollector has been created, the overloaded search function is called to continue.
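For context, this is how application code typically uses the after parameter for paging; searcher and query are assumed to be the ones from the earlier sketch, and the page size of 10 is arbitrary.

  TopDocs page1 = searcher.search(query, 10);                   // first page
  ScoreDoc last = page1.scoreDocs[page1.scoreDocs.length - 1];  // last hit of page 1
  TopDocs page2 = searcher.searchAfter(last, query, 10);        // next page; a PagingTopScoreDocCollector is used internally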

IndexSearcher::search->searchAfter->search->search

  public void search(Query query, Collector results)
    throws IOException {
    search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
  }

leafContexts is the leaves member of the CompositeReaderContext: a list of LeafReaderContext objects, each wrapping one segment's SegmentReader, which can read all of that segment's information and data. Next, createNormalizedWeight performs the query matching and computes some basic weights for the later scoring phase.
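As a small illustration of this structure, the same per-segment list is reachable through IndexReader.leaves(); reader is the one from the earlier sketch.

  for (LeafReaderContext ctx : reader.leaves()) {
    // one LeafReaderContext per segment; ctx.reader() is that segment's SegmentReader
    System.out.println("docBase=" + ctx.docBase + " maxDoc=" + ctx.reader().maxDoc());
  }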

  public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
    query = rewrite(query);
    Weight weight = createWeight(query, needsScores);
    float v = weight.getValueForNormalization();
    float norm = getSimilarity(needsScores).queryNorm(v);
    if (Float.isInfinite(norm) || Float.isNaN(norm)) {
      norm = 1.0f;
    }
    weight.normalize(norm, 1.0f);
    return weight;
  }

First, the rewrite function rewrites the Query, for example removing unnecessary clauses and converting non-atomic queries into atomic ones.

rewrite

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->rewrite

  public Query rewrite(Query original) throws IOException {
    Query query = original;
    for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query;
         rewrittenQuery = query.rewrite(reader)) {
      query = rewrittenQuery;
    }
    return query;
  }

Each Query's rewrite function is called in a loop; the loop is needed because one rewrite can change the Query's structure and thereby expose parts that can be rewritten again. Below, assume the query is a BooleanQuery. A BooleanQuery does not hold an actual query term itself; it holds multiple sub-queries, each of which may be an indivisible query such as a TermQuery or another nested BooleanQuery, as in the example below.
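A hypothetical example of such a structure: a BooleanQuery whose clauses are TermQuery leaves plus one nested BooleanQuery (the field names and terms are made up).

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  public class QueryExample {
    static Query build() {
      BooleanQuery inner = new BooleanQuery.Builder()
          .add(new TermQuery(new Term("body", "index")), Occur.SHOULD)
          .add(new TermQuery(new Term("body", "search")), Occur.SHOULD)
          .build();
      return new BooleanQuery.Builder()
          .add(new TermQuery(new Term("body", "lucene")), Occur.MUST)
          .add(inner, Occur.MUST)
          .add(new TermQuery(new Term("body", "spam")), Occur.MUST_NOT)
          .build();
    }
  }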
Since BooleanQuery's rewrite function is fairly long, it is examined below in parts.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 1

  public Query rewrite(IndexReader reader) throws IOException {

    if (clauses.size() == 1) {
      BooleanClause c = clauses.get(0);
      Query query = c.getQuery();
      if (minimumNumberShouldMatch == 1 && c.getOccur() == Occur.SHOULD) {
        return query;
      } else if (minimumNumberShouldMatch == 0) {
        switch (c.getOccur()) {
          case SHOULD:
          case MUST:
            return query;
          case FILTER:
            return new BoostQuery(new ConstantScoreQuery(query), 0);
          case MUST_NOT:
            return new MatchNoDocsQuery();
          default:
            throw new AssertionError();
        }
      }
    }

    ...

  }

If the BooleanQuery contains only one sub-query, there is no need to keep the wrapper; the inner Query can be returned directly.
The minimumNumberShouldMatch member indicates how many SHOULD clauses must match at minimum. If the only clause is a SHOULD clause and a single match suffices, the wrapped Query is returned directly. With minimumNumberShouldMatch of 0, a MUST or SHOULD clause likewise returns the wrapped Query directly; a FILTER clause is wrapped in a ConstantScoreQuery inside a BoostQuery with boost 0 and returned; a MUST_NOT clause means the lone sub-query must not match any document, so a MatchNoDocsQuery is returned.
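A hypothetical illustration of the single-clause branch, reusing the searcher from the earlier sketch (minimumNumberShouldMatch is left at its default of 0):

  Query single = new BooleanQuery.Builder()
      .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
      .build();
  Query rewritten = searcher.rewrite(single);  // now the inner TermQuery, no BooleanQuery wrapper left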

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 2

  public Query rewrite(IndexReader reader) throws IOException {

    ...

    {
      BooleanQuery.Builder builder = new BooleanQuery.Builder();
      builder.setDisableCoord(isCoordDisabled());
      builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
      boolean actuallyRewritten = false;
      for (BooleanClause clause : this) {
        Query query = clause.getQuery();
        Query rewritten = query.rewrite(reader);
        if (rewritten != query) {
          actuallyRewritten = true;
        }
        builder.add(rewritten, clause.getOccur());
      }
      if (actuallyRewritten) {
        return builder.build();
      }
    }

    ...

  }

This part of rewrite iterates over all of the BooleanQuery's clauses and recursively calls rewrite on each. If any call returns a Query different from the original, some sub-query was rewritten, and a new BooleanQuery is produced through BooleanQuery.Builder's build function.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 3

  public Query rewrite(IndexReader reader) throws IOException {

    ...

    {
      int clauseCount = 0;
      for (Collection<Query> queries : clauseSets.values()) {
        clauseCount += queries.size();
      }
      if (clauseCount != clauses.size()) {
        BooleanQuery.Builder rewritten = new BooleanQuery.Builder();
        rewritten.setDisableCoord(disableCoord);
        rewritten.setMinimumNumberShouldMatch(minimumNumberShouldMatch);
        for (Map.Entry<Occur, Collection<Query>> entry : clauseSets.entrySet()) {
          final Occur occur = entry.getKey();
          for (Query query : entry.getValue()) {
            rewritten.add(query, occur);
          }
        }
        return rewritten.build();
      }
    }

    ...

  }

clauseSets stores the MUST_NOT and FILTER clauses in HashSets; the HashSet structure eliminates duplicate MUST_NOT and FILTER sub-queries.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 4

  public Query rewrite(IndexReader reader) throws IOException {

    ...

    if (clauseSets.get(Occur.MUST).size() > 0 && clauseSets.get(Occur.FILTER).size() > 0) {
      final Set<Query> filters = new HashSet<Query>(clauseSets.get(Occur.FILTER));
      boolean modified = filters.remove(new MatchAllDocsQuery());
      modified |= filters.removeAll(clauseSets.get(Occur.MUST));
      if (modified) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.setDisableCoord(isCoordDisabled());
        builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
        for (BooleanClause clause : clauses) {
          if (clause.getOccur() != Occur.FILTER) {
            builder.add(clause);
          }
        }
        for (Query filter : filters) {
          builder.add(filter, Occur.FILTER);
        }
        return builder.build();
      }
    }

    ...

  }

This part removes FILTER sub-queries that are also MUST sub-queries, and likewise removes a match-all-documents FILTER (the clause count is certainly greater than 1 at this point): the result set of a match-all query contains the result set of any other query.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 5

    {
      final Collection<Query> musts = clauseSets.get(Occur.MUST);
      final Collection<Query> filters = clauseSets.get(Occur.FILTER);
      if (musts.size() == 1
          && filters.size() > 0) {
        Query must = musts.iterator().next();
        float boost = 1f;
        if (must instanceof BoostQuery) {
          BoostQuery boostQuery = (BoostQuery) must;
          must = boostQuery.getQuery();
          boost = boostQuery.getBoost();
        }
        if (must.getClass() == MatchAllDocsQuery.class) {
          BooleanQuery.Builder builder = new BooleanQuery.Builder();
          for (BooleanClause clause : clauses) {
            switch (clause.getOccur()) {
              case FILTER:
              case MUST_NOT:
                builder.add(clause);
                break;
              default:
                break;
            }
          }
          Query rewritten = builder.build();
          rewritten = new ConstantScoreQuery(rewritten);

          builder = new BooleanQuery.Builder()
            .setDisableCoord(isCoordDisabled())
            .setMinimumNumberShouldMatch(getMinimumNumberShouldMatch())
            .add(rewritten, Occur.MUST);
          for (Query query : clauseSets.get(Occur.SHOULD)) {
            builder.add(query, Occur.SHOULD);
          }
          rewritten = builder.build();
          return rewritten;
        }
      }
    }

    return super.rewrite(reader);

If a MatchAllDocsQuery is the only MUST clause, it is rewritten as shown above. Finally, if nothing was rewritten, the parent class Query's rewrite is called, which simply returns the query itself.

Having covered BooleanQuery's rewrite function, here is a brief look at the rewrite functions of other Query types.
TermQuery's rewrite returns the query itself. SynonymQuery's rewrite checks whether it contains only one term; if so, it is converted into a TermQuery. WildcardQuery, PrefixQuery, RegexpQuery and FuzzyQuery all extend MultiTermQuery. WildcardQuery's rewrite returns a MultiTermQueryConstantScoreWrapper around the original query, PrefixQuery's rewrite likewise returns a MultiTermQueryConstantScoreWrapper, RegexpQuery behaves like PrefixQuery, and FuzzyQuery ultimately returns a BlendedTermQuery depending on the situation.
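For reference, these multi-term query types are constructed as follows (field names and patterns are hypothetical); their rewriting happens transparently inside IndexSearcher's rewrite.

  Query prefix   = new PrefixQuery(new Term("title", "luc"));       // title:luc*
  Query wildcard = new WildcardQuery(new Term("title", "lu*ne"));   // * and ? wildcards
  Query regexp   = new RegexpQuery(new Term("title", "luc.*"));     // regular expression
  Query fuzzy    = new FuzzyQuery(new Term("title", "lucene"), 1);  // edit distance <= 1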

Back in createNormalizedWeight: once the Query has been rewritten, createWeight performs the matching and computes the weights.

createWeight

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight

  public Weight createWeight(Query query, boolean needsScores) throws IOException {
    final QueryCache queryCache = this.queryCache;
    Weight weight = query.createWeight(this, needsScores);
    if (needsScores == false && queryCache != null) {
      weight = queryCache.doCache(weight, queryCachingPolicy);
    }
    return weight;
  }

The queryCache member of IndexSearcher is initialized to an LRUQueryCache. createWeight calls the createWeight function of the concrete Query; assume here that it is a BooleanQuery.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight

  public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
    BooleanQuery query = this;
    if (needsScores == false) {
      query = rewriteNoScoring();
    }
    return new BooleanWeight(query, searcher, needsScores, disableCoord);
  }

needsScores returns true by default for SimpleTopScoreDocCollector. createWeight creates a BooleanWeight and returns it.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight->BooleanWeight::BooleanWeight

  BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, boolean disableCoord) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    int i = 0;
    int maxCoord = 0;
    for (BooleanClause c : query) {
      Weight w = searcher.createWeight(c.getQuery(), needsScores && c.isScoring());
      weights.add(w);
      if (c.isScoring()) {
        maxCoord++;
      }
      i += 1;
    }
    this.maxCoord = maxCoord;

    coords = new float[maxCoord+1];
    Arrays.fill(coords, 1F);
    coords[0] = 0f;
    if (maxCoord > 0 && needsScores && disableCoord == false) {
      boolean seenActualCoord = false;
      for (i = 1; i < coords.length; i++) {
        coords[i] = coord(i, maxCoord);
        seenActualCoord |= (coords[i] != 1F);
      }
      this.disableCoord = seenActualCoord == false;
    } else {
      this.disableCoord = true;
    }
  }

getSimilarity returns the IndexSearcher's default BM25Similarity. The BooleanWeight constructor recursively calls createWeight to obtain a Weight for each sub-query; assume the sub-queries are TermQuery instances, whose createWeight function is examined below. maxCoord counts the scoring sub-queries, and the coords array at the end can influence a matched document's score via the coordination factor coord(overlap, maxCoord) = overlap / maxCoord.
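A hedged illustration of the coordination factor (this is the classic overlap/maxOverlap form; BM25Similarity's own coord effectively returns 1, which is why the loop above often ends up setting disableCoord to true):

  coord(overlap, maxCoord) = overlap / maxCoord
  // e.g. with maxCoord = 3 scoring clauses, a document matching 2 of them gets coord(2, 3) = 2/3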

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight

  public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
        || perReaderTermState.topReaderContext != context) {
      termState = TermContext.build(context, term);
    } else {
      termState = this.perReaderTermState;
    }

    return new TermWeight(searcher, needsScores, termState);
  }

getTopReaderContext returns the CompositeReaderContext wrapping the SegmentReaders.
perReaderTermState is null by default, so TermContext.build is called to perform the lookup and gather the term's information from the index; a TermWeight is then created from the resulting TermContext and returned.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build

  public static TermContext build(IndexReaderContext context, Term term)
      throws IOException {
    final String field = term.field();
    final BytesRef bytes = term.bytes();
    final TermContext perReaderTermState = new TermContext(context);
    for (final LeafReaderContext ctx : context.leaves()) {
      final Terms terms = ctx.reader().terms(field);
      if (terms != null) {
        final TermsEnum termsEnum = terms.iterator();
        if (termsEnum.seekExact(bytes)) { 
          final TermState termState = termsEnum.termState();
          perReaderTermState.register(termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
        }
      }
    }
    return perReaderTermState;
  }

Term's bytes function returns the query bytes, UTF-8 encoded by default. LeafReaderContext's reader function returns the SegmentReader, whose terms function returns the FieldReader used to read information from the index files.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->SegmentReader::terms

  public final Terms terms(String field) throws IOException {
    return fields().terms(field);
  }

  public final Fields fields() {
    return getPostingsReader();
  }

  public FieldsProducer getPostingsReader() {
    ensureOpen();
    return core.fields;
  }

  public Terms terms(String field) throws IOException {
    FieldsProducer fieldsProducer = fields.get(field);
    return fieldsProducer == null ? null : fieldsProducer.terms(field);
  }

core is the SegmentCoreReaders created in the SegmentReader constructor, and its fields member is the FieldsProducer provided by PerFieldPostingsFormat; fields.get ultimately returns the BlockTreeTermsReader that was configured when the index was written.
BlockTreeTermsReader's terms function finally returns the FieldReader for the requested field.

Back in TermContext's build function, the iterator function returns a SegmentTermsEnum and seekExact performs the lookup. If the term is found, SegmentTermsEnum's termState function returns an IntBlockTermState that holds all of the term's information (seekExact is analyzed in the next chapter). Finally, build saves the resulting IntBlockTermState through TermContext's register function.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->register

  public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
    register(state, ord);
    accumulateStatistics(docFreq, totalTermFreq);
  }

  public void register(TermState state, final int ord) {
    states[ord] = state;
  }

  public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
    this.docFreq += docFreq;
    if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
      this.totalTermFreq += totalTermFreq;
    else
      this.totalTermFreq = -1;
  }

The ord parameter identifies a unique IndexReaderContext, i.e. one segment. register stores the TermState (actually an IntBlockTermState) into the states array, and accumulateStatistics then updates the aggregate statistics.
Back in TermQuery's createWeight, a TermWeight is finally created and returned.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight

    public TermWeight(IndexSearcher searcher, boolean needsScores, TermContext termStates)
        throws IOException {
      super(TermQuery.this);
      this.needsScores = needsScores;
      this.termStates = termStates;
      this.similarity = searcher.getSimilarity(needsScores);

      final CollectionStatistics collectionStats;
      final TermStatistics termStats;
      if (needsScores) {
        collectionStats = searcher.collectionStatistics(term.field());
        termStats = searcher.termStatistics(term, termStates);
      } else {
        ...
      }

      this.stats = similarity.computeWeight(collectionStats, termStats);
    }

Broadly speaking, collectionStatistics gathers statistics about a field, while termStatistics gathers statistics about a term.
These two pieces of information are then passed to computeWeight to compute the weight. Each is examined below.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::collectionStatistics

  public CollectionStatistics collectionStatistics(String field) throws IOException {
    final int docCount;
    final long sumTotalTermFreq;
    final long sumDocFreq;

    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
      docCount = 0;
      sumTotalTermFreq = 0;
      sumDocFreq = 0;
    } else {
      docCount = terms.getDocCount();
      sumTotalTermFreq = terms.getSumTotalTermFreq();
      sumDocFreq = terms.getSumDocFreq();
    }
    return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
  }

getTerms works like the earlier analysis and ends up returning a FieldReader, from which the function reads docCount (the number of documents with the field), sumTotalTermFreq (the sum of all termFreq values, i.e. the total number of term occurrences), and sumDocFreq (the sum of all docFreq values, i.e. how many documents contain each term, summed over terms). A CollectionStatistics object wrapping these values is created and returned.
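A purely hypothetical illustration of these three statistics for a field with three documents:

  // doc0: "a b a"    doc1: "b c"    doc2: "a"
  // docCount         = 3  (documents that contain the field)
  // sumTotalTermFreq = 6  (total term occurrences: 3 + 2 + 1)
  // sumDocFreq       = 5  (docFreq summed over terms: a -> 2, b -> 2, c -> 1)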

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::termStatistics

  public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
    return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq());
  }

docFreq is the number of documents containing the term and totalTermFreq is the total number of occurrences of the term; a TermStatistics object is created from them and returned, and its constructor is trivial.

Back in the TermWeight constructor, similarity defaults to BM25Similarity, whose computeWeight function is shown below.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight

  public final SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);

    float avgdl = avgFieldLength(collectionStats);

    float cache[] = new float[256];
    for (int i = 0; i < cache.length; i++) {
      cache[i] = k1 * ((1 - b) + b * decodeNormValue((byte)i) / avgdl);
    }
    return new BM25Stats(collectionStats.field(), idf, avgdl, cache);
  }

idfExplain computes the idf, i.e. the inverse document frequency.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->idfExplain

  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    final float idf = idf(df, docCount);
    return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
  }

df is the number of documents containing the term and docCount is the total number of documents. BM25Similarity computes the idf as
log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)): the more documents a term occurs in, the less that term distinguishes a document and the smaller its contribution to the score.
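A worked example with made-up numbers:

  // docCount = 100, docFreq = 10
  // idf = ln(1 + (100 - 10 + 0.5) / (10 + 0.5)) = ln(9.62) ≈ 2.26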
Back in computeWeight, avgFieldLength computes the average number of terms per document.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->avgFieldLength

  protected float avgFieldLength(CollectionStatistics collectionStats) {
    final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
    if (sumTotalTermFreq <= 0) {
      return 1f;
    } else {
      final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
      return (float) (sumTotalTermFreq / (double) docCount);
    }
  }

avgFieldLength divides the total term frequency by the number of documents to get the average number of terms per document. Back in computeWeight, the BM25 coefficients are then precomputed (BM25 is the ranking function Lucene uses for scoring), and finally a BM25Stats object is created and returned.
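For reference, these precomputed values feed into the standard BM25 per-term score, which in Lucene (defaults k1 = 1.2, b = 0.75) has roughly the form below; the cache array above precomputes the k1 * (1 - b + b * dl / avgdl) part for each of the 256 possible encoded norm values.

  score(t, d) = idf(t) * boost * (k1 + 1) * freq / (freq + k1 * (1 - b + b * dl / avgdl))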

Back in createNormalizedWeight, getValueForNormalization is called next to compute the normalization value.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization

  public float getValueForNormalization() throws IOException {
    float sum = 0.0f;
    int i = 0;
    for (BooleanClause clause : query) {
      float s = weights.get(i).getValueForNormalization();
      if (clause.isScoring()) {
        sum += s;
      }
      i += 1;
    }

    return sum ;
  }

BooleanWeight's getValueForNormalization accumulates the values returned by the getValueForNormalization functions of the sub-queries. Assuming a sub-query is a TermQuery, its Weight is a TermWeight, whose getValueForNormalization is shown below.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization->TermWeight::getValueForNormalization

    public float getValueForNormalization() {
      return stats.getValueForNormalization();
    }

    public float getValueForNormalization() {
      return weight * weight;
    }

    public void normalize(float queryNorm, float boost) {
      this.boost = boost;
      this.weight = idf.getValue() * boost;
    } 

stats is the BM25Stats; its getValueForNormalization ultimately returns the square of the idf value multiplied by the boost.

Back in createNormalizedWeight, queryNorm simply returns 1, and normalize recomputes the weights from that norm. First, BooleanWeight's normalize function:

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::normalize

  public void normalize(float norm, float boost) {
    for (Weight w : weights) {
      w.normalize(norm, boost);
    }
  }

Assume the sub-query's Weight is a TermWeight.

IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->TermWeight::normalize

    public void normalize(float queryNorm, float boost) {
      stats.normalize(queryNorm, boost);
    }

    public void normalize(float queryNorm, float boost) {
      this.boost = boost;
      this.weight = idf.getValue() * boost;
    }

Back in IndexSearcher's search function: after createNormalizedWeight returns the Weight, the overloaded search function below is called.

IndexSearcher::search->searchAfter->search->search->search

  protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {

    for (LeafReaderContext ctx : leaves) {
      final LeafCollector leafCollector;
      try {
        leafCollector = collector.getLeafCollector(ctx);
      } catch (CollectionTerminatedException e) {
        // no documents of interest in this segment, continue with the next leaf
        continue;
      }
      BulkScorer scorer = weight.bulkScorer(ctx);
      if (scorer != null) {
        try {
          scorer.score(leafCollector, ctx.reader().getLiveDocs());
        } catch (CollectionTerminatedException e) {

        }
      }
    }
  }

As described in "Lucene source code analysis - 6", leaves is the list of LeafReaderContext objects wrapping the SegmentReaders, and collector is the SimpleTopScoreDocCollector.

IndexSearcher::search->searchAfter->search->search->search->SimpleTopScoreDocCollector::getLeafCollector

    public LeafCollector getLeafCollector(LeafReaderContext context)
        throws IOException {
      final int docBase = context.docBase;
      return new ScorerLeafCollector() {

        @Override
        public void collect(int doc) throws IOException {
          float score = scorer.score();
          totalHits++;
          if (score <= pqTop.score) {
            return;
          }
          pqTop.doc = doc + docBase;
          pqTop.score = score;
          pqTop = pq.updateTop();
        }

      };
    }

getLeafCollector creates and returns a ScorerLeafCollector, whose collect function keeps the current top-N hits in the priority queue pq: a document is inserted only if its score beats the current minimum (pqTop.score).

Back in search, the Weight's bulkScorer function is then called to obtain a BulkScorer, which is used to compute the scores.

bulkScorer

Assume the Weight created by createNormalizedWeight is a BooleanWeight; its bulkScorer function is shown below.

IndexSearcher::search->searchAfter->search->search->search->BooleanWeight::bulkScorer

  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    final BulkScorer bulkScorer = booleanScorer(context);
    if (bulkScorer != null) {
      return bulkScorer;
    } else {
      return super.bulkScorer(context);
    }
  }

bulkScorer first tries to build a scorer via booleanScorer; assuming that returns null, the parent class Weight's bulkScorer function is called and its result returned.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer

  public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

    Scorer scorer = scorer(context);
    if (scorer == null) {
      return null;
    }

    return new DefaultBulkScorer(scorer);
  }

The scorer function is overridden in BooleanWeight:

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer

  public Scorer scorer(LeafReaderContext context) throws IOException {
    int minShouldMatch = query.getMinimumNumberShouldMatch();

    List<Scorer> required = new ArrayList<>();
    List<Scorer> requiredScoring = new ArrayList<>();
    List<Scorer> prohibited = new ArrayList<>();
    List<Scorer> optional = new ArrayList<>();
    Iterator<BooleanClause> cIter = query.iterator();
    for (Weight w  : weights) {
      BooleanClause c =  cIter.next();
      Scorer subScorer = w.scorer(context);
      if (subScorer == null) {
        if (c.isRequired()) {
          return null;
        }
      } else if (c.isRequired()) {
        required.add(subScorer);
        if (c.isScoring()) {
          requiredScoring.add(subScorer);
        }
      } else if (c.isProhibited()) {
        prohibited.add(subScorer);
      } else {
        optional.add(subScorer);
      }
    }

    if (optional.size() == minShouldMatch) {
      required.addAll(optional);
      requiredScoring.addAll(optional);
      optional.clear();
      minShouldMatch = 0;
    }

    if (required.isEmpty() && optional.isEmpty()) {
      return null;
    } else if (optional.size() < minShouldMatch) {
      return null;
    }

    if (!needsScores && minShouldMatch == 0 && required.size() > 0) {
      optional.clear();
    }

    if (optional.isEmpty()) {
      return excl(req(required, requiredScoring, disableCoord), prohibited);
    }

    if (required.isEmpty()) {
      return excl(opt(optional, minShouldMatch, disableCoord), prohibited);
    }

    Scorer req = excl(req(required, requiredScoring, true), prohibited);
    Scorer opt = opt(optional, minShouldMatch, true);

    if (disableCoord) {
      if (minShouldMatch > 0) {
        return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), 1F);
      } else {
        return new ReqOptSumScorer(req, opt);          
      }
    } else if (optional.size() == 1) {
      if (minShouldMatch > 0) {
        return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), coord(requiredScoring.size()+1, maxCoord));
      } else {
        float coordReq = coord(requiredScoring.size(), maxCoord);
        float coordBoth = coord(requiredScoring.size() + 1, maxCoord);
        return new BooleanTopLevelScorers.ReqSingleOptScorer(req, opt, coordReq, coordBoth);
      }
    } else {
      if (minShouldMatch > 0) {
        return new BooleanTopLevelScorers.CoordinatingConjunctionScorer(this, coords, req, requiredScoring.size(), opt);
      } else {
        return new BooleanTopLevelScorers.ReqMultiOptScorer(req, opt, requiredScoring.size(), coords); 
      }
    }
  }

BooleanWeight's scorer function loops over the Weight of each sub-query and calls its scorer function; assume it is a TermWeight.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer

    public Scorer scorer(LeafReaderContext context) throws IOException {
      final TermsEnum termsEnum = getTermsEnum(context);
      PostingsEnum docs = termsEnum.postings(null, needsScores ? PostingsEnum.FREQS : PostingsEnum.NONE);
      return new TermScorer(this, docs, similarity.simScorer(stats, context));
    }

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->getTermsEnum

    private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
      final TermState state = termStates.get(context.ord);
      final TermsEnum termsEnum = context.reader().terms(term.field())
          .iterator();
      termsEnum.seekExact(term.bytes(), state);
      return termsEnum;
    }

First the previously computed TermState is retrieved; iterator returns a SegmentTermsEnum. SegmentTermsEnum's seekExact here mainly installs the previously obtained TermState; the details are left to the next chapter.

Back in TermWeight's scorer, the SegmentTermsEnum's postings function is called next, which ultimately calls Lucene50PostingsReader's postings function.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->Lucene50PostingsReader::postings

  public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, PostingsEnum reuse, int flags) throws IOException {

    boolean indexHasPositions = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    boolean indexHasOffsets = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
    boolean indexHasPayloads = fieldInfo.hasPayloads();

    if (indexHasPositions == false || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false) {
      BlockDocsEnum docsEnum;
      if (reuse instanceof BlockDocsEnum) {
        ...
      } else {
        docsEnum = new BlockDocsEnum(fieldInfo);
      }
      return docsEnum.reset((IntBlockTermState) termState, flags);
    } else if ((indexHasOffsets == false || PostingsEnum.featureRequested(flags, PostingsEnum.OFFSETS) == false) &&
               (indexHasPayloads == false || PostingsEnum.featureRequested(flags, PostingsEnum.PAYLOADS) == false)) {
        ...
    } else {
        ...
    }
  }

First the index options stored for the field are examined. Assume execution enters the first if branch; the reuse parameter is null by default, so a new BlockDocsEnum is created and initialized through its reset function.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->BM25Similarity::simScorer

  public final SimScorer simScorer(SimWeight stats, LeafReaderContext context) throws IOException {
    BM25Stats bm25stats = (BM25Stats) stats;
    return new BM25DocScorer(bm25stats, context.reader().getNormValues(bm25stats.field));
  }

  public final NumericDocValues getNormValues(String field) throws IOException {
    ensureOpen();
    Map<String,NumericDocValues> normFields = normsLocal.get();

    NumericDocValues norms = normFields.get(field);
    if (norms != null) {
      return norms;
    } else {
      FieldInfo fi = getFieldInfos().fieldInfo(field);
      if (fi == null || !fi.hasNorms()) {
        return null;
      }
      norms = getNormsReader().getNorms(fi);
      normFields.put(field, norms);
      return norms;
    }
  }

getNormsReader returns a Lucene53NormsProducer, whose getNorms function builds a NumericDocValues for the field from the .nvd and .nvm files. simScorer finally creates a BM25DocScorer and returns it.

Back in TermWeight's scorer, a TermScorer is finally created and returned.

Returning to BooleanWeight's scorer: if the number of SHOULD scorers equals minShouldMatch, then even though the clauses are SHOULD they all have to be satisfied, so they are moved into the required lists. Next, if both the required and optional lists are empty there is no query condition at all and null is returned; if there are fewer SHOULD scorers than minShouldMatch, the SHOULD requirement can never be met and null is returned as well. Further down, if optional is empty there are no SHOULD scorers, so req combines the MUST scorers and excl excludes the MUST_NOT scorers; conversely, if required is empty there are no MUST scorers, and opt combines the SHOULD scorers.
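To recap how the clause types map onto the four lists built at the top of scorer (a restatement of the code above):

  MUST     -> required + requiredScoring
  FILTER   -> required only (matching, never scored)
  SHOULD   -> optional
  MUST_NOT -> prohibited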

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->req

  private Scorer req(List<Scorer> required, List<Scorer> requiredScoring, boolean disableCoord) {
    if (required.size() == 1) {
      Scorer req = required.get(0);

      if (needsScores == false) {
        return req;
      }

      if (requiredScoring.isEmpty()) {
        return new FilterScorer(req) {
          @Override
          public float score() throws IOException {
            return 0f;
          }
          @Override
          public int freq() throws IOException {
            return 0;
          }
        };
      }

      float boost = 1f;
      if (disableCoord == false) {
        boost = coord(1, maxCoord);
      }
      if (boost == 1f) {
        return req;
      }
      return new BooleanTopLevelScorers.BoostedScorer(req, boost);
    } else {
      return new ConjunctionScorer(this, required, requiredScoring,
                                   disableCoord ? 1.0F : coord(requiredScoring.size(), maxCoord));
    }
  }

If there is more than one required scorer, a ConjunctionScorer is created and returned directly. With a single required scorer: if requiredScoring is empty, the lone required scorer does not need to produce a score and a FilterScorer returning 0 is used; otherwise a coord boost is computed and a BoostedScorer is returned when the boost differs from 1, or the scorer itself is returned unchanged.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->excl

  private Scorer excl(Scorer main, List<Scorer> prohibited) throws IOException {
    if (prohibited.isEmpty()) {
      return main;
    } else if (prohibited.size() == 1) {
      return new ReqExclScorer(main, prohibited.get(0));
    } else {
      float coords[] = new float[prohibited.size()+1];
      Arrays.fill(coords, 1F);
      return new ReqExclScorer(main, new DisjunctionSumScorer(this, prohibited, coords, false));
    }
  }

Depending on whether there are MUST_NOT scorers, excl either wraps the main scorer in a ReqExclScorer (which excludes the documents matched by the prohibited scorers) or returns it unchanged.

IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->opt

  private Scorer opt(List<Scorer> optional, int minShouldMatc