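Lucene source code analysis (7)

Lucene source code analysis: the parse function of QueryParser

This chapter analyzes the parse function of the QueryParser class, which is defined in its parent class QueryParserBase.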
QueryParserBase::parse

  public Query parse(String query) throws ParseException {
    ReInit(new FastCharStream(new StringReader(query)));
    try {
      Query res = TopLevelQuery(field);
      return res!=null ? res : newBooleanQuery().build();
    } catch (ParseException | TokenMgrError tme) {
      // exception handling omitted in this excerpt
    } catch (BooleanQuery.TooManyClauses tmc) {
      // exception handling omitted in this excerpt
    }
  }

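parse first wraps the search string query in a FastCharStream. FastCharStream implements the CharStream interface used by the JavaCC-generated parser (it is not part of the JDK); it keeps an internal buffer and makes it easy to read characters and to move the read position. parse then calls ReInit to initialize the parser. ReInit, like the rest of QueryParser, is generated by JavaCC from the file org.apache.lucene.queryparser.classic.QueryParser.jj. The JavaCC background involved can be found online or in books, and this post does not focus on it.
The most important call in parse is TopLevelQuery, which returns the top-level Query. TopLevelQuery builds a tree of Query objects from the search string query. The field argument is set when the parser is constructed and identifies the default field to search.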
QueryParserBase::parse->QueryParser::TopLevelQuery

  final public Query TopLevelQuery(String field) throws ParseException {
    Query q;
    q = Query(field);
    jj_consume_token(0);
    {
      if (true) return q;
    }
    throw new Error();
  }

The key call inside TopLevelQuery is Query. Because QueryParser is generated by JavaCC, from here on we only look at the QueryParser.jj grammar file.

QueryParser.jj::Query

Query Query(String field) : {
  List<BooleanClause> clauses = new ArrayList<BooleanClause>();
  Query q, firstQuery=null;
  int conj, mods;
}
{
  mods=Modifiers() q=Clause(field)
  {
    addClause(clauses, CONJ_NONE, mods, q);
    if (mods == MOD_NONE)
      firstQuery=q;
  }
  (
    conj=Conjunction() mods=Modifiers() q=Clause(field)
    { addClause(clauses, conj, mods, q); }
  )*
  {
    if (clauses.size() == 1 && firstQuery != null)
      return firstQuery;
    else {
      return getBooleanQuery(clauses);
    }
  }
}

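Modifiers returns the modifier in front of a clause, such as "+" or "-", and Conjunction returns the connective that joins two clauses, such as AND or OR. Query first obtains a sub-query from Clause and then calls addClause to add it to the list.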

QueryParserBase::addClause

  protected void addClause(List<BooleanClause> clauses, int conj, int mods, Query q) {
    boolean required, prohibited;

    ...

    if (required && !prohibited)
      clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST));
    else if (!required && !prohibited)
      clauses.add(newBooleanClause(q, BooleanClause.Occur.SHOULD));
    else if (!required && prohibited)
      clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST_NOT));
    else
      throw new RuntimeException("Clause cannot be both required and prohibited");
  }

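The part omitted from addClause computes required and prohibited from the conjunction conj and the modifier mods. The Query is then wrapped in a BooleanClause and added to the clauses list.

Back in the Query function: if the clauses list holds exactly one sub-query and that sub-query carried no modifier (firstQuery is not null), it is returned directly. Otherwise getBooleanQuery wraps all the sub-queries and returns a BooleanQuery.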
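The two decisions described above can be sketched in plain Java. This is a simplified model, not Lucene code: ClauseSketch, its Occur enum, and the string stand-ins for queries are hypothetical names, and the elided computation of required and prohibited is deliberately left out.

```java
import java.util.List;

final class ClauseSketch {
    // Local stand-in for BooleanClause.Occur (assumption: only these three matter here).
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Same decision table as the tail of addClause in the excerpt.
    static Occur toOccur(boolean required, boolean prohibited) {
        if (required && !prohibited) return Occur.MUST;
        if (!required && !prohibited) return Occur.SHOULD;
        if (!required && prohibited) return Occur.MUST_NOT;
        throw new RuntimeException("Clause cannot be both required and prohibited");
    }

    // Mirrors the final action of the Query production: a single clause that
    // had no modifier (firstQuery != null) is returned as-is, everything else
    // is wrapped in a boolean query. Queries are modelled as strings.
    static String finish(List<String> clauses, String firstQuery) {
        if (clauses.size() == 1 && firstQuery != null) {
            return firstQuery;
        }
        return "BooleanQuery" + clauses;
    }
}
```

Under this model a lone term with no modifier comes back unwrapped, while a lone term that carried a modifier is still wrapped, because firstQuery stays null in that case.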

Next comes the Clause function, which creates one sub-query.
QueryParser.jj::Clause

Query Clause(String field) : {
  Query q;
  Token fieldToken=null, boost=null;
}
{
  [
    LOOKAHEAD(2)
    (
    fieldToken=<TERM> <COLON> {field=discardEscapeChar(fieldToken.image);}
    | <STAR> <COLON> {field="*";}
    )
  ]

  (
   q=Term(field)
   | <LPAREN> q=Query(field) <RPAREN> (<CARAT> boost=<NUMBER>)?

  )
    {  return handleBoost(q, boost); }
}

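LOOKAHEAD(2) tells the parser to look two tokens ahead. If they form a field prefix (a TERM followed by a COLON, or "*:"), the field to search is switched accordingly. The most important call in Clause is Term, which returns the final Query. Clause can also call Query recursively, for a parenthesized group, to produce a nested sub-query.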

QueryParser.jj::Term

Query Term(String field) : {
  Token term, boost=null, fuzzySlop=null, goop1, goop2;
  boolean prefix = false;
  boolean wildcard = false;
  boolean fuzzy = false;
  boolean regexp = false;
  boolean startInc=false;
  boolean endInc=false;
  Query q;
}
{
  (
     (
       term=<TERM>
       | term=<STAR> { wildcard=true; }
       | term=<PREFIXTERM> { prefix=true; }
       | term=<WILDTERM> { wildcard=true; }
       | term=<REGEXPTERM> { regexp=true; }
       | term=<NUMBER>
       | term=<BAREOPER> { term.image = term.image.substring(0,1); }
     )
     [ fuzzySlop=<FUZZY_SLOP> { fuzzy=true; } ]
     [ <CARAT> boost=<NUMBER> [ fuzzySlop=<FUZZY_SLOP> { fuzzy=true; } ] ]
     {
       q = handleBareTokenQuery(field, term, fuzzySlop, prefix, wildcard, fuzzy, regexp);
     }
     | ( ( <RANGEIN_START> {startInc=true;} | <RANGEEX_START> )
         ( goop1=<RANGE_GOOP>|goop1=<RANGE_QUOTED> )
         [ <RANGE_TO> ]
         ( goop2=<RANGE_GOOP>|goop2=<RANGE_QUOTED> )
         ( <RANGEIN_END> {endInc=true;} | <RANGEEX_END>))
       [ <CARAT> boost=<NUMBER> ]
        {
          boolean startOpen=false;
          boolean endOpen=false;
          if (goop1.kind == RANGE_QUOTED) {
            goop1.image = goop1.image.substring(1, goop1.image.length()-1);
          } else if ("*".equals(goop1.image)) {
            startOpen=true;
          }
          if (goop2.kind == RANGE_QUOTED) {
            goop2.image = goop2.image.substring(1, goop2.image.length()-1);
          } else if ("*".equals(goop2.image)) {
            endOpen=true;
          }
          q = getRangeQuery(field, startOpen ? null : discardEscapeChar(goop1.image), endOpen ? null : discardEscapeChar(goop2.image), startInc, endInc);
        }
     | term=<QUOTED>
       [ fuzzySlop=<FUZZY_SLOP> ]
       [ <CARAT> boost=<NUMBER> ]
       { q = handleQuotedTerm(field, term, fuzzySlop); }
  )
  { return handleBoost(q, boost); }
}

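If a query contains neither quotes (QUOTED) nor range delimiters (RANGE, that is, square brackets or curly braces), then in most cases handleBareTokenQuery ends up producing the Query for a single term. That Query becomes a Clause, and the Clauses are in turn combined into a Query. Clause and Query nest inside each other: a Query can contain several Clauses, and a Clause can itself start a new Query. The leaves of the resulting tree are the Queries built from Terms.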
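The mutual recursion between Query and Clause can be illustrated with a tiny hand-written recursive-descent parser. This is only a sketch under heavy assumptions: MiniQueryGrammar is a hypothetical class, it splits tokens on whitespace and parentheses, and it ignores modifiers, conjunctions, field prefixes, boosts, ranges and quoting, all of which the real grammar handles.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniQueryGrammar {
    private final String[] tokens;
    private int pos = 0;

    private MiniQueryGrammar(String input) {
        // Pad parentheses with spaces so a whitespace split yields the tokens.
        String padded = input.replace("(", " ( ").replace(")", " ) ").trim();
        this.tokens = padded.isEmpty() ? new String[0] : padded.split("\\s+");
    }

    // Query := Clause*   (a single clause is returned unwrapped)
    private String query() {
        List<String> clauses = new ArrayList<>();
        while (pos < tokens.length && !tokens[pos].equals(")")) {
            clauses.add(clause());
        }
        return clauses.size() == 1 ? clauses.get(0) : "BooleanQuery" + clauses;
    }

    // Clause := Term | "(" Query ")"   (this is where the nesting comes from)
    private String clause() {
        if (tokens[pos].equals("(")) {
            pos++; // consume "("
            String inner = query();
            if (pos >= tokens.length || !tokens[pos].equals(")")) {
                throw new IllegalArgumentException("missing closing parenthesis");
            }
            pos++; // consume ")"
            return inner;
        }
        return "Term(" + tokens[pos++] + ")"; // leaf of the tree
    }

    // Parses the whole input and renders the resulting tree as a string.
    static String parse(String input) {
        MiniQueryGrammar g = new MiniQueryGrammar(input);
        String result = g.query();
        if (g.pos != g.tokens.length) {
            throw new IllegalArgumentException("unexpected token: " + g.tokens[g.pos]);
        }
        return result;
    }
}
```

For instance, MiniQueryGrammar.parse("a (b c)") yields a boolean node whose second child is itself a boolean node over b and c, which is the tree shape described above.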

QueryParserBase::handleBareTokenQuery

  Query handleBareTokenQuery(String qfield, Token term, Token fuzzySlop, boolean prefix, boolean wildcard, boolean fuzzy, boolean regexp) throws ParseException {
    Query q;

    String termImage=discardEscapeChar(term.image);
    if (wildcard) {
      q = getWildcardQuery(qfield, term.image);
    } else if (prefix) {
      q = getPrefixQuery(qfield,
          discardEscapeChar(term.image.substring
              (0, term.image.length()-1)));
    } else if (regexp) {
      q = getRegexpQuery(qfield, term.image.substring(1, term.image.length()-1));
    } else if (fuzzy) {
      q = handleBareFuzzy(qfield, fuzzySlop, termImage);
    } else {
      q = getFieldQuery(qfield, termImage, false);
    }
    return q;
  }

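For example, the query string AAA* is a prefix query, so the prefix argument is true; A*A is a wildcard query, so wildcard is true; AA~ is a fuzzy query, so fuzzy is true. Assume here that none of the flags is set (regexp included) and the input is just ordinary words. The Query is then produced by getFieldQuery, which is the function this article concentrates on.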
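The order in which handleBareTokenQuery tests its flags matters when more than one is set, so it may help to see the dispatch in isolation. The sketch below is a stand-alone model with hypothetical names (BareTokenDispatch, dispatch, prefixText, regexpText). It returns the name of the method that would be called instead of building a Query, and it skips the discardEscapeChar step.

```java
final class BareTokenDispatch {
    // Same test order as handleBareTokenQuery in the excerpt:
    // wildcard, then prefix, then regexp, then fuzzy, else a plain field query.
    static String dispatch(boolean prefix, boolean wildcard, boolean fuzzy, boolean regexp) {
        if (wildcard) return "getWildcardQuery";
        if (prefix) return "getPrefixQuery";
        if (regexp) return "getRegexpQuery";
        if (fuzzy) return "handleBareFuzzy";
        return "getFieldQuery";
    }

    // A prefix term image such as "AAA*" loses its trailing '*'.
    static String prefixText(String image) {
        return image.substring(0, image.length() - 1);
    }

    // A regexp term image loses its first and last character (the delimiters).
    static String regexpText(String image) {
        return image.substring(1, image.length() - 1);
    }
}
```

With all four flags false the call falls through to getFieldQuery, which is the case followed in the rest of this article.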

QueryParserBase::handleBareTokenQuery->getFieldQuery

  protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
    return newFieldQuery(getAnalyzer(), field, queryText, quoted);
  }

  protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted)  throws ParseException {
    BooleanClause.Occur occur = operator == Operator.AND ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
    return createFieldQuery(analyzer, occur, field, queryText, quoted || autoGeneratePhraseQueries, phraseSlop);
  }

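getAnalyzer returns the analyzer that was set in QueryParserBase's init function; to keep the analysis simple, assume it is a SimpleAnalyzer. quoted and autoGeneratePhraseQueries indicate whether a PhraseQuery should be created. phraseSlop is the position slop, which only PhraseQuery uses, so it is ignored here. Next comes createFieldQuery.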
QueryParserBase::handleBareTokenQuery->getFieldQuery->newFieldQuery->QueryBuilder::createFieldQuery

  protected final Query createFieldQuery(Analyzer analyzer, BooleanClause.Occur operator, String field, String queryText, boolean quoted, int phraseSlop) {

    try (TokenStream source = analyzer.tokenStream(field, queryText);
         CachingTokenFilter stream = new CachingTokenFilter(source)) {

      TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
      PositionIncrementAttribute posIncAtt = stream.addAttribute(PositionIncrementAttribute.class);

      int numTokens = 0;
      int positionCount = 0;
      boolean hasSynonyms = false;

      stream.reset();
      while (stream.incrementToken()) {
        numTokens++;
        int positionIncrement = posIncAtt.getPositionIncrement();
        if (positionIncrement != 0) {
          positionCount += positionIncrement;
        } else {
          hasSynonyms = true;
        }
      }

      if (numTokens == 0) {
        return null;
      } else if (numTokens == 1) {
        return analyzeTerm(field, stream);
      } else if (quoted && positionCount > 1) {
        ...
      } else {
        if (positionCount == 1) {
          return analyzeBoolean(field, stream);
        } else {
          return analyzeMultiBoolean(field, stream, operator);
        }
      }
    } catch (IOException e) {

    }
  }

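The analyzer's tokenStream and incrementToken functions were covered in "Lucene source code analysis (4)". Looking straight at the outcome: if numTokens == 1, the analyzer produced a single token, and analyzeTerm creates the final Query.
If positionCount == 1, the tokens all sit at the same position, and analyzeBoolean creates the Query. The remaining case has several tokens with at least two at different positions, and analyzeMultiBoolean creates the Query. This article only analyzes analyzeMultiBoolean.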
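The counting loop and the branch selection in createFieldQuery can be replayed without an analyzer by feeding in the position increments directly. The following is a minimal sketch: TokenCountSketch and chooseBranch are hypothetical names, the int array stands in for the values that PositionIncrementAttribute would report, and the returned strings merely name the branch taken.

```java
final class TokenCountSketch {
    // positionIncrements[i] plays the role of posIncAtt.getPositionIncrement()
    // for the i-th token; an increment of 0 means "same position as before".
    static String chooseBranch(int[] positionIncrements, boolean quoted) {
        int numTokens = 0;
        int positionCount = 0;
        for (int inc : positionIncrements) {
            numTokens++;
            if (inc != 0) {
                positionCount += inc;
            }
        }

        if (numTokens == 0) return "null";
        if (numTokens == 1) return "analyzeTerm";
        if (quoted && positionCount > 1) return "phrase branch";
        if (positionCount == 1) return "analyzeBoolean";
        return "analyzeMultiBoolean";
    }
}
```

Two tokens with increments {1, 0} share one position and select analyzeBoolean, while {1, 1} places them at different positions and selects analyzeMultiBoolean when the text is not quoted.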
QueryParserBase::handleBareTokenQuery->getFieldQuery->newFieldQuery->QueryBuilder::createFieldQuery->analyzeMultiBoolean

  private Query analyzeMultiBoolean(String field, TokenStream stream, BooleanClause.Occur operator) throws IOException {
    BooleanQuery.Builder q = newBooleanQuery();
    List<Term> currentQuery = new ArrayList<>();

    TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      if (posIncrAtt.getPositionIncrement() != 0) {
        add(q, currentQuery, operator);
        currentQuery.clear();
      }
      currentQuery.add(new Term(field, termAtt.getBytesRef()));
    }
    add(q, currentQuery, operator);

    return q.build();
  }

  private void add(BooleanQuery.Builder q, List<Term> current, BooleanClause.Occur operator) {
    if (current.isEmpty()) {
      return;
    }
    if (current.size() == 1) {
      q.add(newTermQuery(current.get(0)), operator);
    } else {
      q.add(newSynonymQuery(current.toArray(new Term[current.size()])), operator);
    }
  }

  public Builder add(Query query, Occur occur) {
    clauses.add(new BooleanClause(query, occur));
    return this;
  }

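Each token produced by the analyzer is exposed through TermToBytesRefAttribute. analyzeMultiBoolean collects the different Terms that share a position into the list currentQuery. If a position has only one Term, it is wrapped in a TermQuery; if it has several, they are wrapped in a SynonymQuery. Each TermQuery or SynonymQuery is then wrapped in a BooleanClause and added to the BooleanClause list inside BooleanQuery.Builder. Finally, the builder's build function creates the final BooleanQuery from that list.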
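The grouping performed by analyzeMultiBoolean and add can be reproduced on plain arrays. This sketch makes several assumptions: MultiBooleanSketch, group and flush are hypothetical names, terms are plain strings instead of Term objects, and the resulting clauses are rendered as strings instead of real TermQuery or SynonymQuery instances.

```java
import java.util.ArrayList;
import java.util.List;

final class MultiBooleanSketch {
    // terms[i] is the i-th token text, posIncrements[i] its position increment.
    static List<String> group(String[] terms, int[] posIncrements) {
        List<String> clauses = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (posIncrements[i] != 0) {
                // A new position starts: emit what was collected so far.
                flush(clauses, current);
                current.clear();
            }
            current.add(terms[i]);
        }
        flush(clauses, current); // emit the terms of the last position
        return clauses;
    }

    // Same rule as add(): one term becomes a term query, several a synonym query.
    private static void flush(List<String> clauses, List<String> current) {
        if (current.isEmpty()) {
            return;
        }
        if (current.size() == 1) {
            clauses.add("TermQuery(" + current.get(0) + ")");
        } else {
            clauses.add("SynonymQuery(" + String.join(" ", current) + ")");
        }
    }
}
```

Calling group with the terms fast, quick, car and the increments 1, 0, 1 groups fast and quick into one synonym clause and leaves car as a separate term clause.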