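Lucene Source Code Analysis, Part 7
Lucene Source Code Analysis: QueryParser's parse Function
This chapter analyzes the parse function of the QueryParser class, which is defined in its parent class QueryParserBase.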
QueryParserBase::parse
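public Query parse(String query) throws ParseException {
  ReInit(new FastCharStream(new StringReader(query)));
  try {
    // TopLevelQuery parses the whole input and returns the root of the query tree
    Query res = TopLevelQuery(field);
    return res!=null ? res : newBooleanQuery().build();
  } catch (ParseException | TokenMgrError tme) {
    // error handling omitted in this excerpt
  } catch (BooleanQuery.TooManyClauses tmc) {
    // error handling omitted in this excerpt
  }
}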
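parse first wraps the query string in a FastCharStream. FastCharStream implements the CharStream interface used by the JavaCC-generated parser; it keeps an internal buffer and makes it easy to read characters and move the read position. parse then calls ReInit to initialize the parser. ReInit, like the rest of QueryParser, is generated by JavaCC from the file org.apache.lucene.queryparser.classic.QueryParser.jj. JavaCC itself is well covered online and in books, so this post does not go into it.
The most important call inside parse is TopLevelQuery, which returns the top-level Query. TopLevelQuery builds a tree of Query objects from the query string. The field argument it receives is set when the parser is constructed and identifies the default field to search.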
QueryParserBase::parse->QueryParser::TopLevelQuery
final public Query TopLevelQuery(String field) throws ParseException {
  Query q;
  q = Query(field);
  jj_consume_token(0);
  {
    if (true) return q;
  }
  throw new Error();
}
The core of TopLevelQuery is the Query function. Because QueryParser is generated by JavaCC, we read the grammar in QueryParser.jj here rather than the generated code.
QueryParser.jj::Query
Query Query(String field) :
{
  List<BooleanClause> clauses = new ArrayList<BooleanClause>();
  Query q, firstQuery=null;
  int conj, mods;
}
{
  mods=Modifiers() q=Clause(field)
  {
    addClause(clauses, CONJ_NONE, mods, q);
    if (mods == MOD_NONE)
      firstQuery=q;
  }
  (
    conj=Conjunction() mods=Modifiers() q=Clause(field)
    { addClause(clauses, conj, mods, q); }
  )*
  {
    if (clauses.size() == 1 && firstQuery != null)
      return firstQuery;
    else {
      return getBooleanQuery(clauses);
    }
  }
}
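Modifiers returns the "+" or "-" that precedes a clause in the search string, and Conjunction returns the connective (AND or OR) between clauses. Query first obtains a subquery from Clause, then calls addClause to add it: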
QueryParserBase::addClause
protected void addClause(List<BooleanClause> clauses, int conj, int mods, Query q) {
  boolean required, prohibited;
  ...
  if (required && !prohibited)
    clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST));
  else if (!required && !prohibited)
    clauses.add(newBooleanClause(q, BooleanClause.Occur.SHOULD));
  else if (!required && prohibited)
    clauses.add(newBooleanClause(q, BooleanClause.Occur.MUST_NOT));
  else
    throw new RuntimeException("Clause cannot be both required and prohibited");
}
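The elided part of addClause computes required and prohibited from the conjunction conj and the modifier mods. The function then wraps the Query in a BooleanClause and appends it to the clauses list.
Back in the Query function: if clauses holds exactly one subquery and that subquery carried no modifier (so firstQuery is non-null), it is returned directly. Otherwise getBooleanQuery wraps all the subqueries and returns a BooleanQuery.
Next comes the Clause function, which creates one subquery: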
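The mapping from the two flags to an Occur value, and the shortcut for a single unmodified clause, can be modeled in a few lines. This is a simplified sketch and not Lucene's code: the class AddClauseSketch, the string representation of queries, and the constant values are hypothetical, and it assumes the flags depend on the modifier alone (the conjunction handling that the excerpt elides is left out).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the addClause / Query logic; not Lucene's actual code.
class AddClauseSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Modifier codes named after the grammar; the values here are arbitrary.
    static final int MOD_NONE = 0, MOD_NOT = 1, MOD_REQ = 2;

    // Assumption: required/prohibited are derived from the modifier only.
    static Occur occurFor(int mods) {
        boolean required = (mods == MOD_REQ);
        boolean prohibited = (mods == MOD_NOT);
        if (required && !prohibited) return Occur.MUST;
        if (!required && !prohibited) return Occur.SHOULD;
        if (!required && prohibited) return Occur.MUST_NOT;
        throw new RuntimeException("Clause cannot be both required and prohibited");
    }

    // Mirrors the end of the Query production: a single clause without a
    // modifier is returned as-is, anything else becomes a boolean query.
    static String build(String[] terms, int[] mods) {
        List<String> clauses = new ArrayList<>();
        String firstQuery = null;
        for (int i = 0; i < terms.length; i++) {
            clauses.add(occurFor(mods[i]) + ":" + terms[i]);
            if (i == 0 && mods[i] == MOD_NONE) firstQuery = terms[i];
        }
        if (clauses.size() == 1 && firstQuery != null) return firstQuery;
        return "BooleanQuery" + clauses;
    }

    public static void main(String[] args) {
        System.out.println(build(new String[]{"a"}, new int[]{MOD_NONE}));
        System.out.println(build(new String[]{"a", "b"}, new int[]{MOD_REQ, MOD_NOT}));
    }
}
```

Note that a single clause with a modifier, such as +a, still goes through the boolean-query path, because firstQuery is only set when mods == MOD_NONE.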
QueryParser.jj::Clause
Query Clause(String field) : {
  Query q;
  Token fieldToken=null, boost=null;
}
{
  [
    LOOKAHEAD(2)
    (
      fieldToken=<TERM> <COLON> {field=discardEscapeChar(fieldToken.image);}
      | <STAR> <COLON> {field="*";}
    )
  ]
  (
    q=Term(field)
    | <LPAREN> q=Query(field) <RPAREN> (<CARAT> boost=<NUMBER>)?
  )
  { return handleBoost(q, boost); }
}
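LOOKAHEAD(2) tells the parser to look two tokens ahead. If those tokens form a field prefix (a term followed by a colon, or *:), the search field is switched accordingly. The most important call in Clause is Term, which returns the leaf Query. Clause can also call back into Query to build a nested subquery from a parenthesized expression.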
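Two tokens are needed because a lone TERM could be either a field name or a search word; only the colon that follows settles it. The sketch below illustrates that decision on a pre-split token array. It is a toy illustration, not JavaCC output: the class LookaheadSketch, the method resolveField, and the string-array token representation are all hypothetical.

```java
// Toy illustration of the two-token lookahead in Clause; not JavaCC output.
class LookaheadSketch {
    // Returns the field that a clause starting at tokens[pos] should search.
    static String resolveField(String[] tokens, int pos, String defaultField) {
        // Inspect two tokens: a name (or "*") followed by a colon.
        boolean prefixed = pos + 1 < tokens.length
                && !tokens[pos].equals(":")
                && tokens[pos + 1].equals(":");
        return prefixed ? tokens[pos] : defaultField;
    }

    public static void main(String[] args) {
        System.out.println(resolveField(new String[]{"title", ":", "lucene"}, 0, "body"));
        System.out.println(resolveField(new String[]{"lucene"}, 0, "body"));
    }
}
```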
QueryParser.jj::Term
Query Term(String field) : {
  Token term, boost=null, fuzzySlop=null, goop1, goop2;
  boolean prefix = false;
  boolean wildcard = false;
  boolean fuzzy = false;
  boolean regexp = false;
  boolean startInc=false;
  boolean endInc=false;
  Query q;
}
{
  (
    (
      term=<TERM>
      | term=<STAR> { wildcard=true; }
      | term=<PREFIXTERM> { prefix=true; }
      | term=<WILDTERM> { wildcard=true; }
      | term=<REGEXPTERM> { regexp=true; }
      | term=<NUMBER>
      | term=<BAREOPER> { term.image = term.image.substring(0,1); }
    )
    [ fuzzySlop=<FUZZY_SLOP> { fuzzy=true; } ]
    [ <CARAT> boost=<NUMBER> [ fuzzySlop=<FUZZY_SLOP> { fuzzy=true; } ] ]
    {
      q = handleBareTokenQuery(field, term, fuzzySlop, prefix, wildcard, fuzzy, regexp);
    }
    | ( ( <RANGEIN_START> {startInc=true;} | <RANGEEX_START> )
        ( goop1=<RANGE_GOOP>|goop1=<RANGE_QUOTED> )
        [ <RANGE_TO> ]
        ( goop2=<RANGE_GOOP>|goop2=<RANGE_QUOTED> )
        ( <RANGEIN_END> {endInc=true;} | <RANGEEX_END>))
      [ <CARAT> boost=<NUMBER> ]
      {
        boolean startOpen=false;
        boolean endOpen=false;
        if (goop1.kind == RANGE_QUOTED) {
          goop1.image = goop1.image.substring(1, goop1.image.length()-1);
        } else if ("*".equals(goop1.image)) {
          startOpen=true;
        }
        if (goop2.kind == RANGE_QUOTED) {
          goop2.image = goop2.image.substring(1, goop2.image.length()-1);
        } else if ("*".equals(goop2.image)) {
          endOpen=true;
        }
        q = getRangeQuery(field, startOpen ? null : discardEscapeChar(goop1.image), endOpen ? null : discardEscapeChar(goop2.image), startInc, endInc);
      }
    | term=<QUOTED>
      [ fuzzySlop=<FUZZY_SLOP> ]
      [ <CARAT> boost=<NUMBER> ]
      { q = handleQuotedTerm(field, term, fuzzySlop); }
  )
  { return handleBoost(q, boost); }
}
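If a query contains neither quotation marks (QUOTED) nor range delimiters (RANGE, that is, square brackets or curly braces), handleBareTokenQuery will in most cases produce the query for a single term. That query is wrapped as a Clause, which in turn becomes part of a Query. Clause and Query nest within each other: a Query can contain several Clauses, and a Clause can itself open a new Query. The leaves of the resulting tree are the queries built by Term.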
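The mutual recursion between Query and Clause can be sketched as a small recursive-descent parser. This is a toy model, not the JavaCC-generated parser: the class MiniParser and its string output format are hypothetical, tokens (including parentheses and the + and - modifiers) must be separated by whitespace, and conjunctions, field prefixes, boosts, ranges, and quoted phrases are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy recursive-descent parser mirroring the Query -> Clause -> Term nesting.
class MiniParser {
    private final String[] tokens;
    private int pos = 0;

    MiniParser(String input) {
        // Assumption: every token, including "(", ")", "+", "-", is space-separated.
        String trimmed = input.trim();
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Query := (Modifier? Clause)*
    String query() {
        List<String> clauses = new ArrayList<>();
        boolean firstHasModifier = false;
        while (pos < tokens.length && !tokens[pos].equals(")")) {
            String mod = "";
            if (tokens[pos].equals("+") || tokens[pos].equals("-")) {
                mod = tokens[pos++];
            }
            if (clauses.isEmpty()) firstHasModifier = !mod.isEmpty();
            clauses.add(mod + clause());
        }
        // A single unmodified clause is returned directly, as in the grammar.
        if (clauses.size() == 1 && !firstHasModifier) return clauses.get(0);
        return "Bool(" + String.join(" ", clauses) + ")";
    }

    // Clause := Term | "(" Query ")"
    String clause() {
        if (pos >= tokens.length || tokens[pos].equals(")")) {
            throw new IllegalStateException("expected a term or '('");
        }
        String tok = tokens[pos++];
        if (tok.equals("(")) {
            String inner = query();            // a Clause opens a nested Query
            if (pos >= tokens.length || !tokens[pos].equals(")")) {
                throw new IllegalStateException("missing ')'");
            }
            pos++;
            return inner;
        }
        return "Term(" + tok + ")";            // leaf of the tree
    }

    static String parse(String input) {
        MiniParser p = new MiniParser(input);
        String q = p.query();
        if (p.pos != p.tokens.length) {
            throw new IllegalStateException("unexpected token: " + p.tokens[p.pos]);
        }
        return q;
    }

    public static void main(String[] args) {
        System.out.println(parse("a ( b c )"));
    }
}
```

The final check in parse plays the role of jj_consume_token(0) in TopLevelQuery: after the top-level query, nothing but the end of input may remain.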
QueryParserBase::handleBareTokenQuery
Query handleBareTokenQuery(String qfield, Token term, Token fuzzySlop, boolean prefix, boolean wildcard, boolean fuzzy, boolean regexp) throws ParseException {
  Query q;
  String termImage=discardEscapeChar(term.image);
  if (wildcard) {
    q = getWildcardQuery(qfield, term.image);
  } else if (prefix) {
    q = getPrefixQuery(qfield,
        discardEscapeChar(term.image.substring
            (0, term.image.length()-1)));
  } else if (regexp) {
    q = getRegexpQuery(qfield, term.image.substring(1, term.image.length()-1));
  } else if (fuzzy) {
    q = handleBareFuzzy(qfield, fuzzySlop, termImage);
  } else {
    q = getFieldQuery(qfield, termImage, false);
  }
  return q;
}
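For example, the query string AAA* is a prefix query, so the prefix argument is true; A*A is a wildcard query, so wildcard is true; AA~ is a fuzzy query, so fuzzy is true. Assume here that none of these flags is set and the input is an ordinary run of words. In that case getFieldQuery builds the Query, and the rest of this post concentrates on that function.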
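To make the flag combinations concrete, here is a rough classifier that maps a raw term string to the branch handleBareTokenQuery would take. It is only an approximation: in Lucene the token kinds are decided by the JavaCC lexer, not by string inspection, and escape characters are ignored here. The class BareTokenSketch and the Kind enum are hypothetical names.

```java
// Rough approximation of how a bare token ends up in one of the five branches.
class BareTokenSketch {
    enum Kind { WILDCARD, PREFIX, REGEXP, FUZZY, FIELD }

    static Kind classify(String text) {
        // A trailing "~", optionally followed by a number, marks a fuzzy term.
        int tilde = text.lastIndexOf('~');
        boolean fuzzy = tilde > 0 && text.substring(tilde + 1).matches("[0-9]*(\\.[0-9]+)?");
        String image = fuzzy ? text.substring(0, tilde) : text;

        boolean regexp = image.length() >= 2 && image.startsWith("/") && image.endsWith("/");
        int firstWild = firstWildcard(image);
        // Prefix: the only wildcard is a single '*' at the very end of a longer term.
        boolean prefix = !regexp && image.length() > 1
                && firstWild == image.length() - 1 && image.charAt(firstWild) == '*';
        boolean wildcard = !regexp && firstWild >= 0 && !prefix;

        // Same precedence as the if/else chain in handleBareTokenQuery.
        if (wildcard) return Kind.WILDCARD;
        if (prefix) return Kind.PREFIX;
        if (regexp) return Kind.REGEXP;
        if (fuzzy) return Kind.FUZZY;
        return Kind.FIELD;
    }

    // Index of the first '*' or '?', or -1 if there is none.
    static int firstWildcard(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '*' || c == '?') return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        for (String s : new String[]{"AAA*", "A*A", "AA~", "AA"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```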
QueryParserBase::handleBareTokenQuery->getFieldQuery
protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
  return newFieldQuery(getAnalyzer(), field, queryText, quoted);
}
protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted) throws ParseException {
  BooleanClause.Occur occur = operator == Operator.AND ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
  return createFieldQuery(analyzer, occur, field, queryText, quoted || autoGeneratePhraseQueries, phraseSlop);
}
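getAnalyzer returns the analyzer that was set in QueryParserBase's init function; to keep the analysis simple, assume it is a SimpleAnalyzer. quoted and autoGeneratePhraseQueries control whether a PhraseQuery is created. phraseSlop is the slop (the allowed positional distance between terms), which only PhraseQuery uses, so we ignore it here. Next, the createFieldQuery function.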
QueryParserBase::handleBareTokenQuery->getFieldQuery->newFieldQuery->QueryBuilder::createFieldQuery
protected final Query createFieldQuery(Analyzer analyzer, BooleanClause.Occur operator, String field, String queryText, boolean quoted, int phraseSlop) {
  try (TokenStream source = analyzer.tokenStream(field, queryText);
       CachingTokenFilter stream = new CachingTokenFilter(source)) {
    TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    PositionIncrementAttribute posIncAtt = stream.addAttribute(PositionIncrementAttribute.class);
    int numTokens = 0;
    int positionCount = 0;
    boolean hasSynonyms = false;
    stream.reset();
    // first pass: count tokens and distinct positions
    while (stream.incrementToken()) {
      numTokens++;
      int positionIncrement = posIncAtt.getPositionIncrement();
      if (positionIncrement != 0) {
        positionCount += positionIncrement;
      } else {
        // increment 0 means this token shares the previous token's position
        hasSynonyms = true;
      }
    }
    if (numTokens == 0) {
      return null;
    } else if (numTokens == 1) {
      return analyzeTerm(field, stream);
    } else if (quoted && positionCount > 1) {
      ...
    } else {
      if (positionCount == 1) {
        return analyzeBoolean(field, stream);
      } else {
        return analyzeMultiBoolean(field, stream, operator);
      }
    }
  } catch (IOException e) {
    // error handling omitted in this excerpt
  }
}
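The analyzer's tokenStream and incrementToken functions were covered in "Lucene Source Code Analysis, Part 4", so we go straight to the outcome. If numTokens == 1, the analyzer produced a single token, and analyzeTerm creates the final Query.
If there are several tokens and positionCount == 1, all of them sit at the same position, and analyzeBoolean creates the Query. In the remaining case there are several tokens and at least two of them sit at different positions, so analyzeMultiBoolean creates the Query. This post only analyzes analyzeMultiBoolean.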
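The branch selection depends only on the sequence of position increments the analyzer emits, so it can be modeled without Lucene. The sketch below is a simplified model, assuming the increments are already available as an int array; the class FieldQueryBranchSketch and the returned labels are hypothetical, and the phrase branch that the excerpt elides is represented by a placeholder label only.

```java
// Models which branch createFieldQuery takes for a given token stream.
class FieldQueryBranchSketch {
    static String chooseBranch(int[] positionIncrements, boolean quoted) {
        int numTokens = 0;
        int positionCount = 0;
        for (int inc : positionIncrements) {
            numTokens++;
            // An increment of 0 means the token shares the previous position.
            if (inc != 0) positionCount += inc;
        }
        if (numTokens == 0) return "null";
        if (numTokens == 1) return "analyzeTerm";
        if (quoted && positionCount > 1) return "phrase";   // elided in the excerpt
        if (positionCount == 1) return "analyzeBoolean";
        return "analyzeMultiBoolean";
    }

    public static void main(String[] args) {
        System.out.println(chooseBranch(new int[]{1}, false));
        System.out.println(chooseBranch(new int[]{1, 0}, false));
        System.out.println(chooseBranch(new int[]{1, 1}, false));
    }
}
```

For instance, increments {1, 0} describe two tokens at one position (a word and its synonym), while {1, 1} describe two consecutive words.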
QueryParserBase::handleBareTokenQuery->getFieldQuery->newFieldQuery->QueryBuilder::createFieldQuery->analyzeMultiBoolean
private Query analyzeMultiBoolean(String field, TokenStream stream, BooleanClause.Occur operator) throws IOException {
  BooleanQuery.Builder q = newBooleanQuery();
  List<Term> currentQuery = new ArrayList<>();
  TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
  PositionIncrementAttribute posIncrAtt = stream.getAttribute(PositionIncrementAttribute.class);
  stream.reset();
  while (stream.incrementToken()) {
    // a non-zero increment starts a new position: flush the previous group
    if (posIncrAtt.getPositionIncrement() != 0) {
      add(q, currentQuery, operator);
      currentQuery.clear();
    }
    currentQuery.add(new Term(field, termAtt.getBytesRef()));
  }
  // flush the last group
  add(q, currentQuery, operator);
  return q.build();
}
private void add(BooleanQuery.Builder q, List<Term> current, BooleanClause.Occur operator) {
  if (current.isEmpty()) {
    return;
  }
  if (current.size() == 1) {
    q.add(newTermQuery(current.get(0)), operator);
  } else {
    q.add(newSynonymQuery(current.toArray(new Term[current.size()])), operator);
  }
}
// BooleanQuery.Builder::add
public Builder add(Query query, Occur occur) {
  clauses.add(new BooleanClause(query, occur));
  return this;
}
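The analyzer's output is exposed through TermToBytesRefAttribute. analyzeMultiBoolean collects the Terms that share the same position into the list currentQuery. If a position holds only one Term, it is wrapped in a TermQuery; if it holds several, they are wrapped in a SynonymQuery. Each TermQuery or SynonymQuery is then wrapped in a BooleanClause and added to the clause list kept inside BooleanQuery.Builder. Finally, the builder's build function creates the resulting BooleanQuery from that list.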
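The grouping step can be reproduced on plain strings to see what the resulting clause list looks like. This is a minimal sketch under stated assumptions, not Lucene's implementation: the class MultiBooleanSketch is hypothetical, tokens are given as parallel arrays of text and position increment, and queries are represented as strings instead of TermQuery, SynonymQuery, and BooleanClause objects.

```java
import java.util.ArrayList;
import java.util.List;

// String-based model of the grouping done by analyzeMultiBoolean.
class MultiBooleanSketch {
    static List<String> group(String[] terms, int[] posIncs, String occur) {
        List<String> clauses = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            // A non-zero increment starts a new position: flush the old group.
            if (posIncs[i] != 0) {
                flush(clauses, current, occur);
                current.clear();
            }
            current.add(terms[i]);
        }
        flush(clauses, current, occur);   // flush the last group
        return clauses;
    }

    // One term at a position becomes a term clause, several become a synonym clause.
    static void flush(List<String> clauses, List<String> current, String occur) {
        if (current.isEmpty()) return;
        if (current.size() == 1) {
            clauses.add(occur + " Term(" + current.get(0) + ")");
        } else {
            clauses.add(occur + " Synonym(" + String.join(",", current) + ")");
        }
    }

    public static void main(String[] args) {
        // "quick" has increment 0, so it shares a position with "fast".
        System.out.println(group(new String[]{"fast", "quick", "car"},
                                 new int[]{1, 0, 1}, "SHOULD"));
    }
}
```

With the sample input, "fast" and "quick" end up in one synonym clause and "car" in its own term clause, both carrying the occur value that newFieldQuery derived from the default operator.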