Stop words ignored by Lucene's English analyzer (StandardAnalyzer)
阿新 • Published: 2018-12-10
When querying a Lucene index, you may notice that some words are silently dropped by the analyzer. These dropped words are called stop words; in English they are typically articles, auxiliaries, and other function words that carry little standalone meaning for search.
Analyzers that filter stop words take the stop-word set as a constructor argument; if none is supplied, a default set is used. In StandardAnalyzer, StopAnalyzer, and ClassicAnalyzer, the default English stop words are:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is" , "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
If you want to use a custom stop-word list, you can pass it to Lucene's StopAnalyzer (note: the class is named StopAnalyzer, not "StopWordAnalyzer"):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;

// The default set plus a custom entry ("very").
public static final String[] self_stop_words = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with",
        "very"
};

// Recent Lucene versions take a CharArraySet rather than a String[]
// (older versions also required a Version argument), so convert first:
CharArraySet stopSet = StopFilter.makeStopSet(self_stop_words);
Analyzer analyzer = new StopAnalyzer(stopSet);