CharTokenizer: Tokenizing Western-Language Text
CharTokenizer is an abstract class that mainly handles tokenization of text in Western languages. In typical English text, spaces and punctuation act as separators, and tokenization splits the input at exactly those delimiter characters.
package org.apache.lucene.analysis;
import java.io.IOException;
import java.io.Reader;
// CharTokenizer is an abstract class
public abstract class CharTokenizer extends Tokenizer {

    public CharTokenizer(Reader input) {
        super(input);
    }

    private int offset = 0, bufferIndex = 0, dataLen = 0;
    private static final int MAX_WORD_LEN = 255;
    private static final int IO_BUFFER_SIZE = 1024;
    private final char[] buffer = new char[MAX_WORD_LEN];
    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];

    // Decides whether a character is part of a token; implemented by subclasses
    protected abstract boolean isTokenChar(char c);

    // Normalizes a character; subclasses of CharTokenizer may override this
    protected char normalize(char c) {
        return c;
    }

    // This is the core: returns the next token produced by tokenization
    public final Token next() throws IOException {
        int length = 0;
        int start = offset;
        while (true) {
            final char c;
            offset++;
            if (bufferIndex >= dataLen) {
                dataLen = input.read(ioBuffer);
                bufferIndex = 0;
            }
            if (dataLen == -1) {
                if (length > 0)
                    break;
                else
                    return null;
            } else
                c = ioBuffer[bufferIndex++];
            if (isTokenChar(c)) {                // if it's a token char
                if (length == 0)                 // start of token
                    start = offset - 1;
                buffer[length++] = normalize(c); // buffer it, normalized
                if (length == MAX_WORD_LEN)      // buffer overflow!
                    break;
            } else if (length > 0)               // at non-Letter w/ chars
                break;                           // return 'em
        }
        return new Token(new String(buffer, 0, length), start, start + length);
    }
}
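To see the loop in action outside of Lucene, here is a self-contained sketch that mirrors the buffering logic of next() in plain Java. The class name MiniLetterTokenizer is an invention for illustration, and for simplicity it returns a "word,start,end" String instead of a Token object; everything else follows the loop above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Stand-alone sketch of CharTokenizer's next() loop (no Lucene dependency):
// read characters through a small I/O buffer, collect runs of token chars,
// and emit each run together with its start/end offsets.
public class MiniLetterTokenizer {
    private final Reader input;
    private int offset = 0, bufferIndex = 0, dataLen = 0;
    private final char[] ioBuffer = new char[1024]; // mirrors IO_BUFFER_SIZE
    private final char[] buffer = new char[255];    // mirrors MAX_WORD_LEN

    public MiniLetterTokenizer(Reader input) {
        this.input = input;
    }

    // Same role as LetterTokenizer.isTokenChar(): letters only
    private boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // Returns "word,start,end" for the next token, or null at end of input
    public String next() {
        int length = 0;
        int start = offset;
        try {
            while (true) {
                final char c;
                offset++;
                if (bufferIndex >= dataLen) {   // refill the I/O buffer
                    dataLen = input.read(ioBuffer);
                    bufferIndex = 0;
                }
                if (dataLen == -1) {            // end of input
                    if (length > 0)
                        break;                  // emit the pending token
                    return null;                // nothing buffered: done
                }
                c = ioBuffer[bufferIndex++];
                if (isTokenChar(c)) {
                    if (length == 0)
                        start = offset - 1;     // remember token start
                    buffer[length++] = c;
                    if (length == buffer.length)
                        break;                  // buffer full: emit early
                } else if (length > 0) {
                    break;                      // separator ends the token
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer, 0, length) + "," + start + "," + (start + length);
    }

    public static void main(String[] args) {
        MiniLetterTokenizer t = new MiniLetterTokenizer(
                new StringReader("That's a world,I wonder why."));
        for (String tok; (tok = t.next()) != null; )
            System.out.println("(" + tok + ")");
        // prints (That,0,4) (s,5,6) (a,7,8) (world,9,14)
        //        (I,15,16) (wonder,17,23) (why,24,27), one per line
    }
}
```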
Three concrete classes implement CharTokenizer: LetterTokenizer, RussianLetterTokenizer, and WhitespaceTokenizer.
Let's look at the LetterTokenizer class first; the other two are likewise built on CharTokenizer, whose core is the next() method:
package org.apache.lucene.analysis;
import java.io.Reader;
// Splits a token whenever a non-letter character is read
public class LetterTokenizer extends CharTokenizer {

    public LetterTokenizer(Reader in) {
        super(in);
    }

    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}
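The only thing distinguishing LetterTokenizer from WhitespaceTokenizer is the isTokenChar() predicate: letters-only versus anything-but-whitespace. The following standalone sketch (TokenCharDemo is a made-up name, and it uses a simple in-memory scan instead of Lucene's buffered loop) applies both policies to the same string to show the difference:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates how swapping the isTokenChar() predicate changes tokenization:
// the LetterTokenizer policy splits on any non-letter, while the
// WhitespaceTokenizer policy splits only on whitespace.
public class TokenCharDemo {

    public interface CharPredicate {
        boolean test(char c);
    }

    // Collect maximal runs of characters accepted by isTokenChar
    public static List<String> tokenize(String s, CharPredicate isTokenChar) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (isTokenChar.test(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString()); // separator found: emit the run
                cur.setLength(0);
            }
        }
        if (cur.length() > 0)
            out.add(cur.toString());     // emit the trailing run, if any
        return out;
    }

    public static void main(String[] args) {
        String text = "That's a world,I wonder why.";
        // LetterTokenizer policy: only letters are token chars
        System.out.println(tokenize(text, Character::isLetter));
        // prints [That, s, a, world, I, wonder, why]
        // WhitespaceTokenizer policy: everything except whitespace is a token char
        System.out.println(tokenize(text, c -> !Character.isWhitespace(c)));
        // prints [That's, a, world,I, wonder, why.]
    }
}
```

Note how the apostrophe and the comma survive under the whitespace policy but split tokens under the letter policy.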
A quick test makes this visible:
package org.shirdrn.lucene;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.LetterTokenizer;
public class LetterTokenizerTest {

    public static void main(String[] args) {
        Reader reader = new StringReader("That's a world,I wonder why.");
        LetterTokenizer ct = new LetterTokenizer(reader);
        try {
            System.out.println(ct.next());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output is:
(That,0,4)
During tokenization, when the apostrophe is encountered, everything before it is returned as one token. To verify this, change the Reader construction to the following:
Reader reader = new StringReader("ThatisaworldIwonderwhy.");
Now the output is:
(ThatisaworldIwonderwhy,0,22)
An unbroken run of letters, with no non-letter characters in between, is returned as a single token. A token's length is capped at 255 characters, as defined in the CharTokenizer abstract class:
private static final int MAX_WORD_LEN = 255;
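To illustrate the effect of this cap: when a run of token characters reaches MAX_WORD_LEN, next() breaks out of its loop and emits the 255 buffered characters as a token, and the remaining characters start the next token. The following standalone sketch (MaxWordLenDemo is a made-up name; it reproduces the splitting behavior with a simple scan, no Lucene involved) shows a 300-letter run being cut into a 255-char token and a 45-char token:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the MAX_WORD_LEN cap: a run of letters longer than 255 characters
// is split into multiple tokens, because the tokenizer emits a token as soon
// as its fixed-size buffer fills up.
public class MaxWordLenDemo {

    static final int MAX_WORD_LEN = 255; // same constant as in CharTokenizer

    public static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                cur.append(c);
                if (cur.length() == MAX_WORD_LEN) { // buffer full: emit early
                    out.add(cur.toString());
                    cur.setLength(0);
                }
            } else if (cur.length() > 0) {
                out.add(cur.toString());            // separator ends the token
                cur.setLength(0);
            }
        }
        if (cur.length() > 0)
            out.add(cur.toString());                // trailing token, if any
        return out;
    }

    public static void main(String[] args) {
        String longWord = "a".repeat(300);          // 300 letters, no separators
        List<String> tokens = tokenize(longWord);
        System.out.println(tokens.size());          // prints 2
        System.out.println(tokens.get(0).length()); // prints 255
        System.out.println(tokens.get(1).length()); // prints 45
    }
}
```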