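Lucene Source Code Analysis, Part 6
Lucene Source Code Analysis: Creating the IndexReader
This chapter begins the analysis of Lucene's query process. First, here is a piece of query code commonly used with Lucene 6: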
String indexPath;
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
IndexSearcher searcher = new IndexSearcher(reader);
ScoreDoc[] hits = null;
Query query = null ;
Analyzer analyzer = new SimpleAnalyzer();
try {
QueryParser qp = new QueryParser("body", analyzer);
query = qp.parse(words);
} catch (ParseException e) {
return null;
}
if (searcher != null) {
TopDocs results = searcher.search(query, 20 );
hits = results.scoreDocs;
Document document = null;
for (int i = 0; i < hits.length; i++) {
document = searcher.doc(hits[i].doc);
}
reader.close();
}
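indexPath is the path of the index directory. FSDirectory's open function was analyzed in earlier chapters; it returns one of MMapDirectory, SimpleFSDirectory, or NIOFSDirectory, and the rest of this chapter assumes NIOFSDirectory. DirectoryReader's open function is then called to create an IndexReader, as shown below: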
DirectoryReader::open
public static DirectoryReader open(final Directory directory) throws IOException {
return StandardDirectoryReader.open(directory, null);
}
static DirectoryReader open(final Directory directory, final IndexCommit commit) throws IOException {
return new SegmentInfos.FindSegmentsFile<DirectoryReader>(directory) {
...
}.run(commit);
}
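DirectoryReader's open function calls StandardDirectoryReader's open function, which in turn calls FindSegmentsFile's run function on an anonymous subclass; what is ultimately returned is a StandardDirectoryReader.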
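The shape of this call (an abstract helper whose run function picks a file and whose doBody is supplied by an anonymous subclass) can be sketched with plain Java. This is only an illustration under simplifying assumptions: FindFileSketch and OpenSketch are hypothetical names, the file list is passed in instead of being read from a Directory, and the newest file is chosen by plain string comparison rather than by generation.

```java
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for SegmentInfos.FindSegmentsFile<T>: run() picks the
// file, the subclass decides what to build from it.
abstract class FindFileSketch<T> {
    private final String[] fileNames; // stands in for directory.listAll()

    FindFileSketch(String... fileNames) {
        this.fileNames = fileNames.clone();
    }

    // Subclasses (usually anonymous) turn the chosen file name into a result.
    protected abstract T doBody(String segmentFileName) throws IOException;

    // Simplified run(): pick the lexicographically largest "segments_*" name.
    // Real Lucene compares base-36 generations and retries on IOException.
    public T run() throws IOException {
        String newest = Arrays.stream(fileNames)
                .filter(n -> n.startsWith("segments_"))
                .max(String::compareTo)
                .orElseThrow(() -> new IOException("no segments file found"));
        return doBody(newest);
    }
}

final class OpenSketch {
    // Same shape as the open function above: an anonymous subclass supplies
    // doBody, and run() drives it.
    static String open(String... fileNames) throws IOException {
        return new FindFileSketch<String>(fileNames) {
            @Override
            protected String doBody(String segmentFileName) {
                return "reader(" + segmentFileName + ")";
            }
        }.run();
    }
}
```

The point of the pattern is that the file-selection logic lives in one place, while each caller only decides what to construct from the selected file.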
DirectoryReader::open->FindSegmentsFile::run
public T run() throws IOException {
return run(null);
}
public T run(IndexCommit commit) throws IOException {
if (commit != null) {
...
}
long lastGen = -1;
long gen = -1;
IOException exc = null;
for (;;) {
lastGen = gen;
String files[] = directory.listAll();
String files2[] = directory.listAll();
Arrays.sort(files);
Arrays.sort(files2);
if (!Arrays.equals(files, files2)) {
continue;
}
gen = getLastCommitGeneration(files);
if (gen == -1) {
throw new IndexNotFoundException();
} else if (gen > lastGen) {
String segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);
try {
T t = doBody(segmentFileName);
return t;
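} catch (IOException err) {
// remember the first failure so it can be rethrown if no newer commit appears
if (exc == null) {
exc = err;
}
}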
} else {
throw exc;
}
}
}
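The generation lookup used by the loop above can be sketched without Lucene. The class and method names below are hypothetical; the sketch assumes that the suffix after "segments_" is a base-36 number and that the legacy segments.gen file is skipped, and it only formats names for positive generations.

```java
final class SegmentsGenSketch {
    static final String SEGMENTS = "segments";
    static final String OLD_SEGMENTS_GEN = "segments.gen"; // legacy file, skipped

    // Parses the generation from a name such as "segments_1" (base-36 suffix).
    static long generationFromSegmentsFileName(String fileName) {
        if (fileName.equals(SEGMENTS)) {
            return 0;
        }
        if (fileName.startsWith(SEGMENTS + "_")) {
            return Long.parseLong(fileName.substring(SEGMENTS.length() + 1), Character.MAX_RADIX);
        }
        throw new IllegalArgumentException("not a segments file: " + fileName);
    }

    // Returns the largest generation among the listed files, or -1 if none.
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String file : files) {
            if (file.startsWith(SEGMENTS) && !file.equals(OLD_SEGMENTS_GEN)) {
                max = Math.max(max, generationFromSegmentsFileName(file));
            }
        }
        return max;
    }

    // Builds the commit file name for a positive generation.
    static String segmentsFileName(long gen) {
        return SEGMENTS + "_" + Long.toString(gen, Character.MAX_RADIX);
    }
}
```

Because the suffix is base 36, generation 36 is written as segments_10, so comparing the names as strings would not give the newest commit; the numeric comparison is what matters.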
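Suppose the index directory contains the files segments_0, segments_1, and segments.gen. In the code above, getLastCommitGeneration returns 1, the largest generation suffix among the file names that start with "segments" (the suffix is parsed in base 36, and segments.gen is skipped), and fileNameFromGeneration returns segments_1. The most important step is doBody, which reads the segment and field information from the files into in-memory data structures. doBody is overridden by the anonymous FindSegmentsFile subclass created in the open function, and is defined as follows: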
DirectoryReader::open->FindSegmentsFile::run->doBody
protected DirectoryReader doBody(String segmentFileName) throws IOException {
SegmentInfos sis = SegmentInfos.readCommit(directory, segmentFileName);
final SegmentReader[] readers = new SegmentReader[sis.size()];
boolean success = false;
try {
for (int i = sis.size()-1; i >= 0; i--) {
readers[i] = new SegmentReader(sis.info(i), IOContext.READ);
}
DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, false, false);
success = true;
return reader;
} finally {
}
}
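doBody first calls SegmentInfos.readCommit to read the segment information into a SegmentInfos object, then creates a SegmentReader for each segment from that information. The SegmentReader constructor reads the field information of its segment and stores it in the SegmentReader's member variables. First, consider SegmentInfos's readCommit function: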
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentInfos::readCommit
public static final SegmentInfos readCommit(Directory directory, String segmentFileName) throws IOException {
long generation = generationFromSegmentsFileName(segmentFileName);
try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
return readCommit(directory, input, generation);
}
}
public ChecksumIndexInput openChecksumInput(String name, IOContext context) throws IOException {
return new BufferedChecksumIndexInput(openInput(name, context));
}
public IndexInput openInput(String name, IOContext context) throws IOException {
ensureOpen();
ensureCanRead(name);
Path path = getDirectory().resolve(name);
FileChannel fc = FileChannel.open(path, StandardOpenOption.READ);
return new NIOFSIndexInput("NIOFSIndexInput(path=\"" + path + "\")", fc, context);
}
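The idea behind a checksum input (every byte that is read also updates a running checksum, which can later be compared with the value stored in the file footer) can be illustrated with the JDK's own classes. This is a sketch under assumptions: ChecksumInputSketch is a hypothetical name, CheckedInputStream stands in for BufferedChecksumIndexInput, and CRC32 is assumed to be the checksum algorithm.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

final class ChecksumInputSketch {
    // Reads every byte through a CRC32-tracking wrapper and returns the checksum.
    static long checksumWhileReading(byte[] fileBytes) throws IOException {
        try (CheckedInputStream in =
                 new CheckedInputStream(new ByteArrayInputStream(fileBytes), new CRC32())) {
            byte[] buffer = new byte[8];
            while (in.read(buffer) != -1) {
                // a real reader would decode header fields and segment data here
            }
            return in.getChecksum().getValue();
        }
    }

    // Computes the checksum of the whole array in one step, for comparison.
    static long directChecksum(byte[] fileBytes) {
        CRC32 crc = new CRC32();
        crc.update(fileBytes, 0, fileBytes.length);
        return crc.getValue();
    }
}
```

The reader gets integrity checking as a side effect of normal parsing, which is why the footer check can be the last step of reading a file.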
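Suppose the segments file name passed in is segments_1; generationFromSegmentsFileName then returns 1. readCommit first creates a BufferedChecksumIndexInput through openChecksumInput to represent the file's input stream (the openInput function inside it creates the NIOFSIndexInput), and then reads the file contents from that stream through the three-argument readCommit overload: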
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentInfos::readCommit
public static final SegmentInfos readCommit(Directory directory, ChecksumIndexInput input, long generation) throws IOException {
int magic = input.readInt();
if (magic != CodecUtil.CODEC_MAGIC) {
throw new IndexFormatTooOldException();
}
int format = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_50, VERSION_CURRENT);
byte id[] = new byte[StringHelper.ID_LENGTH];
input.readBytes(id, 0, id.length);
CodecUtil.checkIndexHeaderSuffix(input, Long.toString(generation, Character.MAX_RADIX));
SegmentInfos infos = new SegmentInfos();
infos.id = id;
infos.generation = generation;
infos.lastGeneration = generation;
if (format >= VERSION_53) {
infos.luceneVersion = Version.fromBits(input.readVInt(), input.readVInt(), input.readVInt());
} else {
}
infos.version = input.readLong();
infos.counter = input.readInt();
int numSegments = input.readInt();
if (format >= VERSION_53) {
if (numSegments > 0) {
infos.minSegmentLuceneVersion = Version.fromBits(input.readVInt(), input.readVInt(), input.readVInt());
} else {
}
} else {
}
long totalDocs = 0;
for (int seg = 0; seg < numSegments; seg++) {
String segName = input.readString();
final byte segmentID[];
byte hasID = input.readByte();
if (hasID == 1) {
segmentID = new byte[StringHelper.ID_LENGTH];
input.readBytes(segmentID, 0, segmentID.length);
} else if (hasID == 0) {
} else {
}
Codec codec = readCodec(input, format < VERSION_53);
SegmentInfo info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ);
info.setCodec(codec);
totalDocs += info.maxDoc();
long delGen = input.readLong();
int delCount = input.readInt();
long fieldInfosGen = input.readLong();
long dvGen = input.readLong();
SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen, dvGen);
if (format >= VERSION_51) {
siPerCommit.setFieldInfosFiles(input.readSetOfStrings());
} else {
siPerCommit.setFieldInfosFiles(Collections.unmodifiableSet(input.readStringSet()));
}
final Map<Integer,Set<String>> dvUpdateFiles;
final int numDVFields = input.readInt();
if (numDVFields == 0) {
dvUpdateFiles = Collections.emptyMap();
} else {
Map<Integer,Set<String>> map = new HashMap<>(numDVFields);
for (int i = 0; i < numDVFields; i++) {
if (format >= VERSION_51) {
map.put(input.readInt(), input.readSetOfStrings());
} else {
map.put(input.readInt(), Collections.unmodifiableSet(input.readStringSet()));
}
}
dvUpdateFiles = Collections.unmodifiableMap(map);
}
siPerCommit.setDocValuesUpdatesFiles(dvUpdateFiles);
infos.add(siPerCommit);
Version segmentVersion = info.getVersion();
if (format < VERSION_53) {
if (infos.minSegmentLuceneVersion == null || segmentVersion.onOrAfter(infos.minSegmentLuceneVersion) == false) {
infos.minSegmentLuceneVersion = segmentVersion;
}
}
}
if (format >= VERSION_51) {
infos.userData = input.readMapOfStrings();
} else {
infos.userData = Collections.unmodifiableMap(input.readStringStringMap());
}
CodecUtil.checkFooter(input);
return infos;
}
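readCommit is long, but in summary it reads and sets the commit-wide values: id, generation, lastGeneration, luceneVersion, version, counter, minSegmentLuceneVersion, and userData.
For each segment it reads or sets the segment name, the segment ID, the number of deleted documents (delCount), the deletion generation (delGen), the field infos generation (fieldInfosGen), the doc values generation (dvGen), the set of field infos files, and the doc values update files. All of this is collected into the SegmentInfos that is returned.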
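The first steps of readCommit (check a magic number, check the codec name, check that the format version lies in a supported range) follow a pattern that can be sketched with DataInputStream. The sketch below is not Lucene's on-disk layout: HeaderCheckSketch is a hypothetical name, strings are written with writeUTF rather than Lucene's own encoding, and the MAGIC value is assumed to match CodecUtil.CODEC_MAGIC.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class HeaderCheckSketch {
    static final int MAGIC = 0x3fd76c17; // assumed to match CodecUtil.CODEC_MAGIC

    // Writes magic, codec name and version; the layout is simplified for illustration.
    static byte[] writeHeader(String codec, int version) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(MAGIC);
            out.writeUTF(codec); // Lucene uses its own string encoding instead
            out.writeInt(version);
        }
        return bytes.toByteArray();
    }

    // Validates the header and returns the stored format version.
    static int checkHeader(byte[] data, String codec, int minVersion, int maxVersion)
            throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            if (in.readInt() != MAGIC) {
                throw new IOException("bad magic: file is corrupt or too old");
            }
            String actual = in.readUTF();
            if (!actual.equals(codec)) {
                throw new IOException("codec mismatch: " + actual);
            }
            int version = in.readInt();
            if (version < minVersion || version > maxVersion) {
                throw new IOException("unsupported format version: " + version);
            }
            return version;
        }
    }
}
```

The returned version plays the role of the format variable in readCommit: later reads branch on it to stay compatible with older files.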
For each segment, segmentInfoFormat returns a Lucene50SegmentInfoFormat, whose read function reads the per-segment details and wraps them in a SegmentInfo. The code is as follows:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentInfos::readCommit->Lucene50SegmentInfoFormat::read
public SegmentInfo read(Directory dir, String segment, byte[] segmentID, IOContext context) throws IOException {
final String fileName = IndexFileNames.segmentFileName(segment, "", Lucene50SegmentInfoFormat.SI_EXTENSION);
try (ChecksumIndexInput input = dir.openChecksumInput(fileName, context)) {
Throwable priorE = null;
SegmentInfo si = null;
try {
int format = CodecUtil.checkIndexHeader(input, Lucene50SegmentInfoFormat.CODEC_NAME,
Lucene50SegmentInfoFormat.VERSION_START,
Lucene50SegmentInfoFormat.VERSION_CURRENT,
segmentID, "");
final Version version = Version.fromBits(input.readInt(), input.readInt(), input.readInt());
final int docCount = input.readInt();
final boolean isCompoundFile = input.readByte() == SegmentInfo.YES;
final Map<String,String> diagnostics;
final Set<String> files;
final Map<String,String> attributes;
if (format >= VERSION_SAFE_MAPS) {
diagnostics = input.readMapOfStrings();
files = input.readSetOfStrings();
attributes = input.readMapOfStrings();
} else {
diagnostics = Collections.unmodifiableMap(input.readStringStringMap());
files = Collections.unmodifiableSet(input.readStringSet());
attributes = Collections.unmodifiableMap(input.readStringStringMap());
}
si = new SegmentInfo(dir, version, segment, docCount, isCompoundFile, null, diagnostics, segmentID, attributes);
si.setFiles(files);
} catch (Throwable exception) {
priorE = exception;
} finally {
CodecUtil.checkFooter(input, priorE);
}
return si;
}
}
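This read function opens the .si file and reads version, docCount, isCompoundFile, diagnostics, files, and attributes from it, then creates a SegmentInfo that wraps them and returns it.
Back in FindSegmentsFile's doBody function: after readCommit has packed all the segment information from the file into a SegmentInfos, a SegmentReader is created for each segment, and its constructor reads the field information.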
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader
public SegmentReader(SegmentCommitInfo si, IOContext context) throws IOException {
this.si = si;
core = new SegmentCoreReaders(si.info.dir, si, context);
segDocValues = new SegmentDocValues();
boolean success = false;
final Codec codec = si.info.getCodec();
try {
if (si.hasDeletions()) {
liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
} else {
liveDocs = null;
}
numDocs = si.info.maxDoc() - si.getDelCount();
fieldInfos = initFieldInfos();
docValuesProducer = initDocValuesProducer();
success = true;
} finally {
}
}
si.info.dir is the directory that holds the index files. Start with the SegmentCoreReaders constructor, which is where the field information is read:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders
SegmentCoreReaders(Directory dir, SegmentCommitInfo si, IOContext context) throws IOException {
final Codec codec = si.info.getCodec();
final Directory cfsDir;
boolean success = false;
try {
if (si.info.getUseCompoundFile()) {
cfsDir = cfsReader = codec.compoundFormat().getCompoundReader(dir, si.info, context);
} else {
cfsReader = null;
cfsDir = dir;
}
coreFieldInfos = codec.fieldInfosFormat().read(cfsDir, si.info, "", context);
final SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si.info, coreFieldInfos, context);
final PostingsFormat format = codec.postingsFormat();
fields = format.fieldsProducer(segmentReadState);
if (coreFieldInfos.hasNorms()) {
normsProducer = codec.normsFormat().normsProducer(segmentReadState);
assert normsProducer != null;
} else {
normsProducer = null;
}
fieldsReaderOrig = si.info.getCodec().storedFieldsFormat().fieldsReader(cfsDir, si.info, coreFieldInfos, context);
if (coreFieldInfos.hasVectors()) {
termVectorsReaderOrig = si.info.getCodec().termVectorsFormat().vectorsReader(cfsDir, si.info, coreFieldInfos, context);
} else {
termVectorsReaderOrig = null;
}
if (coreFieldInfos.hasPointValues()) {
pointsReader = codec.pointsFormat().fieldsReader(segmentReadState);
} else {
pointsReader = null;
}
success = true;
} finally {
}
}
getUseCompoundFile reports whether the segment's files are packed into .cfs and .cfe files. If so, compoundFormat returns a Lucene50CompoundFormat, whose getCompoundReader function is then called:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders->Lucene50CompoundFormat::getCompoundReader
public Directory getCompoundReader(Directory dir, SegmentInfo si, IOContext context) throws IOException {
return new Lucene50CompoundReader(dir, si, context);
}
public Lucene50CompoundReader(Directory directory, SegmentInfo si, IOContext context) throws IOException {
this.directory = directory;
this.segmentName = si.name;
String dataFileName = IndexFileNames.segmentFileName(segmentName, "", Lucene50CompoundFormat.DATA_EXTENSION);
String entriesFileName = IndexFileNames.segmentFileName(segmentName, "", Lucene50CompoundFormat.ENTRIES_EXTENSION);
this.entries = readEntries(si.getId(), directory, entriesFileName);
boolean success = false;
long expectedLength = CodecUtil.indexHeaderLength(Lucene50CompoundFormat.DATA_CODEC, "");
for(Map.Entry<String,FileEntry> ent : entries.entrySet()) {
expectedLength += ent.getValue().length;
}
expectedLength += CodecUtil.footerLength();
handle = directory.openInput(dataFileName, context);
try {
CodecUtil.checkIndexHeader(handle, Lucene50CompoundFormat.DATA_CODEC, version, version, si.getId(), "");
CodecUtil.retrieveChecksum(handle);
success = true;
} finally {
if (!success) {
IOUtils.closeWhileHandlingException(handle);
}
}
}
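getCompoundReader creates a Lucene50CompoundReader. The Lucene50CompoundReader constructor reads the entry table from the .cfe file through readEntries and stores it in entries, then opens the .cfs data file and verifies its header.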
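A compound file can be pictured as one data blob plus a table that maps each logical file name to an offset and a length. The following sketch uses hypothetical names (CompoundFileSketch, Entry) and a byte array in place of the .cfs file; the header and footer lengths are passed in as parameters rather than taken from CodecUtil.

```java
import java.util.Arrays;
import java.util.Map;

final class CompoundFileSketch {
    // Offset and length of one logical file inside the data blob.
    static final class Entry {
        final long offset;
        final long length;

        Entry(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final byte[] data;                // stands in for the .cfs file
    private final Map<String, Entry> entries; // stands in for the .cfe table

    CompoundFileSketch(byte[] data, Map<String, Entry> entries) {
        this.data = data;
        this.entries = entries;
    }

    // Same bookkeeping as the constructor above: header + all entries + footer.
    static long expectedLength(Map<String, Entry> entries, long headerLength, long footerLength) {
        long expected = headerLength;
        for (Entry e : entries.values()) {
            expected += e.length;
        }
        return expected + footerLength;
    }

    // "Opens" a logical file by copying its slice out of the shared blob.
    byte[] openInput(String name) {
        Entry e = entries.get(name);
        if (e == null) {
            throw new IllegalArgumentException("no sub-file: " + name);
        }
        return Arrays.copyOfRange(data, (int) e.offset, (int) (e.offset + e.length));
    }
}
```

Because the reader behaves like a directory of its own, the rest of SegmentCoreReaders can open .fnm or postings files through cfsDir without caring whether the segment is compound.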
Back in the SegmentCoreReaders constructor. fieldInfosFormat returns a Lucene60FieldInfosFormat, whose read function reads the field information:
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders->Lucene60FieldInfosFormat::read
public FieldInfos read(Directory directory, SegmentInfo segmentInfo, String segmentSuffix, IOContext context) throws IOException {
final String fileName = IndexFileNames.segmentFileName(segmentInfo.name, segmentSuffix, EXTENSION);
try (ChecksumIndexInput input = directory.openChecksumInput(fileName, context)) {
Throwable priorE = null;
FieldInfo infos[] = null;
try {
CodecUtil.checkIndexHeader(input,
Lucene60FieldInfosFormat.CODEC_NAME,
Lucene60FieldInfosFormat.FORMAT_START,
Lucene60FieldInfosFormat.FORMAT_CURRENT,
segmentInfo.getId(), segmentSuffix);
final int size = input.readVInt();
infos = new FieldInfo[size];
Map<String,String> lastAttributes = Collections.emptyMap();
for (int i = 0; i < size; i++) {
String name = input.readString();
final int fieldNumber = input.readVInt();
byte bits = input.readByte();
boolean storeTermVector = (bits & STORE_TERMVECTOR) != 0;
boolean omitNorms = (bits & OMIT_NORMS) != 0;
boolean storePayloads = (bits & STORE_PAYLOADS) != 0;
final IndexOptions indexOptions = getIndexOptions(input, input.readByte());
final DocValuesType docValuesType = getDocValuesType(input, input.readByte());
final long dvGen = input.readLong();
Map<String,String> attributes = input.readMapOfStrings();
if (attributes.equals(lastAttributes)) {
attributes = lastAttributes;
}
lastAttributes = attributes;
int pointDimensionCount = input.readVInt();
int pointNumBytes;
if (pointDimensionCount != 0) {
pointNumBytes = input.readVInt();
} else {
pointNumBytes = 0;
}
try {
infos[i] = new FieldInfo(name, fieldNumber, storeTermVector, omitNorms, storePayloads,
indexOptions, docValuesType, dvGen, attributes,
pointDimensionCount, pointNumBytes);
infos[i].checkConsistency();
} catch (IllegalStateException e) {
}
}
} catch (Throwable exception) {
priorE = exception;
} finally {
CodecUtil.checkFooter(input, priorE);
}
return new FieldInfos(infos);
}
}
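This read function opens the .fnm file and reads the basic information about the fields. It iterates over all fields and reads, for each one, name (the field name), fieldNumber (the field's number), storeTermVector (whether term vectors are stored), omitNorms (whether norms are omitted), storePayloads (whether payloads are stored), indexOptions (how the field is indexed), docValuesType (the doc values type), dvGen (the doc values generation), attributes, pointDimensionCount, and pointNumBytes. Each field is wrapped in a FieldInfo, and the array is finally wrapped in a FieldInfos.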
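The three boolean properties are packed into a single byte and recovered with bit masks. A minimal sketch, assuming the mask values below mirror the constants in Lucene60FieldInfosFormat (FieldBitsSketch and pack are hypothetical names):

```java
final class FieldBitsSketch {
    // Assumed to mirror the flag constants used by the read function above.
    static final byte STORE_TERMVECTOR = 0x1;
    static final byte OMIT_NORMS = 0x2;
    static final byte STORE_PAYLOADS = 0x4;

    // Packs the three flags into one byte, as a writer would.
    static byte pack(boolean storeTermVector, boolean omitNorms, boolean storePayloads) {
        byte bits = 0;
        if (storeTermVector) bits |= STORE_TERMVECTOR;
        if (omitNorms) bits |= OMIT_NORMS;
        if (storePayloads) bits |= STORE_PAYLOADS;
        return bits;
    }

    // Each accessor tests exactly one bit, like the expressions in read.
    static boolean storeTermVector(byte bits) {
        return (bits & STORE_TERMVECTOR) != 0;
    }

    static boolean omitNorms(byte bits) {
        return (bits & OMIT_NORMS) != 0;
    }

    static boolean storePayloads(byte bits) {
        return (bits & STORE_PAYLOADS) != 0;
    }
}
```

Note that the flag is named omitNorms: a set bit means the field has no norms, which is the opposite of "norms are stored".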
Back in the SegmentCoreReaders constructor. The postingsFormat function that follows returns a PerFieldPostingsFormat, whose fieldsProducer function ultimately sets fields to a FieldsReader.
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->SegmentCoreReaders::SegmentCoreReaders->PerFieldPostingsFormat::fieldsProducer
public final FieldsProducer fieldsProducer(SegmentReadState state)
throws IOException {
return new FieldsReader(state);
}
normsFormat returns a Lucene53NormsFormat, whose normsProducer function returns a Lucene53NormsProducer, which is assigned to normsProducer.
public NormsProducer normsProducer(SegmentReadState state) throws IOException {
return new Lucene53NormsProducer(state, DATA_CODEC, DATA_EXTENSION, METADATA_CODEC, METADATA_EXTENSION);
}
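Continuing in the same way, fieldsReaderOrig ends up as a CompressingStoredFieldsReader, termVectorsReaderOrig as a CompressingTermVectorsReader, and pointsReader as a Lucene60PointsReader.
Back in the SegmentReader constructor. At this point all the segment and field information has been read. Next, if the segment has deletions, liveDocsFormat returns a Lucene50LiveDocsFormat, and its readLiveDocs function is called: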
DirectoryReader::open->FindSegmentsFile::run->doBody->SegmentReader::SegmentReader->Lucene50LiveDocsFormat::readLiveDocs
public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext context) throws IOException {
long gen = info.getDelGen();
String name = IndexFileNames.fileNameFromGeneration(info.info.name, EXTENSION, gen);
final int length = info.info.maxDoc();
try (ChecksumIndexInput input = dir.openChecksumInput(name, context)) {
Throwable priorE = null;
try {
CodecUtil.checkIndexHeader(input, CODEC_NAME, VERSION_START, VERSION_CURRENT,
info.info.getId(), Long.toString(gen, Character.MAX_RADIX));
long data[] = new long[FixedBitSet.bits2words(length)];
for (int i = 0; i < data.length; i++) {
data[i] = input.readLong();
}
FixedBitSet fbs = new FixedBitSet(data, length);
return fbs;
} catch (Throwable exception) {
priorE = exception;
} finally {
CodecUtil.checkFooter(input, priorE);
}
}
}
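readLiveDocs opens the .liv file, creates an input stream, and reads it into a FixedBitSet that marks which documents are still live; a cleared bit means the document has been deleted.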
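The live-docs structure is a plain bit set stored as an array of 64-bit words. A minimal sketch, with LiveDocsSketch as a hypothetical name and a raw long[] in place of FixedBitSet:

```java
final class LiveDocsSketch {
    // Number of 64-bit words needed to hold numBits bits.
    static int bits2words(int numBits) {
        return ((numBits - 1) >> 6) + 1;
    }

    // A set bit means the document is still live; a cleared bit means deleted.
    static boolean isLive(long[] words, int docId) {
        // For a long shift, Java uses only the low 6 bits of docId as the distance.
        return (words[docId >> 6] & (1L << docId)) != 0;
    }

    // Counts live documents among doc ids 0 .. maxDoc-1.
    static int countLive(long[] words, int maxDoc) {
        int live = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (isLive(words, doc)) {
                live++;
            }
        }
        return live;
    }
}
```

With maxDoc documents and delCount cleared bits, the live count equals maxDoc - delCount, which is the same arithmetic the SegmentReader constructor uses for numDocs.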
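Back in the SegmentReader constructor. The initFieldInfos function that follows assigns the coreFieldInfos held by SegmentCoreReaders to fieldInfos; if the segment has field updates, the field information is read again. initDocValuesProducer finally returns a FieldsReader.
Back again in FindSegmentsFile's doBody function, a StandardDirectoryReader is finally created and returned. The StandardDirectoryReader constructor itself is fairly simple.
Returning to the example, an IndexSearcher and a QueryParser are created next. Neither constructor contains much of interest, so they are not examined in detail here.
One point is worth noting, however: the IndexSearcher constructor calls StandardDirectoryReader's getContext function and then the leaves function. getContext comes first; it is defined in CompositeReader, an ancestor class of StandardDirectoryReader: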
StandardDirectoryReader::getContext
public final CompositeReaderContext getContext() {
ensureOpen();
if (readerContext == null) {
readerContext = CompositeReaderContext.create(this);
}
return readerContext;
}
ensureOpen makes sure the IndexReader has not been closed. The create function is then used to build the CompositeReaderContext:
CompositeReaderContext::create
static CompositeReaderContext create(CompositeReader reader) {
return new Builder(reader).build();
}
public CompositeReaderContext build() {
return (CompositeReaderContext) build(null, reader, 0, 0);
}
private IndexReaderContext build(CompositeReaderContext parent, IndexReader reader, int ord, int docBase) {
if (reader instanceof LeafReader) {
final LeafReader ar = (LeafReader) reader;
final LeafReaderContext atomic = new LeafReaderContext(parent, ar, ord, docBase, leaves.size(), leafDocBase);
leaves.add(atomic);
leafDocBase += reader.maxDoc();
return atomic;
} else {
final CompositeReader cr = (CompositeReader) reader;
final List<? extends IndexReader> sequentialSubReaders = cr.getSequentialSubReaders();
final List<IndexReaderContext> children = Arrays.asList(new IndexReaderContext[sequentialSubReaders.size()]);
final CompositeReaderContext newParent;
if (parent == null) {
newParent = new CompositeReaderContext(cr, children, leaves);
} else {
newParent = new CompositeReaderContext(parent, cr, ord, docBase, children);
}
int newDocBase = 0;
for (int i = 0, c = sequentialSubReaders.size(); i < c; i++) {
final IndexReader r = sequentialSubReaders.get(i);
children.set(i, build(newParent, r, i, newDocBase));
newDocBase += r.maxDoc();
}
assert newDocBase == cr.maxDoc();
return newParent;
}
}
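First, getSequentialSubReaders returns exactly the list of SegmentReader instances that FindSegmentsFile's doBody created, one per segment. A CompositeReaderContext is then created, and build is called recursively for each SegmentReader, with the result stored in children. Because SegmentReader extends LeafReader, each recursive call wraps the SegmentReader in a LeafReaderContext and adds it to the leaves list.
The leaves function therefore returns the list of LeafReaderContext objects that wrap the SegmentReaders.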
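The bookkeeping done by build for a single-level reader (each leaf gets an ordinal and a docBase equal to the total maxDoc of the leaves before it) can be sketched as follows. LeafContextSketch and Leaf are hypothetical names, and the linear lookup in leafFor is a simplification chosen for readability.

```java
import java.util.ArrayList;
import java.util.List;

final class LeafContextSketch {
    // Minimal stand-in for LeafReaderContext: position and starting doc id.
    static final class Leaf {
        final int ord;
        final int docBase;
        final int maxDoc;

        Leaf(int ord, int docBase, int maxDoc) {
            this.ord = ord;
            this.docBase = docBase;
            this.maxDoc = maxDoc;
        }
    }

    // Assigns each segment a docBase equal to the sum of maxDoc of earlier segments.
    static List<Leaf> buildLeaves(int... segmentMaxDocs) {
        List<Leaf> leaves = new ArrayList<>();
        int docBase = 0;
        for (int maxDoc : segmentMaxDocs) {
            leaves.add(new Leaf(leaves.size(), docBase, maxDoc));
            docBase += maxDoc;
        }
        return leaves;
    }

    // Finds the leaf that contains a top-level doc id.
    static Leaf leafFor(List<Leaf> leaves, int globalDoc) {
        for (Leaf leaf : leaves) {
            if (globalDoc >= leaf.docBase && globalDoc < leaf.docBase + leaf.maxDoc) {
                return leaf;
            }
        }
        throw new IllegalArgumentException("doc out of range: " + globalDoc);
    }
}
```

For segments with 3, 5, and 2 documents, top-level doc id 4 falls in the second leaf, and its segment-local id is 4 minus that leaf's docBase.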
The next chapter begins the analysis of QueryParser's parse function.