The Index Plan
To index the CSV, we want to take two fields from each row, title and description, and turn them into suitable terms. For straightforward textual search we don’t need document values.
Because we’re dealing with free text, and because we know the whole dataset is in English, we can use stemming so that, for instance, searches for “sundial” and “sundials” match the same documents. This way people don’t need to worry too much about exactly which word forms to use in their query.
Finally, we want a way of separating the two fields. In Xapian this is done using term prefixes: short strings placed at the beginning of a term to indicate which field the term indexes. As well as prefixed terms, we also want to generate unprefixed terms, so that you can search either within a specific field or for text in any field.
Some prefixes are used by convention, which is helpful if you ever need to interoperate with Omega (a web-based search engine) or other compatible systems. Following those conventions, we’ll use ‘S’ to prefix title (it stands for ‘subject’), and for description we’ll use ‘XD’. A full list of conventional prefixes is given at the top of the Omega documentation on term prefixes.
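To make the idea of prefixed and unprefixed terms concrete, here is a minimal pure-Python sketch. It is not how Xapian tokenises text, and it skips stemming entirely; the `words` and `terms_for_row` helpers and the dictionary keys are hypothetical names chosen for illustration. Only the ‘S’ and ‘XD’ prefixes come from the text above.

```python
import re

# Prefixes from the text: 'S' for title, 'XD' for description.
FIELD_PREFIXES = {"title": "S", "description": "XD"}


def words(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms_for_row(row):
    """Return the prefixed and unprefixed terms for one row (a dict)."""
    terms = set()
    for field, prefix in FIELD_PREFIXES.items():
        for word in words(row.get(field, "")):
            terms.add(prefix + word)  # matches only within this field
            terms.add(word)           # matches in any field
    return terms


row = {"title": "Sundial", "description": "A brass dial"}
print(sorted(terms_for_row(row)))
```

A search restricted to the title would look up `Ssundial`, while a general search would look up plain `sundial`.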
When you’re indexing multiple fields like this, the term positions used for each field when indexed unprefixed need to be kept apart. Say you have a title of “The Saints”, and description “Don’t like rabbits? Keep reading.” If you index those fields without a gap, the phrase search “Saints don’t like rabbits” will match, where it really shouldn’t. Usually a gap of 100 between each field is enough.
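The effect of the gap can be shown with a toy model of term positions. This is only a sketch under simple assumptions: the tokeniser, `assign_positions` and `phrase_matches` are hypothetical helpers written for this illustration, and the phrase check (consecutive positions) is a simplification rather than Xapian's actual phrase matching.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def assign_positions(fields, gap):
    """Index fields one after another; return {term: [positions]}."""
    postings = {}
    pos = 0
    for text in fields:
        for word in tokenize(text):
            pos += 1
            postings.setdefault(word, []).append(pos)
        pos += gap  # leave a hole in the positions before the next field
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase's words occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms:
        return False
    return any(
        all(start + i in postings.get(term, []) for i, term in enumerate(terms))
        for start in postings.get(terms[0], [])
    )


fields = ["The Saints", "Don't like rabbits? Keep reading."]
phrase = "Saints don't like rabbits"

print(phrase_matches(assign_positions(fields, gap=0), phrase))    # no gap
print(phrase_matches(assign_positions(fields, gap=100), phrase))  # gap of 100
```

Without a gap, “saints” and “don't” sit at adjacent positions, so the phrase spans the field boundary; with a gap of 100 they are far apart and the phrase no longer matches, while phrases inside a single field still do.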
To write to a database, we use the WritableDatabase class, which allows us to create, update or overwrite a database.
To create terms, we use Xapian’s TermGenerator, a built-in class that makes turning free text into terms easier. It splits text into words, applies stemming, and adds term prefixes as needed. It can also take care of term positions, including the gap between different fields.
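Putting the plan together, indexing might look roughly like the following sketch using the Xapian Python bindings. Several things here are assumptions rather than part of the text: the CSV column names `title` and `description`, the helper names `read_rows` and `index_rows`, and the choice to add each row as a new document rather than replace an existing one.

```python
import csv
import io


def read_rows(csv_text):
    """Yield (title, description) pairs from CSV text with a header row.

    The column names "title" and "description" are assumptions.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row.get("title") or "", row.get("description") or ""


def index_rows(rows, dbpath):
    """Index (title, description) pairs into the Xapian database at dbpath."""
    # Imported here so read_rows() stays usable without the Xapian bindings.
    import xapian

    # Create the database if it does not exist, otherwise open it for writing.
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)

    # One TermGenerator, with an English stemmer, reused for every document.
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    for title, description in rows:
        doc = xapian.Document()
        termgenerator.set_document(doc)

        # Prefixed terms, for searching within a single field.
        termgenerator.index_text(title, 1, "S")
        termgenerator.index_text(description, 1, "XD")

        # Unprefixed terms, for searching across all fields.
        termgenerator.index_text(title)
        termgenerator.increase_termpos()  # gap between fields (default 100)
        termgenerator.index_text(description)

        db.add_document(doc)

    db.commit()
```

With the bindings installed, a call such as `index_rows(read_rows(open("data.csv").read()), "db")` would build the index; the file and database names there are placeholders.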