Inverted Index
阿新 • Published: 2017-07-23
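An inverted index, also commonly called a reverse index, postings file, or inverted file,
is an indexing method used in full-text search. It stores a mapping from each word to the
locations where that word occurs in a document or a set of documents.
It is the most commonly used data structure in document retrieval systems.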
For example, suppose the following texts are to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
The resulting inverted index can be written as follows:
"a" = {(2,2)}
"banana" = {(2,3)}
"is" = {(0,1),(0,4),(1,1),(2,1)}
"it" = {(0,0),(0,3),(1,2),(2,0)}
"what" = {(0,2),(1,0)}
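This is a full inverted index: each entry is a pair of (document number, position of the word within that document).
Both the document number and the word position are counted from zero.
So "banana" = {(2,3)} means that banana is the fourth word of the third document.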
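The construction described above can be sketched in a few lines. This is a minimal illustration in Python 3, assuming words are separated by whitespace and nothing is normalized (no lowercasing, no punctuation handling); `build_inverted_index` is a name chosen here for illustration and does not come from the original post.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Both doc_id and position start at zero, as in the text.
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(documents)

for word in sorted(index):
    print(word, index[word])
```

Looking up `index["banana"]` then gives the single pair for document 2, position 3, matching the table above.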
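===== Example:
DATA: stores the forward index (line number -> line text).
word_index: stores the inverted index. Each space-separated word is a key,
and the value is a list; list.append adds each position of that word in the text file in turn.
A word position is written as "<index within the line>.<line number>".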
#coding:utf-8
# Python 2 script (uses print statements); the text to index is read from stdin.
import sys

DATA = {}        # forward index: line number -> line text
word_index = {}  # inverted index: word -> list of "<index in line>.<line number>" strings


# Look up each query word in the inverted index and print the matching lines.
def check_index(sentence):
    query = sentence.split(' ')
    for v in query:
        if v in word_index:
            for index_lineno in word_index[v]:  # e.g. ['0.0', '2.1', '2.3']
                # The part after the dot is the line number.
                print DATA[int(index_lineno.split('.')[1])]


if __name__ == "__main__":
    # Build the inverted index.
    line_num = 0
    for line in sys.stdin:
        line = line.strip(' \r\n')
        fields = line.split(' ')
        DATA[line_num] = line
        for i, val in enumerate(fields):
            if val not in word_index:
                word_index[val] = []
            word_index[val].append(".".join([str(i), str(line_num)]))
        line_num += 1

    print word_index
    print DATA
    print "=====test query"
    queries = "it is example"
    # Note: in Python 2 this prints a tuple; the %s is never substituted.
    print ("####input search sentense:%s", queries)
    print "####search result is :"
    check_index(queries)
    print "done=========="
    sys.exit(0)
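The script stores each position as a string such as "3.0", which has to be split again on every lookup. As a small sketch in Python 3, the two hypothetical helpers below (`encode_position` and `decode_position` are not part of the original script) make that packing and unpacking explicit. They assume the same "<index in line>.<line number>" layout the script uses.

```python
def encode_position(word_idx, line_no):
    """Pack a word index and a line number into the "<index>.<line>" form."""
    return "{}.{}".format(word_idx, line_no)


def decode_position(entry):
    """Unpack an "<index>.<line>" string into (word index, line number)."""
    word_idx, line_no = entry.split(".")
    return int(word_idx), int(line_no)


entry = encode_position(3, 0)
print(entry)
print(decode_position(entry))
```

Storing `(word_idx, line_no)` tuples directly would avoid the string round trip; the string form is kept here only to match the script.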
=====
input.data text file:
it is what it is
what is it
it is a banana
from your second example
When I run the algo using some sample
What am I doing wrong ?
====== Output:
{'What': ['0.5'], 'doing': ['3.5'], 'is': ['1.0', '4.0', '1.1', '1.2'],
'some': ['6.4'], 'it': ['0.0', '3.0', '2.1', '0.2'], 'sample': ['7.4'],
'second': ['2.3'], 'your': ['1.3'], 'what': ['2.0', '0.1'], 'from': ['0.3'],
'banana': ['3.2'], '?': ['5.5'], 'run': ['2.4'], 'I': ['1.4', '2.5'],
'When': ['0.4'], 'wrong': ['4.5'], 'using': ['5.4'], 'a': ['2.2'],
'am': ['1.5'], 'algo': ['4.4'], 'the': ['3.4'], 'example': ['3.3']}
{0: 'it is what it is', 1: 'what is it', 2: 'it is a banana',
3: 'from your second example', 4: 'When I run the algo using some sample',
5: 'What am I doing wrong ?'}
=====test query
('####input search sentense:%s', 'it is example')
####search result is :
it is what it is
it is what it is
what is it
it is a banana
it is what it is
it is what it is
what is it
it is a banana
from your second example
done==========
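The search result above repeats some lines, because a line is printed once for every occurrence of every query word. One possible way to return each matching line only once is sketched below in Python 3. `search_unique` is a hypothetical helper, not part of the original script, and the small `data` and `word_index` dictionaries are copied by hand from the output above.

```python
def search_unique(query, word_index, data):
    """Return every line containing at least one query word, once, in line order."""
    line_numbers = set()
    for word in query.split(" "):
        for entry in word_index.get(word, []):
            # Entries look like "<index in line>.<line number>".
            line_numbers.add(int(entry.split(".")[1]))
    return [data[n] for n in sorted(line_numbers)]


data = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
    3: "from your second example",
}
word_index = {
    "it": ["0.0", "3.0", "2.1", "0.2"],
    "is": ["1.0", "4.0", "1.1", "1.2"],
    "example": ["3.3"],
}

results = search_unique("it is example", word_index, data)
for line in results:
    print(line)
```

Collecting line numbers in a set removes the duplicates, and sorting them keeps the output in file order.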