Tech Notes Side Story: Building Your Own Search Framework with whoosh (Part 1)

阿新 • Published: 2018-12-20

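In earlier posts I complained about haystack on several fronts, which led me to drop it and implement search myself on top of the whoosh library. To keep the search feature reusable, I designed it as a plug-and-play Django app as well, which amounts to a small search framework of my own: blogsearchengine.

The framework currently serves our personal blog, hence the name blogsearchengine. As a general-purpose search framework, though, it is not limited to blog posts: it can index and search any other Django model the user configures, and it can update and filter the search scope according to user-specified conditions. blogsearchengine also ships two default search forms, so users can set search conditions the way they prefer. And although I criticized the design of haystack's View class earlier, blogsearchengine provides a default View class for displaying search results too.

blogsearchengine currently has three parts: (1) the search engine, searchengine; (2) two search forms; (3) a View class for search results. The searchengine class is the core of the framework and covers building the index, updating the index, and returning search results. The search forms comprise a basic form and a form with radio buttons: the former offers simple search, the latter lets the user search within a selected scope. The View class spares the user from writing a backend view; passing in a template file name is enough to get ready-made search results.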

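Here are the search page and the search results after adopting the blogsearchengine framework:

The search form is the radio-button variant, which searches within the scope the user selects.

These are the search results, with the keywords highlighted in bold.

This post starts with the core of blogsearchengine: the search engine, searchengine.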

I. The whoosh Search Library

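Before getting to the search engine, a quick introduction to whoosh is in order. whoosh is an indexing and search library implemented in Python. It provides a large set of functions and classes for indexing your own documents and for searching the indexed documents with given conditions. Unlike solr and elasticsearch, which are implemented in Java, whoosh is written in Python itself, so using it spares some environment setup.

whoosh has the following characteristics: (1) it is fast and written in pure Python, so no compiler is needed to install it; (2) it ranks results with BM25F by default, and the scoring is easy to customize; (3) its indexes are fairly small compared with many other search libraries; (4) it can store arbitrary Python objects alongside indexed documents.

whoosh's concepts are also simpler than those of solr and elasticsearch. Newcomers to search do not have to think about things like distributed deployment from the start, which makes it easier to pick up.

Given these advantages, I chose whoosh as the backend of this search framework.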

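II. Design and Implementation of searchengine

Since the goal is a general-purpose search framework, the design has to meet the following requirements:

1. Support indexing any Django model;

2. Let the user specify the directory where the index files are stored;

3. When updating the index, update according to changes in a user-specified condition;

4. Provide a search method that can search specified fields and return highlighted results.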

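Following haystack's example, we design it as a plug-and-play Django app, so the first step is to create the blogsearchengine app.

In the myblog directory, run the following command to create the app: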

python manage.py startapp blogsearchengine

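Then we create a new file, engine.py, inside the app and start implementing the search engine.

All four requirements are implemented in a single searchengine class, and index construction is encapsulated inside the class. Users of the framework therefore only deal with its search method and do not have to spend time building indexes themselves. Besides the search method, the class exposes interfaces for updating the index and for importing extra data, so users can update the index manually and add their own extra data to the search results.

Let's look at its constructor first: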

# blogsearchengine/searchengine.py
# ...
import os


class searchengine:

    def __init__(self, model, updatefield, indexpath=None, indexname=None, formatter=None):
        self.model = model
        self.indexpath = indexpath
        self.indexname = indexname
        self.updatefield = updatefield
        self.indexschema = {}
        # Default highlighter class (introduced later in this series)
        self.formatter = BlogFormatter
        # Resolve the directory and name of the index
        if self.indexpath is None:
            self.indexpath = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'engineindex/')
        if self.indexname is None:
            self.indexname = model.__name__
        if formatter is not None:
            self.formatter = formatter
        # Build the schema first, then the index that uses it
        self.__buildSchema()
        self.__buildindex()
# ...

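The constructor takes quite a few parameters. model and updatefield are required: the former is the Django model class to index, the latter is the field used as the basis for updating the index. indexpath and indexname are, as the names suggest, the directory where the index is stored and the name of the index. The last parameter, formatter, is the class used for highlighting; searchengine provides a default BlogFormatter class, which is covered later.

The constructor mainly assigns these member variables and settles the directory where the index is stored. The two calls at the end, __buildSchema and __buildindex, are the key functions that build the index for the model.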
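The defaulting rules for indexpath and indexname can be read in isolation. The sketch below mirrors them in a standalone function; resolve_index_location and its base_dir parameter are hypothetical names that stand in for the constructor and for the directory of the module file, and the example paths are made up.

```python
import os


def resolve_index_location(model_name, base_dir, indexpath=None, indexname=None):
    """Return (indexpath, indexname) after applying the same defaults as the constructor."""
    if indexpath is None:
        # Default: an 'engineindex' folder next to the module
        indexpath = os.path.join(base_dir, "engineindex")
    if indexname is None:
        # Default: the model's class name, so each model gets its own index
        indexname = model_name
    return indexpath, indexname


# No overrides: both defaults apply
print(resolve_index_location("Blog", "/srv/myblog/blogsearchengine"))
# Explicit values win over the defaults
print(resolve_index_location("Blog", "/srv/myblog/blogsearchengine",
                             indexpath="/var/index", indexname="posts"))
```

Using the model name as the default index name is what lets several models share one index directory without clashing.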

# blogsearchengine/searchengine.py
# ...
from django.db.models import (AutoField, BooleanField, CharField, DateField,
                              DateTimeField, FloatField, IntegerField)
from whoosh.fields import BOOLEAN, DATETIME, NUMERIC, STORED, TEXT, Schema
from whoosh.index import create_in, exists, exists_in
from whoosh.filedb.filestore import FileStorage
from ckeditor_uploader.fields import RichTextUploadingField
# ...
class searchengine:

    def __init__(self, model, updatefield, indexpath=None, indexname=None, formatter=None):
        ...  # body shown in the previous listing

    # Build the schema dict for the model
    def __buildSchema(self):
        self.indexschema = {}
        modelfields = self.model._meta.get_fields()
        for field in modelfields:
            # str(field) is the full path 'app.Model.fieldname'; keep only the field name
            name = str(field).split('.')[-1]
            if type(field) == CharField:
                self.indexschema[name] = TEXT(stored=True)
            elif type(field) == IntegerField:
                self.indexschema[name] = NUMERIC(stored=True, numtype=int)
            elif type(field) == FloatField:
                self.indexschema[name] = NUMERIC(stored=True, numtype=float)
            elif type(field) == DateField or type(field) == DateTimeField:
                self.indexschema[name] = DATETIME(stored=True)
            elif type(field) == BooleanField:
                self.indexschema[name] = BOOLEAN(stored=True)
            elif type(field) == AutoField:
                # Stored only, not searchable
                self.indexschema[name] = STORED()
            elif type(field) == RichTextUploadingField:
                self.indexschema[name] = TEXT(stored=True)

    def __buildindex(self):
        document_dic = {}
        # An empty schema means there is nothing to index
        if not self.indexschema:
            return False

        # Create the index directory if it does not exist yet
        if not os.path.exists(self.indexpath):
            os.makedirs(self.indexpath)

        modelSchema = Schema(**self.indexschema)
        # Only build the index when no index with this name exists
        if not exists_in(self.indexpath, indexname=self.indexname):
            ix = create_in(self.indexpath, modelSchema, indexname=self.indexname)
            print('index is created')
            writer = ix.writer()
            # Add every object of the model to the index, one document per object
            objectlist = self.model.objects.all()
            for obj in objectlist:
                for key in self.indexschema:
                    if hasattr(obj, key):
                        document_dic[key] = getattr(obj, key)
                writer.add_document(**document_dic)
                document_dic.clear()
            writer.commit()
            print('all blog has indexed')

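Let's look at __buildSchema. When whoosh creates an index, it needs a schema, which we build from a dict, that tells it which kind of index field to create for each field. We therefore walk over every field of the model and choose a suitable whoosh field type based on the field's type. The first requirement says any Django model must be supported, so we cannot hard-code the model's fields here; instead we call model._meta.get_fields() to obtain all fields of an arbitrary model and then assign the corresponding index field to each. Most common Django field types have a counterpart among whoosh's field types, so it is a matter of mapping them one to one; field types that the chain does not list are simply left out of the schema. For a field such as id, which only needs to be stored and not searched, we can use the STORED field type.

One thing to note: the string form of each field object returned by get_fields() is the full dotted path, including the app and the model (for example blogs.Blog.title). To keep the keys short, we take only the last segment, which is the field name (the same value as field.name).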

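With the schema ready, we can call __buildindex to build the actual index. __buildindex does two main things: (1) it creates the directory given by the user's indexpath (or the default indexpath); (2) it adds every object of the specified model to the index so that it can be searched later. whoosh also works with the notion of a collection of documents, so every object we index has to be turned into a whoosh document.

We create a Schema object with whoosh's Schema class, passing in the schema dict we just built. We then use exists_in to check whether an index with the given name already exists in the given directory, and create a new index only when it does not. The index is created with create_in from the Schema object and the index name. A freshly created index is like an empty document store with nothing in it. So we use a nested loop to copy each object's field names and values into document_dic, and then call the writer's add_document method to add it to the index.

Once every object of the model has been added, a single call to writer.commit() commits them to the index, and the index is initialized.

After initialization, how do we keep the index up to date as the data changes? I will cover the update part of searchengine in the next post. Stay tuned!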
