Simple Interaction Between Python and Elasticsearch

阿新 • Published: 2021-02-09

Tags: Python

from elasticsearch import Elasticsearch


def op_es(action, info):
    """
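    Rough add/delete/edit helper. The Elasticsearch document id is info['id'], which plays the
    same role as a primary key in a relational database (a UUID, for example). When copying
    relational data into Elasticsearch, the database primary key can be used directly as the id.
    :param action: str, one of add/delete/edit, mirroring insert/delete/update on a relational database
    :param info: dict, comparable to one database record; in Elasticsearch the record is called a
                 doc (document) and each key of the dict is a field
    :return: None
    """
    es = Elasticsearch(hosts=['x.x.x.x'])  # connect to Elasticsearch; the port defaults to 9200
    es_index = 'xx-index'  # the index to operate on
    if action == 'delete':
        es.delete(index=es_index, id=info['id'])  # delete the doc with this id from the index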
        return None

    body = {'id': info['id'], 'name': info['name'], 'detail': info['detail']}
    if action == 'add':
        es.create(index=es_index, id=info['id'], body=body)  # create a new doc in the index; fails if the id already exists
    elif action == 'edit':
        es.update(index=es_index, id=info['id'], body={"doc": body})  # partially update an existing doc


def query_es(query_string):
    es = Elasticsearch(hosts=['x.x.x.x'])
    es_index = 'xx-index'

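    # filter context: every hit comes back with a _score of 0, and the search is usually faster.
    # Shown for comparison only; this result is overwritten by the second search below.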
    query = es.search(index=es_index, body={"query": {"bool": {
        "filter": [
            {
                "multi_match": {
                    "query": query_string,
                    'fields': ['name', 'detail']
                }
            }
        ],
    }}, 'size': 20})

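    # query context: every hit carries a computed _score, and the search is usually slower,
    # although the real cost depends on the data volume and on the exact DSL being run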
    query = es.search(index=es_index, body={"query": {
        'multi_match': {
            'query': query_string,
            'fields': ['name', 'detail']
        }
    }, 'size': 20})
    return {'data': [q['_source'] for q in query['hits']['hits']]}
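
query_es keeps only _source, so the _score difference between the two searches is not visible in its return value. The helper below is a minimal sketch for inspecting it. The name summarize_hits is hypothetical, and the sample response is hand-written to mimic the hits.hits layout of a search response; it is not real output.

```python
def summarize_hits(response):
    """Flatten a search response into a list of {id, score, source} dicts."""
    return [
        {"id": hit["_id"], "score": hit.get("_score"), "source": hit["_source"]}
        for hit in response["hits"]["hits"]
    ]


# Hand-written sample shaped like a search response (an assumption, not real output).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 0.0, "_source": {"name": "foo", "detail": "bar"}},
        ]
    }
}

print(summarize_hits(sample_response))
```

Passing the raw result of es.search to such a helper should make it easy to compare the scores of the filter and query variants side by side.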

Notes and background you may need

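If you have time, read the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html
If you are short on time, open the same documentation and skim the overview under each section for a quick introduction.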


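Query DSL: one of the ways to query Elasticsearch. It is a structured query expressed as a JSON request body; it is intuitive, simple, and widely used.
(Newer versions also have EQL, which was in beta at the time of writing: its code is less mature than the DSL, and it is provided without any guarantee.)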


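The Elasticsearch DSL has two contexts, Query and Filter:
(1) Query context: Elasticsearch computes a _score for every matching document; the higher
     the score, the better the match. Scoring is fairly involved (it uses Lucene's relevance
     scoring, which is TF/IDF-based; BM25 has been the default similarity since Elasticsearch 5.0),
     so it costs some time.
(2) Filter context: no relevance score is computed and the results are not ranked by
     relevance, so it is somewhat more efficient, and the results can be cached.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html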
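
In practice the two contexts are often combined in one bool query: full-text conditions go in the query context so they contribute to _score, and exact yes/no constraints go in the filter context. The following is a minimal sketch of such a request body. The function name build_search_body and the category field are hypothetical (the original index only shows id, name, and detail).

```python
import json


def build_search_body(text, category, size=20):
    """Build a search body that scores on text and filters on an exact value."""
    return {
        "query": {
            "bool": {
                # Query context: this clause contributes to _score.
                "must": [
                    {"multi_match": {"query": text, "fields": ["name", "detail"]}}
                ],
                # Filter context: match or no match only, no scoring.
                # "category" is a hypothetical keyword field.
                "filter": [{"term": {"category": category}}],
            }
        },
        "size": size,
    }


body = build_search_body("elasticsearch", "blog")
print(json.dumps(body, indent=2))
```

The resulting dict can be passed as body to es.search in the same way as in query_es.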


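Elasticsearch query types
(1) term: the search value is not analyzed and must match exactly; the target field is normally mapped as keyword
     Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
(2) match: full-text query; the query text is analyzed with the field's analyzer and the resulting terms are matched
     Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
(3) multi_match: the same full-text matching, run against several fields
     Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
(4) match_phrase: all the terms must be present, in the same order; the slop parameter adjusts how far apart they may be
     Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
(5) best_fields: a type option of multi_match (the default); the document's score comes from its best-matching field, so a document that matches well within a single field ranks higher
(6) range query: for example, slicing a time period with gte, lte, format, and so on
(7) bool query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
     must: the document must match the condition
     should: holds one or more conditions; when the bool query has no must or filter clause, the document must satisfy at least one of them
     must_not: the document must not match the condition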
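
The query types listed above can be sketched as small builder functions that return plain DSL dicts and are then combined in a bool query. The helper names and the created_at date field are hypothetical, and the term clause assumes that id is mapped as keyword.

```python
def term(field, value):
    """Exact match; the value is not analyzed."""
    return {"term": {field: value}}


def match(field, text):
    """Full-text match; the text is analyzed with the field's analyzer."""
    return {"match": {field: text}}


def match_phrase(field, text, slop=0):
    """All terms in order; slop allows that many positions of movement."""
    return {"match_phrase": {field: {"query": text, "slop": slop}}}


def date_range(field, gte, lte, fmt="yyyy-MM-dd"):
    """Inclusive date range, with an explicit date format."""
    return {"range": {field: {"gte": gte, "lte": lte, "format": fmt}}}


bool_query = {
    "query": {
        "bool": {
            "must": [match("detail", "python client")],
            # Optional here because a must clause is present; it only boosts the score.
            "should": [match_phrase("name", "quick start", slop=1)],
            "must_not": [term("id", "42")],
            # "created_at" is a hypothetical date field.
            "filter": [date_range("created_at", "2021-01-01", "2021-02-09")],
        }
    }
}
```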


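An Elasticsearch scroll query (a cursor query) is a mechanism for deep pagination. Ordinary from/size paging is capped at 10000 results by default (the index.max_result_window setting), and a search through the Python client returns only 10 hits unless size is set in the DSL.
Scroll lets you read a large number of documents without paying the cost of deep paging; it works against a snapshot of the data taken at a point in time.
To enable it, pass the scroll parameter with the search, set to how long the scroll context should be kept alive.
The expiry is refreshed by every scroll request, so it only needs to be long enough to process the current batch.
Keeping the scroll context open consumes resources, so release it as soon as the data has been processed.
scroll=1m keeps the scroll context open for one minute.
Note:
Every scroll response carries a _scroll_id field. Each subsequent scroll request
must pass the _scroll_id returned by the previous one. When a response comes back with no
hits, all matching documents have been processed.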
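
The scroll loop described above can be sketched as a generator. This is a minimal sketch, assuming a 7.x-style elasticsearch-py client object (search, scroll, and clear_scroll called with the keyword arguments shown); the name scroll_all is hypothetical.

```python
def scroll_all(es, index, body, scroll="1m", size=1000):
    """Yield every hit matching body, one batch of `size` at a time."""
    # The first request opens the scroll context and returns the first batch.
    response = es.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = response.get("_scroll_id")
    try:
        while response["hits"]["hits"]:
            for hit in response["hits"]["hits"]:
                yield hit
            # Pass the previous _scroll_id; this also refreshes the expiry.
            response = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context as early as possible.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)


# Usage sketch (needs a reachable cluster, so it is left as a comment):
# es = Elasticsearch(hosts=['x.x.x.x'])
# for hit in scroll_all(es, 'xx-index', {"query": {"match_all": {}}}):
#     print(hit['_source'])
```

The client library also ships elasticsearch.helpers.scan, which wraps the same loop and is probably the more convenient choice outside of a demonstration.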


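TF-IDF
TF (term frequency): how often a given term occurs in a given document.
IDF (inverse document frequency): a measure of how important a term is in general; the more documents contain the term, the lower its IDF.
For the formulas, see: https://zh.wikipedia.org/wiki/Tf-idf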
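
To make the two definitions concrete, here is a minimal sketch of the basic textbook variant: tf is the term count divided by the document length, and idf is log(N / df). It is only an illustration; the scoring that Lucene and Elasticsearch actually apply is more elaborate. The function name and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Textbook tf-idf of `term` in `doc`; documents are lists of tokens."""
    if not doc:
        return 0.0
    tf = Counter(doc)[term] / len(doc)
    # Number of documents in the corpus that contain the term.
    containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf


corpus = [
    ["python", "elasticsearch", "client"],
    ["python", "web", "framework"],
    ["search", "engine", "elasticsearch", "elasticsearch"],
]

# "client" occurs in one document, "python" in two, so "client" weighs more.
print(tf_idf("client", corpus[0], corpus))
print(tf_idf("python", corpus[0], corpus))
```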