ElasticSearch Study Notes (2)

The ElasticSearch query_string Query

Background

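Preface: this is my second time publishing a technical blog post on CSDN. There is still a lot here I don't know how to use, so please bear with me. If anything is wrong, please point it out in the comments below.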

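I work on a B2C project at a cross-border e-commerce company, where I am responsible for the product search engine. A few days ago, while the boss was demoing product search, a problem came up: searching for 雪花秀 returned a large number of products unrelated to "雪花秀". These products shared one trait: each was related to one of the individual characters 雪, 花, or 秀. My guess was that the search text was being tokenized and that the resulting tokens were then used for matching. The product uses the ElasticSearch search engine, and the search request was as follows: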

{
  "size": 50,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "xxxx": {
              "query": "yyyy",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "boost": 1
            }
          }
        },
        {
          "query_string": {
            "query": "text for query",
            "fields": [],
            "default_operator": "or",
            "auto_generate_phrase_queries": false,
            "use_dis_max": true,
            "tie_breaker": 0,
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "split_on_whitespace": true,
            "boost": 1
          }
        }
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [],
    "excludes": [
      "categoryVO",
      "brandVO"
    ]
  },
  "sort": [
    {
      "isStock": {
        "order": "desc"
      }
    },
    {
      "saleQuantity": {
        "order": "desc"
      }
    },
    {
      "spuId": {
        "order": "desc"
      }
    }
  ]
}

How Query String Query Works

The official documentation describes it as follows: A query that uses a query parser in order to parse its content. Here is an example:

GET /_search
{
    "query": {
        "query_string" : {
            "default_field" : "content",
            "query" : "this AND that OR thus"
        }
    }
}

The query_string query parses the input and splits text around operators. Each textual part is analyzed independently of each other. For instance the following query:

GET /_search
{
    "query": {
        "query_string" : {
            "default_field" : "content",
            "query" : "(new york city) OR (big apple)"
        }
    }
}

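So this query type has its own syntax, which makes the root cause easy to track down: once you know the syntax rules and how the text is parsed, you can fix the problem in a targeted way.

Here is how query_string parses its input. The content of the "query" field is parsed into a series of terms and operators, and the terms are then matched according to the other settings inside "query_string". For example, entering "excellent work" in "query" is interpreted as "excellent OR work" (with the default operator OR) before matching. Likewise, "小米手機" is analyzed into "小米", "手機", and "小米手機" before matching. Note that how the text is broken up depends on the analyzer defined for the index; the default analyzer in my index is "ik_max_word". This explains why "雪花秀" matched unrelated products: it was broken into the three single characters "雪", "花", and "秀", and a hit on any one of them counted as a match. That is not what we need, so a small change is required.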
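The effect of the operator can be shown with a toy model in plain Python, with no cluster involved. The token list and product names below are hard-coded assumptions for illustration (not real ik_max_word output), and `matches` is a hypothetical helper, not an Elasticsearch API.

```python
# Toy model of how OR and AND combine analyzed tokens. Not Elasticsearch code.

def matches(doc_tokens, query_tokens, operator="OR"):
    """Return True if the document satisfies the query under the operator."""
    hits = [token in doc_tokens for token in query_tokens]
    return any(hits) if operator == "OR" else all(hits)

# Assumed analyzer output for the search text (illustrative only).
query_tokens = ["雪", "花", "秀"]

# Hypothetical products, already reduced to sets of indexed tokens.
docs = {
    "雪花秀 cream": {"雪", "花", "秀", "cream"},
    "雪 mask": {"雪", "mask"},
    "花 tea": {"花", "tea"},
}

or_hits = [name for name, toks in docs.items() if matches(toks, query_tokens, "OR")]
and_hits = [name for name, toks in docs.items() if matches(toks, query_tokens, "AND")]

print(or_hits)   # every product that shares at least one token
print(and_hits)  # only products that contain all tokens
```

Under OR, sharing a single character is enough to match, which is the behavior seen in the demo; under AND, only the product containing all three tokens remains.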

Parameter Reference

Parameter | Description | Paraphrase
query | The actual query to be parsed | The query text that actually gets parsed
default_field | The default field for query terms if no prefix field is specified. Defaults to the index.query.default_field index settings, which in turn defaults to *. * extracts all fields in the mapping that are eligible to term queries and filters the metadata fields. All extracted fields are then combined to build a query when no prefix field is provided | The field searched when a term carries no field prefix, i.e. the scope of the query. Per the description quoted here the default is *; earlier versions defaulted to _all
default_operator | The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR | Defaults to OR if not set: a search for capital of Hungary becomes capital OR of OR Hungary. If set to AND, it becomes capital AND of AND Hungary
analyzer | The analyzer name used to analyze the query string | Name of the analyzer used to analyze the query text
quote_analyzer | The name of the analyzer that is used to analyze quoted phrases in the query string. For those parts, it overrides other analyzers that are set using the analyzer parameter or the search_quote_analyzer setting | Analyzer for the quoted parts of the query; it overrides analyzers set elsewhere and has the highest priority
fuzziness | Set the fuzziness for fuzzy queries. Defaults to AUTO | Sets the rule for fuzzy matching
minimum_should_match | A value controlling how many "should" clauses in the resulting boolean query should match. It can be an absolute value (2), a percentage (30%) or a combination of both | Controls the minimum degree of matching; can be a number, a percentage, or a mix of both
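To make the minimum_should_match row concrete, here is a small sketch of the arithmetic for its two simplest forms. It assumes the rule stated in the official documentation that a percentage is rounded down; `required_matches` is a hypothetical helper, and negative values, combinations, and out-of-range results are deliberately left out.

```python
def required_matches(spec, optional_clauses):
    """Number of optional clauses that must match.

    Supports only a non-negative integer ("2") or a non-negative
    percentage ("30%"). Other forms raise ValueError in this sketch.
    """
    spec = spec.strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Integer division rounds the percentage result down.
        needed = optional_clauses * percent // 100
    else:
        needed = int(spec)
    if needed < 0:
        raise ValueError("negative values are not handled in this sketch")
    return needed

print(required_matches("2", 5))
print(required_matches("30%", 5))
```

With five optional clauses, "2" requires two of them, while "30%" works out to 1.5 and is rounded down to one.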

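The table above lists only the parameters I have used; for the rest, see the official Elasticsearch 6.4 documentation.

The requirement is that a hit on a single token should not count as a match, so we can set default_operator to AND, which requires every token to match. The query then becomes: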

{
  "size": 50,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "xxxx": {
              "query": "yyyy",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "boost": 1
            }
          }
        },
        {
          "query_string": {
            "query": "手",
            "fields": [
              "brand.productBrandDetails.brandName^0.0",
              "detail.spuName^0.0"
            ],
            "use_dis_max": true,
            "tie_breaker": 0,
            "default_operator": "and",
            "auto_generate_phrase_queries": false,
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "split_on_whitespace": true,
            "boost": 1
          }
        }
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [],
    "excludes": []
  },
  "sort": []
}

With that, the problem was solved.
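For readers who assemble the request in code, the following is a minimal sketch of building just the query_string clause with the operator made explicit. `build_query_string_clause` is a hypothetical helper; the field names are taken from the query above with the boost suffixes dropped, and the accompanying match clause is omitted for brevity. Sending the request is out of scope here.

```python
import json

def build_query_string_clause(text, fields, default_operator="and"):
    """Build only the query_string clause of the request body."""
    operator = default_operator.lower()
    if operator not in ("and", "or"):
        raise ValueError("default_operator must be 'and' or 'or'")
    return {
        "query_string": {
            "query": text,
            "fields": list(fields),
            "default_operator": operator,
        }
    }

clause = build_query_string_clause(
    "雪花秀",
    ["brand.productBrandDetails.brandName", "detail.spuName"],
)

# Wrap the clause the same way the post does: bool -> must.
body = {"size": 50, "query": {"bool": {"must": [clause]}}}
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Keeping the operator as a parameter makes it easy to compare "and" and "or" against the same search text.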

Pitfall

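With default_operator set to AND, search precision improves, but the demands on the search text rise as well. Input that is too specific returns nothing: for example, "手機" matches several dozen records, but "大螢幕 手機" matches none. That hurts the user experience, so further optimization is needed. The direction for improvement: a word or phrase should not be split into tokens for matching, while whitespace should still split the input into separately matched parts.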
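One possible direction, sketched here and not tested against the index described in this post: split the user's input on whitespace and wrap each chunk in double quotes, since query_string treats quoted text as a phrase. `to_phrase_query` is a hypothetical helper, and whether phrase matching behaves well depends on the analyzer in use.

```python
def to_phrase_query(user_input):
    """Quote each whitespace-separated chunk so it is searched as a phrase."""
    quoted = []
    for chunk in user_input.split():
        # Escape backslashes first, then double quotes, inside the phrase.
        escaped = chunk.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return " ".join(quoted)

print(to_phrase_query("大螢幕 手機"))
```

The resulting string goes into the "query" field; the phrases are still combined by default_operator, so the choice between AND and OR remains a separate decision.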

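I haven't found a complete solution yet. If anyone knows one, please leave a comment. I am new to the programming industry and work in Java; with your help I could surely avoid many pitfalls. Thanks in advance.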