
How Arrays Are Indexed in Elasticsearch

1. Introduction

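Elasticsearch has no dedicated array type: any field can hold zero or more values. When a field holds more than one value, it is simply treated as an array of that field's type. The mapping type itself does not change, and all values in the array must share the same data type.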

Using video data as an example, this article shows how Elasticsearch indexes array data and how to search the values inside an array field.

The test video data has the following format:

{
    "media_id": 88992211,
    "tags": ["電影","科技","恐怖","電競"]
}

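media_id is the video ID and tags holds the video's tags. A video can carry several tags; here they are 電影 (movie), 科技 (technology), 恐怖 (horror), and 電競 (esports). The business requirement is to look up all videos that carry a given tag.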
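
The requirement above amounts to an inverted lookup from tag to videos. As a minimal sketch of that idea (plain Python, not Elasticsearch itself; the two extra documents and the helper name build_tag_index are hypothetical and only there for illustration):

```python
from collections import defaultdict

# Hypothetical sample documents in the same shape as the test data above.
videos = [
    {"media_id": 88992211, "tags": ["電影", "科技", "恐怖", "電競"]},
    {"media_id": 88992212, "tags": ["電影", "電競"]},
    {"media_id": 88992213, "tags": ["科技"]},
]

def build_tag_index(docs):
    """Map each whole tag to the sorted list of media_ids that carry it."""
    index = defaultdict(set)
    for doc in docs:
        for tag in doc["tags"]:
            index[tag].add(doc["media_id"])
    return {tag: sorted(ids) for tag, ids in index.items()}

tag_index = build_tag_index(videos)
print(tag_index["電影"])  # [88992211, 88992212]
print("影" in tag_index)  # False: only whole tags are keys
```

Whether Elasticsearch behaves like this lookup depends on how the tags field is analyzed, which is what the rest of the article examines.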

The demonstration uses an Elasticsearch cluster running version 7.6.2.

2. Demonstration

2.1 Create the index

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "type": "text"
      }
    }
  }
}

2.2 Write test data to the test_arrays index

POST test_arrays/_doc
{
  "media_id": 887722,
  "tags": [
    "電影",
    "科技",
    "恐怖",
    "電競"
  ]
}

2.3 Inspect how test_arrays indexes the tags field

{
  "tokens" : [
    {
      "token" : "電",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "影",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "科",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 102
    },
    {
      "token" : "技",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 103
    },
    {
      "token" : "恐",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 204
    },
    {
      "token" : "怖",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 205
    },
    {
      "token" : "電",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 306
    },
    {
      "token" : "競",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 307
    }
  ]
}

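The response, which has the shape of _analyze API output for the tags field, shows that each value in the tags array is split into several tokens: the standard analyzer, the default for a text field, emits one token per Chinese character. The jump in position between array values (from 1 to 102, and so on) comes from the default position_increment_gap of 100 on text fields.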
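
To make the position and offset numbers above easier to follow, here is a small Python simulation. It is a sketch, not Elasticsearch code: it assumes every value consists only of Chinese characters (so one token per character), a position gap of 100 between values, and an offset gap of 1 between values, as the output above suggests. The function names are hypothetical.

```python
POSITION_INCREMENT_GAP = 100  # default gap for text fields
OFFSET_GAP = 1  # assumption: one offset unit between array values

def analyze_array(values, gap=POSITION_INCREMENT_GAP):
    """Emit one token per character and add a position gap between values."""
    tokens = []
    position = -1
    offset_base = 0
    for i, value in enumerate(values):
        if i > 0:
            position += gap  # skip ahead before each further array value
        for j, ch in enumerate(value):
            position += 1
            tokens.append({
                "token": ch,
                "start_offset": offset_base + j,
                "end_offset": offset_base + j + 1,
                "position": position,
            })
        offset_base += len(value) + OFFSET_GAP
    return tokens

def phrase_matches(tokens, phrase):
    """True when the characters of a non-empty phrase sit at consecutive positions."""
    positions = {}
    for t in tokens:
        positions.setdefault(t["token"], set()).add(t["position"])
    return any(
        all(p + i in positions.get(ch, set()) for i, ch in enumerate(phrase))
        for p in positions.get(phrase[0], set())
    )

tokens = analyze_array(["電影", "科技", "恐怖", "電競"])
print([t["position"] for t in tokens])  # [0, 1, 102, 103, 204, 205, 306, 307]
print(phrase_matches(tokens, "電影"))  # True: both characters come from one value
print(phrase_matches(tokens, "影科"))  # False: the gap separates the two values
```

The gap exists so that a phrase query cannot match across two neighbouring array values; with a gap of 0, the made-up phrase 影科 would match in this simulation.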

2.4 Search the values in the tags array

POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "電影"
    }
  }
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.68324494,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "MyhnpXQBGXOapfjvSpOW",
        "_score" : 0.68324494,
        "_source" : {
          "media_id" : 887722,
          "tags" : [
            "電影",
            "科技",
            "恐怖",
            "電競"
          ]
        }
      }
    ]
  }
}
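
The _score of 0.68324494 above can be reproduced by hand. The sketch below assumes the default BM25 parameters (k1 = 1.2, b = 0.75), a scoring formula that keeps the (k1 + 1) factor (an assumption that is consistent with the score shown), and the single-document index from this demo, in which the tags field holds 8 tokens. The query 電影 is analyzed into the two terms 電 and 影; 電 occurs twice in the document (in 電影 and 電競) and 影 once.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term):
    """Score of one query term against one document under the assumptions above."""
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    tf = freq / (freq + K1 * (1 - B + B * doc_len / avg_doc_len))
    return (K1 + 1) * idf * tf

# One document in the index, 8 tokens in its tags field.
score_dian = bm25_term_score(freq=2, doc_len=8, avg_doc_len=8, doc_count=1, docs_with_term=1)
score_ying = bm25_term_score(freq=1, doc_len=8, avg_doc_len=8, doc_count=1, docs_with_term=1)
total = score_dian + score_ying

print(round(total, 6))       # 0.683245
print(round(score_ying, 6))  # 0.287682
```

The total is the sum of the per-term scores, so a query that is analyzed into a single term occurring once would score about 0.2877 under the same assumptions.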

Partial match on a single character:
POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "影"
    }
  }
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "MyhnpXQBGXOapfjvSpOW",
        "_score" : 0.2876821,
        "_source" : {
          "media_id" : 887722,
          "tags" : [
            "電影",
            "科技",
            "恐怖",
            "電競"
          ]
        }
      }
    ]
  }
}
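
Both queries above hit because the text field is indexed character by character, and a match query with the default OR operator succeeds as soon as any query token is present. The following sketch imitates that behavior in plain Python; the one-token-per-character rule is an approximation that only holds for Chinese characters, and the function names are hypothetical.

```python
def standard_tokens(text):
    """Approximation: one token per character, as seen for Chinese text above."""
    return [ch for ch in text if not ch.isspace()]

def index_text_field(values):
    """Collect the terms of every array value into one set for the field."""
    terms = set()
    for value in values:
        terms.update(standard_tokens(value))
    return terms

def match(indexed_terms, query):
    """match query with the default OR operator: any query token may hit."""
    return any(tok in indexed_terms for tok in standard_tokens(query))

doc_terms = index_text_field(["電影", "科技", "恐怖", "電競"])
print(match(doc_terms, "電影"))  # True
print(match(doc_terms, "影"))    # True: a single character is enough
print(match(doc_terms, "電視"))  # True: 電視 (television) is not a tag but shares 電
print(match(doc_terms, "音樂"))  # False: no character in common
```

The third case is the real problem for tag lookup: a tag that the video does not have can still return it.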

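The business requirement, however, is an exact match on a tag that returns all videos under it; a single character such as 影 should not hit. To get that behavior, change the type of the tags field to keyword. The type of an existing field cannot be changed in place, so delete test_arrays first (DELETE test_arrays), recreate it with the mappings below, and write the test document again: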

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}

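With this mapping each value in the tags array corresponds to exactly one token, so a query for a tag returns precisely the videos that carry it.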

{
  "tokens" : [
    {
      "token" : "電影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "科技",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "恐怖",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "電競",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    }
  ]
}
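
In contrast to a text field, a keyword field keeps every array value as one unanalyzed term. A minimal Python sketch of that behavior (an illustration with hypothetical function names, not Elasticsearch code):

```python
def index_keyword_field(values):
    """A keyword field stores each array value as a single, unanalyzed term."""
    return set(values)

def exact_match(indexed_terms, query):
    """The query string is compared with whole terms only."""
    return query in indexed_terms

keyword_terms = index_keyword_field(["電影", "科技", "恐怖", "電競"])
print(exact_match(keyword_terms, "電影"))  # True
print(exact_match(keyword_terms, "影"))    # False: partial tags no longer hit
print(exact_match(keyword_terms, "電視"))  # False: no shared-character false positives
```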

In real business scenarios the tags may not be stored as an array. Instead, all tags may be stored in a single string, separated by commas.

{
    "media_id": 88992211,
    "tags": "電影,科技,恐怖,電競"
}

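With this storage format, exact retrieval of the videos under a single tag is still possible by adjusting how the field is analyzed: a custom analyzer splits the string on commas at index time. After deleting the existing index, test_arrays can be configured as follows: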

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "tokenizer": "comma_tokenizer"
        }
      },
      "tokenizer": {
        "comma_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": ","
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "search_analyzer" : "simple",
        "analyzer" : "comma_analyzer",
        "type" : "text"
      }
    }
  }
}

Write one test document to the test_arrays index:

POST test_arrays/_doc
{
  "media_id": 887722,
  "tags": "電影,科技,恐怖,電競"
}

The tags field is indexed as follows; again each tag corresponds to exactly one token.

{
  "tokens" : [
    {
      "token" : "電影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "科技",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "恐怖",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "電競",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    }
  ]
}

Query by exact tag match.

Request:
POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "電影"
    }
  }
}
Response:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "3i2ipXQBGXOapfjv3THH",
        "_score" : 0.2876821,
        "_source" : {
          "media_id" : 887722,
          "tags" : "電影,科技,恐怖,電競"
        }
      }
    ]
  }
}
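
The query above works because the simple search analyzer leaves a purely Chinese tag such as 電影 as one token, which equals the indexed token. The sketch below approximates both analyzers in Python to show where the pairing holds and where it may not; the regular expression is only an approximation of "split on non-letters, then lowercase", and the tags 4K and Sci-Fi are hypothetical examples that do not appear in the original data.

```python
import re

def simple_analyzer(text):
    """Approximates the simple analyzer: split on non-letters, then lowercase."""
    return [m.lower() for m in re.findall(r"[^\W\d_]+", text)]

def comma_analyzer(text):
    """Approximates the index-time comma_analyzer: split on ASCII commas only."""
    return [part for part in text.split(",") if part]

def match(indexed_text, query):
    """Index with comma_analyzer, search with simple_analyzer, OR semantics."""
    indexed = set(comma_analyzer(indexed_text))
    return any(tok in indexed for tok in simple_analyzer(query))

print(match("電影,科技,恐怖,電競", "電影"))  # True: 電影 stays one token on both sides
print(match("電影,科技,恐怖,電競", "影"))    # False: partial tags do not hit
print(match("4K,Sci-Fi", "4K"))              # False: the query becomes ["k"]
print(match("4K,Sci-Fi", "Sci-Fi"))          # False: the query becomes ["sci", "fi"]
```

If tags can contain digits, punctuation, or uppercase letters, one option (not part of the original setup) would be a search analyzer that leaves the query string untouched, such as the built-in keyword analyzer, so that index-time and search-time tokens stay identical.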

3. Summary

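Elasticsearch lets a single data type hold both single and multiple values. This reduces the number of data types and lowers the complexity of index configuration, which makes for a clean design.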

Tag data can be organized either as an array or as a delimiter-separated string, which shows how flexible Elasticsearch's analysis configuration is.