
ElasticSearch 50: Index Management - Hands-On: Modifying Analyzers and Defining a Custom Analyzer

1. The default analyzer
The default analyzer is standard, which is built from:
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stopwords such as a, the, it, etc.
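
As a quick sanity check (an illustrative request, not part of the original lesson), you can run the standard analyzer through the _analyze API without creating an index:

GET /_analyze
{
  "analyzer": "standard",
  "text": "The Quick-Brown FOX"
}

This should return the tokens the, quick, brown, and fox: the tokenizer splits on the hyphen, the lowercase filter downcases FOX, and the survives because the stop filter is disabled by default.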

2. Modifying analyzer settings

Example: enable the English stopword token filter on the standard analyzer.
Here, es_std is the name of the new analyzer:
PUT /index0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}
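
To confirm that es_std was registered, you can read the settings back; the analysis block of the response should contain the analyzer we just defined:

GET /index0/_settings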



Test:

Analyze "a little dog" with the built-in standard analyzer:

GET /index0/_analyze
{
  "analyzer":"standard",
  "text":"a little dog"
}
Result:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}


Now analyze "a little dog" with our es_std analyzer; in the result, the stopword a has been filtered out:

GET /index0/_analyze
{
  "analyzer":"es_std",
  "text":"a little dog"
}
Result:

{
  "tokens": [
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}




3. Defining a custom analyzer
Example:
char_filter: type mapping, our own character-replacement filter; here we map & to and and name the filter &_to_and
my_stopwords: type stop, our own stopword list; here we configure two stopwords, a and the
my_analyzer: type custom, our custom analyzer. Before tokenization, html_strip removes HTML tags and our &_to_and character filter replaces & with and; tokenization uses the standard tokenizer; finally, the lowercase filter downcases every token and my_stopwords removes our stopwords.

PUT /index0
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and":{
          "type":"mapping",
          "mappings":["&=> and"]
        }
      },
      "filter":{
        "my_stopwords":{
          "type":"stop",
          "stopwords":["a","the"]
        }
      },
      "analyzer":{
        "my_analyzer":{
          "type":"custom",
          "char_filter":["html_strip","&_to_and"],
          "tokenizer":"standard",
          "filter":["lowercase","my_stopwords"]
        }
      }
    }
  }
}


Running this fails because the index already exists:
{
  "error": {
    "root_cause": [
      {
        "type": "index_already_exists_exception",
        "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
        "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
        "index": "index0"
      }
    ],
    "type": "index_already_exists_exception",
    "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
    "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
    "index": "index0"
  },
  "status": 400
}

Delete the index first with DELETE /index0, then run the PUT again.
This time it succeeds:
{
  "acknowledged": true,
  "shards_acknowledged": true
}
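
As an alternative to deleting and recreating the index, analysis settings can also be added to an existing index by closing it first, since analyzers cannot be changed while the index is open. A sketch, using a hypothetical analyzer name es_std2:

POST /index0/_close

PUT /index0/_settings
{
  "analysis": {
    "analyzer": {
      "es_std2": {
        "type": "standard",
        "stopwords": "_english_"
      }
    }
  }
}

POST /index0/_open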


Test our my_analyzer:
Sample text: tom and jery in the a house <a> & me HAHA
The result shows that a and the were filtered out, HAHA was lowercased, & became and, and the <a> tag was stripped:

GET /index0/_analyze
{
  "analyzer": "my_analyzer",
  "text":"tom and jery in the a house <a> & me HAHA"
}

Result:

{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jery",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "house",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "and",
      "start_offset": 32,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "me",
      "start_offset": 34,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "haha",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}




4. Using the custom analyzer in our index
Configure the content field of my_type to use our custom my_analyzer (mappings are updated with a PUT request):
PUT /index0/_mapping/my_type
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"my_analyzer"
        }
    }
}
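
To verify that the mapping took effect, you can analyze text against the field itself; _analyze resolves the analyzer from the field's mapping (a quick check, not part of the original lesson):

GET /index0/_analyze
{
  "field": "content",
  "text": "tom & jerry in the house"
}

This should produce the same tokens as calling my_analyzer directly: tom, and, jerry, in, house.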