
ElasticSearch 50: Index Management - Hands-On: Modifying Analyzers and Defining a Custom Analyzer

1. The default analyzer
The default analyzer is standard, which is built from:
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stopwords such as a, the, it, etc.
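
As a quick sanity check (an illustrative request, not part of the original lesson), you can run the standard analyzer through the _analyze API without creating an index:

GET /_analyze
{
  "analyzer": "standard",
  "text": "The Quick-Brown FOX"
}

This should return the tokens the, quick, brown, and fox: the tokenizer splits on the hyphen, the lowercase filter downcases FOX, and the survives because the stop filter is disabled by default.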

2. Modifying analyzer settings

Example: enable the English stopword token filter on the standard analyzer.
Here, es_std is the name of the new analyzer:
PUT /index0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}
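
To confirm that es_std was registered, you can read the settings back; the analysis block of the response should contain the analyzer we just defined:

GET /index0/_settings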



Test:

Analyze "a little dog" with the built-in standard analyzer:

GET /index0/_analyze
{
  "analyzer":"standard",
  "text":"a little dog"
}
Result:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}


Now analyze "a little dog" with our es_std analyzer; in the result, the stopword a has been filtered out:

GET /index0/_analyze
{
  "analyzer":"es_std",
  "text":"a little dog"
}
Result:

{
  "tokens": [
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}




3. Defining a custom analyzer
Example:
char_filter: type mapping, our own character-replacement filter; here we map & to and and name the filter &_to_and
my_stopwords: type stop, our own stopword list; here we configure two stopwords, a and the
my_analyzer: type custom, our custom analyzer. Before tokenization, html_strip removes HTML tags and our &_to_and character filter replaces & with and; tokenization uses the standard tokenizer; finally, the lowercase filter downcases every token and my_stopwords removes our stopwords.

PUT /index0
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and":{
          "type":"mapping",
          "mappings":["&=> and"]
        }
      },
      "filter":{
        "my_stopwords":{
          "type":"stop",
          "stopwords":["a","the"]
        }
      },
      "analyzer":{
        "my_analyzer":{
          "type":"custom",
          "char_filter":["html_strip","&_to_and"],
          "tokenizer":"standard",
          "filter":["lowercase","my_stopwords"]
        }
      }
    }
  }
}


Running this fails because the index already exists:
{
  "error": {
    "root_cause": [
      {
        "type": "index_already_exists_exception",
        "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
        "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
        "index": "index0"
      }
    ],
    "type": "index_already_exists_exception",
    "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
    "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
    "index": "index0"
  },
  "status": 400
}

Delete the index first with DELETE /index0, then run the PUT again.
This time it succeeds:
{
  "acknowledged": true,
  "shards_acknowledged": true
}
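
As an alternative to deleting and recreating the index, analysis settings can also be added to an existing index by closing it first, since analyzers cannot be changed while the index is open. A sketch, using a hypothetical analyzer name es_std2:

POST /index0/_close

PUT /index0/_settings
{
  "analysis": {
    "analyzer": {
      "es_std2": {
        "type": "standard",
        "stopwords": "_english_"
      }
    }
  }
}

POST /index0/_open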


Test our my_analyzer:
Sample text: tom and jery in the a house <a> & me HAHA
The result shows that a and the were filtered out, HAHA was lowercased, & became and, and the <a> tag was stripped:

GET /index0/_analyze
{
  "analyzer": "my_analyzer",
  "text":"tom and jery in the a house <a> & me HAHA"
}

Result:

{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jery",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "house",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "and",
      "start_offset": 32,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "me",
      "start_offset": 34,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "haha",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}




4. Using the custom analyzer in our index
Configure the content field of my_type to use our custom my_analyzer (mappings are updated with a PUT request):
PUT /index0/_mapping/my_type
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"my_analyzer"
        }
    }
}
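
To verify that the mapping took effect, you can analyze text against the field itself; _analyze resolves the analyzer from the field's mapping (a quick check, not part of the original lesson):

GET /index0/_analyze
{
  "field": "content",
  "text": "tom & jerry in the house"
}

This should produce the same tokens as calling my_analyzer directly: tom, and, jerry, in, house.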