Building an ElasticSearch Image with the IK Chinese Analyzer Plugin Using a DockerFile

阿新 • Published: 2020-07-29

Why install the IK Chinese analyzer?

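The analyzers that ship with ES are designed for English. Applied to Chinese text, they split it into single characters rather than words, which gives poor search results, so an index that contains Chinese needs a Chinese analyzer plugin.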

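I. Environment setup:

VMWare version: 15.5.5
Operating system: CentOS 7
Docker version: 19.03.12

File preparation:

Pull the ElasticSearch image, version 7.8.0:
docker pull elasticsearch:7.8.0
Download the Chinese analyzer plugin, version 7.8.0: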

# Create a docker folder under the Linux root directory and enter it
mkdir -p /docker
cd /docker
# Download the IK plugin archive (if wget is missing, run `yum install -y wget` first, then rerun the download)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.8.0/elasticsearch-analysis-ik-7.8.0.zip
# Optional: if wget is too slow, download the file in a browser and upload it to the Linux machine with rz (if rz is missing, run `yum install -y lrzsz` first, then rerun rz and select elasticsearch-analysis-ik-7.8.0.zip)
rz
# Unzip (if unzip is missing, run `yum install -y unzip` first, then rerun the unzip command)
unzip elasticsearch-analysis-ik-7.8.0.zip -d elasticsearch-analysis-ik

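Note: the ElasticSearch image version must match the IK analyzer version exactly. (I tried the elasticsearch:7.8.1 image with the elasticsearch-analysis-ik-7.8.0 plugin, and the resulting image did not work.)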
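
Because a mismatch only shows up after the image has been built, it can help to compare the two version strings first. The following is a minimal sketch and not part of the original setup: `versions_match` is a hypothetical helper name, and it assumes the image is tagged `elasticsearch:<version>` and the plugin archive keeps its release name `elasticsearch-analysis-ik-<version>.zip`.

```shell
# Hypothetical helper: succeeds only when the image tag and the plugin
# archive carry exactly the same version string.
versions_match() {
  local image="$1"   # e.g. elasticsearch:7.8.0
  local zip="$2"     # e.g. elasticsearch-analysis-ik-7.8.0.zip

  # Everything after the last ':' in the image name is the tag.
  local image_version="${image##*:}"

  # Drop any directory part, the fixed prefix, and the .zip suffix.
  local plugin_version="${zip##*/}"
  plugin_version="${plugin_version#elasticsearch-analysis-ik-}"
  plugin_version="${plugin_version%.zip}"

  [ "$image_version" = "$plugin_version" ]
}

if versions_match "elasticsearch:7.8.0" "elasticsearch-analysis-ik-7.8.0.zip"; then
  echo "versions match"
else
  echo "version mismatch: use the plugin release that equals the image tag" >&2
fi
```

The check is plain string comparison, so it only guards against picking the wrong file; it does not inspect the archive contents.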

II. Build the image and start the container:

1. Create the DockerFile: in the docker folder, run `vi DockerFile` and enter the following:

FROM elasticsearch:7.8.0
ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik

2. Build the image: from the docker folder, run `docker build -f DockerFile -t elasticsearch-ik:7.8.0 .`

Output of a successful build:

[root@localhost elasticsearch-ik]# docker build -f DockerFile -t elasticsearch-ik:7.8.0 .
Sending build context to Docker daemon  14.39MB
Step 1/2 : FROM elasticsearch:7.8.0
 ---> 121454ddad72
Step 2/2 : ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
 ---> Using cache
 ---> 2af03d5426d3
Successfully built 2af03d5426d3
Successfully tagged elasticsearch-ik:7.8.0

3. Create and start the container

docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch_test elasticsearch-ik:7.8.0

4. Verify that ElasticSearch started: `curl localhost:9200`

Output like the following means it started successfully:

[root@localhost docker]# curl localhost:9200
{
  "name" : "9f832bbeb44a",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "8GAjHyQEToO6PMl8dDoemQ",
  "version" : {
    "number" : "7.8.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date" : "2020-06-14T19:35:50.234439Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
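
ElasticSearch needs some time to start after `docker run` returns, so a `curl` issued immediately may fail even though nothing is wrong. A minimal sketch of a retry loop follows; `wait_for_es` is a hypothetical helper name, and the default URL, attempt count, and delay are assumptions to adjust for your machine.

```shell
# Hypothetical helper: poll the HTTP port until ElasticSearch answers
# or the attempts run out.
wait_for_es() {
  local url="${1:-http://localhost:9200}"
  local attempts="${2:-30}"   # how many times to try
  local delay="${3:-2}"       # seconds to sleep between tries
  local i=1

  while [ "$i" -le "$attempts" ]; do
    # -s: no progress output, -f: treat HTTP errors as failure,
    # -o /dev/null: discard the response body.
    if curl -s -f -o /dev/null "$url"; then
      echo "ElasticSearch is up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done

  echo "ElasticSearch did not respond at $url" >&2
  return 1
}
```

Called as `wait_for_es http://localhost:9200 30 2` right after step 3, it returns 0 once the node answers and 1 if it never does, so it can gate the later test requests in a script.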

III. Test the analyzer:

Postman is used here.
Request URL: http://192.168.0.199:9200/_analyze (replace 192.168.0.199 with the IP of your own Docker host)
Method: POST
Request body (JSON):

{
    "analyzer": "chinese",
    "text": "今天是個好日子"
}

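Parameters:
analyzer: one of chinese, ik_max_word, or ik_smart. chinese is an analyzer that ships with ES and needs no plugin; ik_max_word (finest-grained segmentation) and ik_smart (coarsest segmentation, fewest tokens) are provided by the IK Chinese analyzer plugin.
text: the content to analyze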
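
The same request can also be sent from the command line instead of Postman. This is a minimal sketch: `analyze` is a hypothetical helper name, the default host is an assumption, and the body is built by plain string interpolation, so it only works for text without double quotes or backslashes.

```shell
# Hypothetical helper: POST a text to the _analyze endpoint using the
# chosen analyzer and print the JSON response.
analyze() {
  local analyzer="$1"                       # chinese | ik_smart | ik_max_word
  local text="$2"                           # content to analyze
  local host="${3:-http://localhost:9200}"  # assumed default address

  # ?pretty asks ElasticSearch to indent the JSON it returns.
  curl -s -X POST "$host/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "{\"analyzer\": \"$analyzer\", \"text\": \"$text\"}"
}
```

For example, `analyze ik_smart "今天是個好日子" http://192.168.0.199:9200` sends the request body used in test 2.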

1. Test with ES's built-in chinese analyzer

{
    "analyzer": "chinese",
    "text": "今天是個好日子"
}

Result:

{
    "tokens": [
        {
            "token": "今",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "天",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "個",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "好",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "日",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "子",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        }
    ]
}

2. Test with the IK analyzer ik_smart

{
    "analyzer": "ik_smart",
    "text": "今天是個好日子"
}

Result:

{
    "tokens": [
        {
            "token": "今天是",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "個",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "好日子",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

3. Test with the IK analyzer ik_max_word

{
    "analyzer": "ik_max_word",
    "text": "今天是個好日子"
}

Result:

{
    "tokens": [
        {
            "token": "今天是",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "個",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "好日子",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "日子",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 5
        }
    ]
}
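
The full responses are long, which makes the three analyzers hard to compare at a glance. As a rough sketch, the token list alone can be pulled out of a response with standard text tools; `tokens_only` is a hypothetical helper name, and the pattern matching assumes tokens contain no escaped double quotes (a JSON-aware tool would be more robust).

```shell
# Hypothetical helper: read an _analyze response on stdin and print
# only the token strings, separated by spaces, on one line.
tokens_only() {
  grep -o '"token" *: *"[^"]*"' |        # keep each "token": "..." pair
    sed 's/^"token" *: *"//; s/"$//' |   # strip the key and the quotes
    paste -sd' ' -                       # join the lines with spaces
}

# Shortened sample of the ik_smart result shown above.
printf '%s\n' '{"tokens": [{"token": "今天是"}, {"token": "個"}, {"token": "好日子"}]}' | tokens_only
```

Piping a saved response for each analyzer through the helper puts the three segmentations side by side.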
