Building an ElasticSearch Image with a DockerFile and Installing the IK Chinese Analyzer Plugin
阿新 • Published: 2020-07-29
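Why install the IK Chinese analyzer?
The analysis that ES provides is designed for English. Applied to Chinese, it splits the text into single characters rather than words, which works very poorly, so an index that contains Chinese needs a Chinese analyzer plugin.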
I. Environment setup:
- VMWare version: 15.5.5
- Operating system: CentOS7
- Docker version: 19.03.12
File preparation:
- Pull the ElasticSearch image, version 7.8.0
docker pull elasticsearch:7.8.0
- Download the Chinese analyzer plugin, version 7.8.0
# Create a docker directory under the Linux root and enter it
mkdir /docker
cd /docker
# Download the IK plugin (if wget is missing, run `yum install -y wget` first, then rerun the download)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.8.0/elasticsearch-analysis-ik-7.8.0.zip
# Optional: if wget is too slow, download the file in a browser and upload it to the Linux machine (if rz is missing, run `yum install -y lrzsz` first, then run the upload command and select elasticsearch-analysis-ik-7.8.0.zip)
rz
# Unzip (if unzip is missing, run `yum install -y unzip` first, then rerun the unzip command)
unzip elasticsearch-analysis-ik-7.8.0.zip -d elasticsearch-analysis-ik
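Note: the ElasticSearch image version must match the IK analyzer version. (I tried the elasticsearch:7.8.1 image with the elasticsearch-analysis-ik-7.8.0 plugin, and the resulting image was unusable.)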
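Since a version mismatch only shows up after the image is built, a small check before building can save a rebuild. The following is a minimal sketch, not part of the original steps: `check_ik_version` is a hypothetical helper, and it assumes the image is named `name:version` and the plugin archive follows the `elasticsearch-analysis-ik-<version>.zip` naming used above.

```shell
# Hypothetical helper: succeed only when the image tag and the plugin zip
# carry the same version string.
check_ik_version() {
  es_version="${1##*:}"                          # text after the last ':' in the image name
  ik_version="${2#elasticsearch-analysis-ik-}"   # drop the file name prefix
  ik_version="${ik_version%.zip}"                # drop the .zip extension
  if [ "$es_version" = "$ik_version" ]; then
    echo "OK: image and plugin are both $es_version"
  else
    echo "MISMATCH: image is $es_version, plugin is $ik_version" >&2
    return 1
  fi
}

check_ik_version "elasticsearch:7.8.0" "elasticsearch-analysis-ik-7.8.0.zip"
```

With the 7.8.1 image and the 7.8.0 plugin mentioned in the note, the same call would print a MISMATCH line to stderr and return a non-zero status.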
II. Build the image and start it:
1. Create the DockerFile: enter the docker directory, run vi DockerFile, and add the following two lines:
FROM elasticsearch:7.8.0
ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
2. Build the image: from the docker directory, run docker build -f DockerFile -t elasticsearch-ik:7.8.0 .
Output of a successful build:
[root@localhost elasticsearch-ik]# docker build -f DockerFile -t elasticsearch-ik:7.8.0 .
Sending build context to Docker daemon 14.39MB
Step 1/2 : FROM elasticsearch:7.8.0
 ---> 121454ddad72
Step 2/2 : ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
 ---> Using cache
 ---> 2af03d5426d3
Successfully built 2af03d5426d3
Successfully tagged elasticsearch-ik:7.8.0
3. Create and start the container
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch_test elasticsearch-ik:7.8.0
4. Verify that ElasticSearch started: curl localhost:9200
Output like the following means it started successfully:
[root@localhost docker]# curl localhost:9200
{
"name" : "9f832bbeb44a",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "8GAjHyQEToO6PMl8dDoemQ",
"version" : {
"number" : "7.8.0",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "757314695644ea9a1dc2fecd26d1a43856725e65",
"build_date" : "2020-06-14T19:35:50.234439Z",
"build_snapshot" : false,
"lucene_version" : "8.5.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
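A response from port 9200 shows that the node is up, but not that the plugin was loaded. As a rough sketch, the plugin listing can be checked for IK. `has_ik_plugin` is a hypothetical helper, and it assumes the plugin appears in the listing under a name containing `analysis-ik`, which may differ in other setups.

```shell
# Hypothetical helper: read a plugin listing on stdin and succeed if it
# mentions IK. Assumes the listed name contains "analysis-ik".
has_ik_plugin() {
  grep -q 'analysis-ik'
}

# Possible usage against the container started above (needs Docker and curl,
# so the calls are left commented out here):
#   docker exec elasticsearch_test bin/elasticsearch-plugin list | has_ik_plugin && echo "IK found"
#   curl -s 'localhost:9200/_cat/plugins?v' | has_ik_plugin && echo "IK found"
```

The first call lists the plugin directories inside the container, and the second asks the running node which plugins it loaded. Either one is enough for a quick check.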
III. Test the analyzers:
Postman is used here.
Request URL: http://192.168.0.199:9200/_analyze
Request method: POST
Format of the request body:
{
"analyzer": "chinese",
"text": "今天是個好日子"
}
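Parameters:
analyzer: one of chinese, ik_max_word, or ik_smart. chinese is the option that ES provides out of the box; ik_max_word (finest-grained segmentation) and ik_smart (coarsest segmentation, fewest tokens) come from the IK Chinese analyzer.
text: the content to analyze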
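For readers without Postman, the same request can be sent from the command line. This is a minimal sketch: `analyze_body` and `analyze` are hypothetical helper names, `ES_URL` is an assumed environment variable that defaults to the address used in this post, and no JSON escaping is done, so the text must not contain double quotes or backslashes.

```shell
# Hypothetical helper: print the JSON body for an _analyze request.
# No escaping is performed on the arguments.
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}\n' "$1" "$2"
}

# Hypothetical helper: POST that body to the _analyze endpoint with curl.
# ES_URL defaults to the host used in this post; override it for your machine.
analyze() {
  curl -s -X POST "${ES_URL:-http://192.168.0.199:9200}/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}

# Show the body that would be sent for the ik_smart analyzer.
analyze_body ik_smart "今天是個好日子"
```

With the container running, `analyze ik_smart "今天是個好日子"` sends the request itself. The `Content-Type` header matters because ES rejects request bodies that do not declare one.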
1. Test with the default analyzer (chinese)
{
"analyzer": "chinese",
"text": "今天是個好日子"
}
Result:
{
"tokens": [
{
"token": "今",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "天",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "是",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "個",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "好",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "日",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "子",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
}
]
}
2. Test with the IK analyzer ik_smart
{
"analyzer": "ik_smart",
"text": "今天是個好日子"
}
Result:
{
"tokens": [
{
"token": "今天是",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "個",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 1
},
{
"token": "好日子",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
}
]
}
3. Test with the IK analyzer ik_max_word
{
"analyzer": "ik_max_word",
"text": "今天是個好日子"
}
Result:
{
"tokens": [
{
"token": "今天是",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "今天",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "是",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 2
},
{
"token": "個",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 3
},
{
"token": "好日子",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "日子",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 5
}
]
}
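The three responses are easier to compare when reduced to their token lists. The sketch below is one way to do that with plain text tools. `list_tokens` is a hypothetical helper, and it assumes the token values contain no escaped double quotes; a real JSON parser would be more robust.

```shell
# Hypothetical helper: read an _analyze response on stdin and print one
# token per line. Text matching only, not a JSON parser.
list_tokens() {
  grep -o '"token" *: *"[^"]*"' | sed -e 's/^"token" *: *"//' -e 's/"$//'
}

# The ik_smart response from step 2, condensed to one token object per line.
list_tokens <<'EOF'
{"tokens": [
  {"token": "今天是", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0},
  {"token": "個", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 1},
  {"token": "好日子", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 2}
]}
EOF
```

This prints the three tokens 今天是, 個, and 好日子 on separate lines. Saving each response to a file and running it through `list_tokens` gives seven single characters for chinese, three tokens for ik_smart, and six for ik_max_word.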