Elasticsearch: Troubleshooting unassigned_shards on a Single Node

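## Symptoms

![kibana](https://cdn.devopsing.site/2020/20210112231802.png)

On a single-machine ELK deployment, connecting to Kibana produced the error below. Even after restarting the whole stack, Kibana still reported `Kibana server is not ready`.

```text
{"message":"all shards failed: [search_phase_execution_exception] all shards failed","statusCode":503,"error":"Service Unavailable"}
```

## Investigation

The ELK services had been working normally until recently. Pinging the IPs from inside the containers worked, and every service was in the `Up` state. Elasticsearch was also reachable at `http://localhost:9200/`, yet Kibana could not connect to it.

![ELK](https://cdn.devopsing.site/2020/20210112231919.png)

The Kibana log contained the following entry. It includes `no_shard_available_action_exception`, which points to a problem with shards. (The `stack` field is cut off in this excerpt.)

```json
{
  "type": "error",
  "@timestamp": "2020-09-15T00:41:09Z",
  "tags": [
    "warning",
    "stats-collection"
  ],
  "pid": 1,
  "level": "error",
  "error": {
    "message": "[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.8.11]: routing [null]]",
    "name": "Error",
    "stack": "[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.8.11]: routing [null]] :: {\"path\":\"/.kibana/doc/config%3A6.8.11\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][config:6.8.11]: routing [null]]\\\"}],routing [null]]"
  }
```

Next, I looked at the cluster with [cerebro, an ES visualization tool](https://blog.csdn.net/liumiaocn/article/details/98517815):

![cerebro](https://cdn.devopsing.site/2020/20210112231922.png)

At the time of the failure the status was actually red, not the yellow shown in this screenshot, and `heap/disk/cpu/load` were mostly red as well. The status probably changed to yellow because I had manually deleted several indices in the meantime. Once it was yellow, Kibana could reach ES again, but `yellow still means ES is not healthy`.

### Check the health of the single-node Elasticsearch

`curl -XGET http://localhost:9200/_cluster/health\?pretty`

```json
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 677,
  "active_shards" : 677,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 948,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 5,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 599,
  "active_shards_percent_as_number" : 41.559238796807854
}
```

The `unassigned_shards` value shows that a large number of shards have not been allocated. When I first checked, there were more than 1000.

### Find the indices with UNASSIGNED shards

`curl -XGET http://localhost:9200/_cat/shards`

![UNASSIGNED](https://cdn.devopsing.site/2020/20210112231926.png)

That roughly settles the cause: the failure is due to `unassigned_shards`. The next step is to fix it.

## Solution

1. In a multi-node cluster, you can consider using `POST /_cluster/reroute` to force the problem shards onto one of the nodes.
2. In the current single-node environment, the screenshot above shows 5 unassigned shards. An index here is created with 5 primary shards and 1 replica, so the cluster turns yellow as soon as the index exists. The root cause is that the cluster has replica shards that cannot be activated. On a single-node Elasticsearch cluster, the fix is to delete the indices that have replica shards and create new indices with the replica count set to 0, then check the cluster status again.

The following command sets `number_of_replicas=0` to bring the replica count down to 0. As the screenshot below shows, ES then turned green.

```shell
curl -XPUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d'
{
  "number_of_replicas": 0
}'
```

![Fix-UNASSIGNED](https://cdn.devopsing.site/2020/20210112231930.png)

## Background

**Replica shards** exist mainly for failover: if the node holding a primary shard goes down, a replica shard is promoted to primary. For that reason a replica shard and its primary cannot be placed on the same node. In a cluster with only one node there is no other node to receive the replicas, so every replica shard ends up unassigned. And because there is only one node, if the node holding the primary shards goes down the whole cluster goes down with it, so a replica would never get the chance to be promoted anyway.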

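To tell whether the unassigned shards are replicas (expected on a single node) or primaries (which is what turns the cluster red), the `_cat/shards` output can be summarized per index. The following is a minimal sketch, assuming the default `_cat/shards` column order (`index shard prirep state ...`); `summarize_unassigned` is a hypothetical helper name, and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Elasticsearch): reads `_cat/shards`
# text output on stdin and counts UNASSIGNED shards per index,
# split into primaries ("p") and replicas ("r").
# Assumes the default column order: index shard prirep state ...
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" {
         if ($3 == "p") primary[$1]++; else replica[$1]++
         seen[$1] = 1
       }
       END {
         for (i in seen)
           printf "%s primary=%d replica=%d\n", i, primary[i], replica[i]
       }' | LC_ALL=C sort
}

# Example run on inline sample data (illustrative only).
summarize_unassigned <<'EOF'
logstash-2020.09.14 0 p STARTED 1200 1.1mb 172.18.0.2 node-1
logstash-2020.09.14 0 r UNASSIGNED
logstash-2020.09.14 1 p STARTED 1180 1.0mb 172.18.0.2 node-1
logstash-2020.09.14 1 r UNASSIGNED
.kibana 0 p UNASSIGNED
EOF
```

Against a live node, the same function could be fed with `curl -s http://localhost:9200/_cat/shards | summarize_unassigned`. Indices that report only unassigned replicas should clear once `number_of_replicas` is 0, while an index with an unassigned primary needs a closer look.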