Troubleshooting and Fixing a Red Health Status on a Production ELK Cluster
阿新 • Published: 2019-02-01
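Our data analysis platform had been running normally, but after a period of not watching it closely I found that log index data was no longer being generated, and that this had been going on for many days. Setup at the time: one machine, a single Elasticsearch node (ES below; a one-node "empty cluster"), 1000+ shards, about 200 GB of data.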
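Troubleshooting
Checking server memory and CPU
Use top to check CPU and memory usage on the server. At the time, the ES process on my server was using more than 90% of the CPU, so something was clearly wrong.
Memory usage was also very high: my 8 GB server had only about 150 MB free, which pointed to ES as the culprit.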
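ES cluster status
Checking the ES cluster health showed that status was red, which means that some primary shards are unavailable. In my case, historical data could still be queried, but no new index could be created.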
curl 'http://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 663,
"active_shards" : 663,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 6,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 99.10313901345292
}
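For a scripted check (from cron, for example), the status field can be pulled out of this response. The following is a minimal sketch rather than the tooling used on this cluster: health_status is a hypothetical helper name, and it assumes the pretty-printed layout shown above, with one field per line.

```shell
# Hypothetical helper: read a pretty-printed _cluster/health response on
# stdin and print the value of its cluster-level "status" field.
health_status() {
  # sed prints the captured value for every line that holds a "status" field;
  # head keeps the first one, which is the cluster-level status.
  sed -n 's/^[[:space:]]*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example (assumes ES is reachable on localhost:9200, as in this article):
#   curl -s 'http://localhost:9200/_cluster/health?pretty' | health_status
```

A monitoring job could compare the printed value against green and alert on anything else.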
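Checking the status of each index showed that most indices were red and therefore unavailable. Because too many indices were open, ES was consuming a large share of CPU and memory, which made logstash unusable. As a result, no new index data could be created and data was lost.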
curl -XGET "http://localhost:9200/_cat/indices?v"
health status index         pri rep docs.count docs.deleted store.size pri.store.size
red    open   jr-2016.12.20   3   0
red    open   jr-2016.12.21   3   0
red    open   jr-2016.12.22   3   0
red    open   jr-2016.12.23   3   0
red    open   jr-2016.12.24   3   0
red    open   jr-2016.12.25   3   0
red    open   jr-2016.12.26   3   0
red    open   jr-2016.12.27   3   0
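With hundreds of indices, it helps to filter this listing down to the red ones. A small sketch, assuming the column order shown above (health first, index name third); red_indices is a hypothetical helper name, not part of ES.

```shell
# Hypothetical helper: read `_cat/indices?v` output on stdin and print the
# names of the indices whose health column is "red".
red_indices() {
  # Column 1 is health and column 3 is the index name. The header row has
  # "health" in column 1, so it never matches.
  awk '$1 == "red" { print $3 }'
}

# Example (assumes ES is reachable on localhost:9200):
#   curl -s 'http://localhost:9200/_cat/indices?v' | red_indices
```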
Query failures caused by unavailable ES shards
Exception thrown when querying ES:
[2018-08-06 18:27:24,553][DEBUG][action.search ] [Godfrey Calthrop] All shards failed for phase: [query]
[jr-2018.08.06][[jr-2018.08.06][2]] NoShardAvailableActionException[null]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:129)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:47)
at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52)
at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:582)
at org.elasticsearch.rest.action.search.RestSearchAction.handleRequest(RestSearchAction.java:85)
at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:54)
at org.elasticsearch.rest.RestController.executeHandler(RestController.java:205)
at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:166)
at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:128)
at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:86)
at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:449)
at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:61)
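NoShardAvailableActionException means the search found no active copy of the shard it needed. One way to see which shards are affected is the _cat/shards endpoint. The sketch below filters its output for unassigned primaries; unassigned_primaries is a hypothetical helper name, and it assumes the four columns requested in the example comment.

```shell
# Hypothetical helper: read `_cat/shards?h=index,shard,prirep,state` output
# on stdin and print "index shard" for every primary that is UNASSIGNED.
unassigned_primaries() {
  # Columns: 1 = index, 2 = shard number, 3 = p (primary) or r (replica),
  # 4 = shard state.
  awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2 }'
}

# Example (assumes ES is reachable on localhost:9200):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state' | unassigned_primaries
```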
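Solution
The investigation above suggests that too many historical indices had been left open, which drove ES CPU and memory usage so high that the cluster became unusable.
# Close indices that are no longer needed to reduce memory usage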
curl -XPOST "http://localhost:9200/index_name/_close"
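Closing indices one at a time is tedious when there are hundreds of daily indices. The sketch below picks out the ones older than a cutoff date so they can be closed in a loop. It assumes the jr-YYYY.MM.DD naming seen in this article; indices_before and the cutoff value in the example are hypothetical.

```shell
# Hypothetical helper: read index names (one per line) on stdin and print
# those named <prefix>YYYY.MM.DD whose date is strictly before the cutoff.
# Zero-padded dates sort lexicographically, so a string comparison is enough.
#   $1 = index name prefix, e.g. "jr-"
#   $2 = cutoff date in YYYY.MM.DD form, e.g. "2018.07.01"
indices_before() {
  awk -v p="$1" -v c="$2" '
    index($1, p) == 1 {
      d = substr($1, length(p) + 1)
      # Appending "" forces a string (not numeric) comparison.
      if (d ~ /^[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/ && (d "") < (c ""))
        print $1
    }'
}

# Example (assumes ES on localhost:9200; the cutoff is only an illustration):
#   curl -s 'http://localhost:9200/_cat/indices?h=index' |
#     indices_before 'jr-' '2018.07.01' |
#     while read -r name; do
#       curl -s -XPOST "http://localhost:9200/${name}/_close"
#     done
```

Running the pipeline without the final loop first shows which indices would be closed.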
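A small detour
After closing the non-hot indices, the health of my ES cluster was still red. It eventually occurred to me that a single red index keeps the whole cluster red, and sure enough, that was the case: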
curl -XGET 'http://10.252.148.85:9200/_cluster/health?level=indices'
{
"cluster_name": "elasticsearch",
"status": "red",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 660,
"active_shards": 660,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 9,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 98.65470852017937,
"indices": {
"jr-2018.08.06": {
"status": "red",
"number_of_shards": 3,
"number_of_replicas": 0,
"active_primary_shards": 0,
"active_shards": 0,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 3
}
}
}
Fix: delete this index. It held dirty data produced while I was troubleshooting, so I deleted the index outright.
curl -XDELETE 'http://10.252.148.85:9200/jr-2018.08.06'
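Deleting an index cannot be undone, so it can be worth wrapping the call in a small guard. This is only a sketch under assumptions: delete_index and the CONFIRM variable are hypothetical names, not part of ES or curl. It refuses empty, wildcard, comma-separated and _all names, and only prints the command unless CONFIRM=yes is set.

```shell
# Hypothetical wrapper around the DELETE call above.
#   $1 = host:port, $2 = exact index name
delete_index() {
  host="$1"
  name="$2"
  case "$name" in
    ''|*'*'*|*,*|_all)
      # Reject anything that could match more than one index.
      echo "refusing to delete '$name'" >&2
      return 1
      ;;
  esac
  if [ "${CONFIRM:-no}" = "yes" ]; then
    curl -s -XDELETE "http://${host}/${name}"
  else
    # Default: show what would be run without touching the cluster.
    echo "dry run: curl -XDELETE http://${host}/${name}"
  fi
}

# Example:
#   delete_index 10.252.148.85:9200 jr-2018.08.06              # dry run
#   CONFIRM=yes delete_index 10.252.148.85:9200 jr-2018.08.06  # really delete
```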
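Summary
When ES runs as a single node, keep an eye on index status and server monitoring, and clean up or close indices you no longer need in good time, so that this situation does not arise. On the road of technical growth, I walk alongside you.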