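Tracking Down a Eureka Server Cluster Restart Problem
Problem
While restarting the Eureka Server cluster in production, we found that the order client failed when calling the distributed ID generation service: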
Caused by: com.netflix.client.ClientException: Load balancer does not have available server for client: IDG
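The error shows that the order service could no longer reach the IDG service.
Thinking About the Problem
The Eureka Client cache is refreshed by a scheduled thread that performs a delta update every 30 seconds, and Ribbon reads service information from the Eureka Client's local cache every 30 seconds. The error above was reported by Ribbon, which means Ribbon had no information about the IDG service. Later debugging showed that the Eureka Client's local cache was empty. This raises a question: while the Eureka Server is restarting, or has just finished restarting, something goes wrong when the Eureka Client fetches the registry and writes it into its local cache.
Tracing the Problem
Check the Eureka Client log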
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - Got delta update with apps hashcode
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The total number of instances fetched by the delta processor : 0
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The Reconcile hashcodes do not match, client : UP_5_, server : . Getting the full registry
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG c.n.discovery.shared.MonitoredConnectionManager - Get connection: {}->http://server1:7010, timeout = 5000
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - [{}->http://server1:7010] total kept alive: 1, total issued: 1, total allocated: 2 out of 200
2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - Getting free connection [{}->http://server1:7010][null]
2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG org.apache.http.impl.client.DefaultHttpClient - Stale connection check
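The log fragment above was written while the client's CacheRefreshExecutor (the cache refresh thread pool) was running its task:
Log entry 1: the client receives the hash code that comes with the delta update.
Log entry 2: the delta contains 0 instances in total.
Log entry 3: after the delta is merged, the hash code from the server does not match the hash code computed on the client (client = UP_5_, server = ""), so a full fetch is required.
Log entries 4 to 7: the client issues the full fetch.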
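The cause is now clear. While the Eureka Server was restarting, its registry was empty, and the Eureka Client happened to fetch from it at exactly that moment. Because the hash codes did not match,
the client issued a full fetch and overwrote its local cache with the result. The local cache ended up holding wrong (empty) data, which led to the failed calls.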
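To make the mismatch concrete, here is a minimal sketch of the comparison. It assumes the reconcile hash is built by concatenating one `<STATUS>_<count>_` segment per instance status, which is consistent with the `UP_5_` and empty values in the log above. The class and method names (`ReconcileSketch`, `reconcileHashCode`, `needsFullFetch`) are hypothetical; this is a simplified model, not the Eureka source.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical model of the hash comparison; not the Eureka source.
class ReconcileSketch {

    // Builds a string such as "UP_5_": one "<STATUS>_<count>_" segment per
    // status, ordered by status name. An empty registry yields "".
    static String reconcileHashCode(List<String> instanceStatuses) {
        Map<String, Integer> countsByStatus = new TreeMap<>();
        for (String status : instanceStatuses) {
            countsByStatus.merge(status, 1, Integer::sum);
        }
        StringBuilder hash = new StringBuilder();
        for (Map.Entry<String, Integer> entry : countsByStatus.entrySet()) {
            hash.append(entry.getKey()).append('_').append(entry.getValue()).append('_');
        }
        return hash.toString();
    }

    // A mismatch means the delta cannot be trusted, so the client falls back
    // to a full fetch and replaces its local cache with the result.
    static boolean needsFullFetch(String clientHash, String serverHash) {
        return !clientHash.equals(serverHash);
    }

    public static void main(String[] args) {
        List<String> clientCache = Arrays.asList("UP", "UP", "UP", "UP", "UP");
        List<String> restartedServer = Collections.emptyList();

        String clientHash = reconcileHashCode(clientCache);
        String serverHash = reconcileHashCode(restartedServer);
        System.out.println("client=" + clientHash + " server=" + serverHash
                + " fullFetch=" + needsFullFetch(clientHash, serverHash));
    }
}
```

Running `main` prints `client=UP_5_ server= fullFetch=true`, the same situation as log entry 3: five UP instances on the client, nothing on the freshly restarted server.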
Checking the Eureka Server configuration revealed the following problem:
eureka:
  instance:
    hostname: server2
  client:
    serviceUrl:
      defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
    fetch-registry: false
    register-with-eureka: true  # register this node with the Eureka cluster
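fetch-registry = false means that when the Eureka Server registers with the Eureka cluster as a client, it does not fetch the full registry. However, when a Eureka Server starts up in its server role, it reads the registry from its own local client (with register-with-eureka: true it is itself registered with Eureka as a client) and registers those entries into its own registry. For details, see the article "Understanding Eureka Server Cluster Synchronization in Depth (10)".
In other words, right after startup the registry held by the Eureka Server in its server role is empty. It can only fill in its data gradually, through the renewals that the cluster replicates to it afterwards.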
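The effect of the flag on the startup sync can be sketched as follows. This is a simplified model under the description above, in which the server copies whatever its embedded client holds locally; the class `SyncUpSketch`, its methods, and the instance names are all made up for illustration and are not Eureka APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the startup sync; not the Eureka source.
class SyncUpSketch {

    // What the server's embedded client holds locally: a copy of the peers'
    // registry when fetch-registry is true, nothing when it is false.
    static List<String> localClientCache(boolean fetchRegistry, List<String> peerRegistry) {
        List<String> cache = new ArrayList<>();
        if (fetchRegistry) {
            cache.addAll(peerRegistry);
        }
        return cache;
    }

    // Copies the embedded client's cache into the server registry and
    // returns how many instances were copied.
    static int syncUp(List<String> clientCache, List<String> serverRegistry) {
        serverRegistry.addAll(clientCache);
        return clientCache.size();
    }

    public static void main(String[] args) {
        // Made-up instance names, for illustration only.
        List<String> peerRegistry = Arrays.asList("IDG-1", "IDG-2", "ORDER-1");
        for (boolean fetchRegistry : new boolean[] {false, true}) {
            List<String> serverRegistry = new ArrayList<>();
            int count = syncUp(localClientCache(fetchRegistry, peerRegistry), serverRegistry);
            System.out.println("fetch-registry=" + fetchRegistry + " -> synced " + count + " instances");
        }
    }
}
```

With the flag off, the sync has nothing to copy, so the server opens with an empty registry and depends entirely on later replication.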
Based on this understanding, the configuration was changed as follows:
eureka:
  instance:
    hostname: server2
  client:
    serviceUrl:
      defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
    fetch-registry: true
    register-with-eureka: true  # register this node with the Eureka cluster
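With fetch-registry set to true, a freshly started Eureka Server can register all of the fetched registry data into its own node.
Concurrency testing showed, however, that this configuration only lowers the probability of the problem and cannot rule it out completely. The reason is as follows: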
protected void initEurekaServerContext() throws Exception {
    // ..... a lot of code omitted
    // Sync registry data from the other nodes
    int registryCount = this.registry.syncUp();
    // Set the Eureka status to UP; this also starts a scheduled task that
    // evicts (takes offline) clients whose heartbeats have stopped
    this.registry.openForTraffic(this.applicationInfoManager, registryCount);
    // ..... a lot of code omitted
    EurekaMonitors.registerAllStats();
}

// The registry's openForTraffic, called above
@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    // Expected number of renewals per minute
    this.expectedNumberOfRenewsPerMin = count * 2;
    // Minimum number of renewals per minute
    this.numberOfRenewsPerMinThreshold =
            (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
    logger.info("Got " + count + " instances from neighboring DS node");
    logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    // Mark this instance as UP
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    // Start the scheduled eviction task (runs every 60 seconds by default),
    // which removes instances whose lease has expired without renewal
    super.postInit();
}
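At first glance the code above looks fine. But suppose the following two things happen at the same time:
Eureka Client: delta sync
Eureka Server: syncing registry data from the cluster nodes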
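If a Eureka Client comes to pull data before the Eureka Server has finished syncing the node data, the client receives
incomplete or empty data. That still causes the problem described above, only with a lower probability.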
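The interleaving can be reproduced deterministically with a small model. The sketch below assumes the server copies instances one at a time and answers fetches at any point during that copy; `PartialSyncSketch` and the instance names are hypothetical and only illustrate the timing, not Eureka's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of a fetch that lands in the middle of the
// server's startup sync; not the Eureka source.
class PartialSyncSketch {
    private final List<String> registry = new ArrayList<>();

    // One step of the startup sync: copy a single instance from a peer.
    void syncOne(String instance) {
        registry.add(instance);
    }

    // Snapshot returned to a client that fetches the registry right now.
    List<String> fetch() {
        return new ArrayList<>(registry);
    }

    public static void main(String[] args) {
        // Made-up instance names, for illustration only.
        List<String> peerInstances = Arrays.asList("IDG-1", "IDG-2", "ORDER-1");
        PartialSyncSketch server = new PartialSyncSketch();

        server.syncOne(peerInstances.get(0));
        List<String> midSync = server.fetch();      // client arrives too early

        server.syncOne(peerInstances.get(1));
        server.syncOne(peerInstances.get(2));
        List<String> afterSync = server.fetch();    // client arrives after the sync

        System.out.println("mid-sync fetch:   " + midSync);
        System.out.println("after-sync fetch: " + afterSync);
    }
}
```

The early fetch sees only part of the registry, and a client that overwrites its cache with that snapshot loses the instances that had not been copied yet.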
Complete Solution
Modify the configuration file
eureka:
  instance:
    hostname: server1
    initial-status: STARTING
  client:
    serviceUrl:
      defaultZone: http://server2:7011/eureka/,http://server3:7012/eureka/
    fetch-registry: true
    register-with-eureka: true
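Adding eureka.instance.initial-status: STARTING means that a freshly started Eureka Server does not announce itself as available right away. It stays in STARTING until the data sync has finished,
and only then is its status changed to UP.
Custom filter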
public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException {
    // Status of this Eureka Server's own instance
    InstanceInfo myInfo = ApplicationInfoManager.getInstance().getInfo();
    InstanceStatus status = myInfo.getStatus();
    // Reject every HTTP request until the server has been marked UP
    if (status != InstanceStatus.UP && response instanceof HttpServletResponse) {
        throw new RuntimeException("Eureka Server status is not UP, do not provide service");
    }
    chain.doFilter(request, response);
}
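With this custom filter, the Eureka Server does not serve requests while its status is anything other than UP. The status is changed to UP only after the Eureka Server has finished starting and finished syncing data,
which prevents a Eureka Client from fetching incomplete data.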
@Bean
public CustomerStatusFilter statusFilter() {
    return new CustomerStatusFilter();
}

@Bean
public FilterRegistrationBean someFilterRegistration() {
    FilterRegistrationBean registration = new FilterRegistrationBean();
    registration.setFilter(statusFilter());
    registration.addUrlPatterns("/*");
    return registration;
}
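The intended effect of the filter can be modelled without the servlet API. The sketch below is a simplified stand-in: `StatusGateSketch`, its nested `Server` class, and `refresh` are hypothetical names, and it assumes that a client whose fetch fails simply keeps its previous cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the status gate; not the Eureka source.
class StatusGateSketch {
    enum Status { STARTING, UP }

    static class Server {
        Status status = Status.STARTING;
        final List<String> registry = new ArrayList<>();

        // Mirrors the filter: refuse to answer until the status is UP.
        List<String> fetch() {
            if (status != Status.UP) {
                throw new IllegalStateException("Eureka Server status is not UP");
            }
            return new ArrayList<>(registry);
        }
    }

    // Client-side refresh. Assumption: a failed fetch leaves the cache as it was.
    static List<String> refresh(List<String> currentCache, Server server) {
        try {
            return server.fetch();
        } catch (IllegalStateException e) {
            return currentCache;
        }
    }

    public static void main(String[] args) {
        // Made-up instance names, for illustration only.
        List<String> cache = Arrays.asList("IDG-1", "IDG-2");
        Server server = new Server();

        server.registry.add("IDG-1");                 // sync is only half done
        cache = refresh(cache, server);
        System.out.println("while STARTING: " + cache);

        server.registry.add("IDG-2");                 // sync finished
        server.status = Status.UP;
        cache = refresh(cache, server);
        System.out.println("after UP:       " + cache);
    }
}
```

In this model the half-synced registry never reaches the client: the early refresh is rejected and the old cache survives until the server is UP.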
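Drawback: with this filter in place, if the whole cluster is down and the nodes are started one at a time, it takes 150 seconds by default before a node can serve requests normally.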