
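Tracking Down a Eureka Server Cluster Restart Problem

The problem

While restarting the Eureka Server cluster in production, we found that the order service client failed when calling the distributed ID generation service (IDG):

Caused by: com.netflix.client.ClientException: Load balancer does not have available server for client: IDG

In other words, the order service could no longer reach the IDG service.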

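Thinking it through

The Eureka Client cache is refreshed by a scheduled thread that performs an incremental (delta) update every 30 seconds, and Ribbon reads service information from the Eureka Client's local cache every 30 seconds. The error above is raised by Ribbon, which means Ribbon held no information about the IDG service. Further debugging showed that the Eureka Client's local cache was empty. This raises the real question: when the Eureka Server is restarting, or has just finished restarting, something goes wrong as the Eureka Client fetches the registry and applies it to its local cache.

Tracking down the problem

Checking the Eureka Client log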


2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - Got delta update with apps hashcode 
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The total number of instances fetched by the delta processor : 0
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The Reconcile hashcodes do not match, client : UP_5_, server : . Getting the full registry
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG c.n.discovery.shared.MonitoredConnectionManager - Get connection: {}->http://server1:7010, timeout = 5000
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - [{}->http://server1:7010] total kept alive: 1, total issued: 1, total allocated: 2 out of 200
2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - Getting free connection [{}->http://server1:7010][null]
2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG org.apache.http.impl.client.DefaultHttpClient - Stale connection check

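The log contains the fragment above, written while the client's CacheRefreshExecutor (the cache refresh thread pool) was running its task:

Line 1: the client receives the delta update together with the apps hash code.

Line 2: the delta contains 0 instances.

Line 3: after the delta is merged, the server-side hash code does not match the client's local hash code (client = UP_5_, server = ""), so a full fetch is required.

Lines 4-7: the full fetch is issued.

The cause is now clear. While the Eureka Server is restarting, its registry is empty, and the Eureka Client happened to fetch from it at exactly that moment. Because the hash codes do not match, the client issues a full fetch and overwrites its local cache with the empty result. The local cache is now wrong, and service calls fail.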
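The sequence described above can be imitated with a small standalone sketch. It does not use the Eureka library: `DeltaReconcileSketch`, `reconcileHashCode` and `refresh` are hypothetical names, and instances are reduced to their status strings. The hash format (status name, underscore, count, underscore) is inferred from the `UP_5_` value in the log, so treat the details as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DeltaReconcileSketch {

    /** Builds one "STATUS_count_" segment per status, ordered by status name. */
    public static String reconcileHashCode(List<String> instanceStatuses) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String status : instanceStatuses) {
            counts.merge(status, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('_').append(e.getValue()).append('_');
        }
        return sb.toString();
    }

    /**
     * One refresh cycle in which the delta itself carries no changes.
     * If the hash codes match, the local cache is kept; otherwise it is
     * replaced by whatever the full fetch returns, even if that is empty.
     */
    public static List<String> refresh(List<String> localCache,
                                       String serverHash,
                                       List<String> serverFullRegistry) {
        String clientHash = reconcileHashCode(localCache);
        if (clientHash.equals(serverHash)) {
            return localCache;
        }
        return new ArrayList<>(serverFullRegistry);
    }

    public static void main(String[] args) {
        List<String> local = Arrays.asList("UP", "UP", "UP", "UP", "UP");
        List<String> restartedServer = Collections.<String>emptyList();

        System.out.println("client hash: " + reconcileHashCode(local));
        System.out.println("server hash: '" + reconcileHashCode(restartedServer) + "'");

        List<String> after = refresh(local, reconcileHashCode(restartedServer), restartedServer);
        System.out.println("instances in local cache after refresh: " + after.size());
    }
}
```

In this sketch the client starts with five UP instances, the freshly restarted server reports an empty hash, and the local cache ends up empty, which is the state Ribbon then reads from.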

Inspecting the Eureka Server configuration revealed the following:


eureka:
  instance:
    hostname: server2
  client:
    serviceUrl:
      defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
    fetch-registry: false
    register-with-eureka: true   # register this node with the Eureka cluster

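With fetch-registry = false, the Eureka Server, acting as a client registered with the Eureka cluster, does not fetch the full registry. However, when the Eureka Server starts up as a server, it reads registry information from its local client (with register-with-eureka: true it is itself registered with Eureka as a client) and registers those instances with itself. For details, see: 深入理解Eureka Server叢集同步(十) (In-Depth Understanding of Eureka Server Cluster Synchronization, Part 10).

In other words, when the Eureka Server has just started, its server-side registry is empty, and it can only fill itself in gradually through subsequent renewal replication from the cluster.

Based on this, the configuration was changed to: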


eureka:
  instance:
    hostname: server2
  client:
    serviceUrl:
      defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
    fetch-registry: true
    register-with-eureka: true   # register this node with the Eureka cluster

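With fetch-registry set to true, the Eureka Server can load the full registry into its own node as soon as it starts.

Concurrency testing showed that this configuration only lowers the probability of the problem and does not eliminate it, for the following reason: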

protected void initEurekaServerContext() throws Exception {
    // ..... (a lot of code omitted)
    // Sync registry data from the peer nodes
    int registryCount = this.registry.syncUp();
    // Set the Eureka status to UP; this also starts a scheduled task that
    // evicts clients which have stopped sending heartbeats
    this.registry.openForTraffic(this.applicationInfoManager, registryCount);

    // ..... (a lot of code omitted)
    EurekaMonitors.registerAllStats();
}

@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    // Expected number of renewals per minute
    this.expectedNumberOfRenewsPerMin = count * 2;
    // Minimum number of renewals per minute (the renewal threshold)
    this.numberOfRenewsPerMinThreshold =
            (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
    logger.info("Got " + count + " instances from neighboring DS node");
    logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    // Set the instance status to UP
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    // Start the scheduled eviction task (runs every 60 seconds by default),
    // which removes instances whose leases have expired
    super.postInit();
}
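To make the arithmetic in openForTraffic concrete, here is a minimal standalone sketch of the two assignments. `RenewThresholdSketch` and its method names are hypothetical, and the value 0.85 for `getRenewalPercentThreshold()` is assumed to be the default rather than taken from this cluster's configuration.

```java
public class RenewThresholdSketch {

    // Assumed default for serverConfig.getRenewalPercentThreshold()
    static final double RENEWAL_PERCENT_THRESHOLD = 0.85;

    /** Each instance renews every 30 seconds, i.e. twice per minute. */
    public static int expectedRenewsPerMin(int syncedInstanceCount) {
        return syncedInstanceCount * 2;
    }

    /** Same cast-to-int truncation as in the openForTraffic code above. */
    public static int renewsPerMinThreshold(int syncedInstanceCount) {
        return (int) (expectedRenewsPerMin(syncedInstanceCount) * RENEWAL_PERCENT_THRESHOLD);
    }

    public static void main(String[] args) {
        int[] counts = {0, 5};
        for (int count : counts) {
            System.out.println("synced=" + count
                    + " expected=" + expectedRenewsPerMin(count)
                    + " threshold=" + renewsPerMinThreshold(count));
        }
    }
}
```

With 5 instances returned by syncUp(), the expected number is 10 and the threshold truncates to 8. With 0 instances, which is what a node sees when its peers have nothing to hand over, both values are 0.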

At first glance the code above looks fine. But consider the following situation:


Eureka Client   incremental (delta) sync
Eureka Server   syncing registry data from the cluster peers

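If the Eureka Client pulls data before the Eureka Server has finished syncing from its peers, the client receives incomplete or empty data. This still triggers the problem described above, only with a lower probability.

Complete solution

Modify the configuration file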


eureka:
  instance:
    hostname: server1
    initial-status: STARTING
  client:
    serviceUrl:
      defaultZone: http://server2:7011/eureka/,http://server3:7012/eureka/
    fetch-registry: true
    register-with-eureka: true

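Adding eureka.instance.initial-status: STARTING means that the Eureka Server does not proactively register itself when it has just started; it registers only after the data sync has completed.

Custom filter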


public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException {
    InstanceInfo myInfo = ApplicationInfoManager.getInstance().getInfo();
    InstanceStatus status = myInfo.getStatus();
    if (status != InstanceStatus.UP && response instanceof HttpServletResponse) {
        throw  new RuntimeException("Eureka Server status is not UP ,do not provide service ");
    }
    chain.doFilter(request, response);
}

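The custom filter makes the Eureka Server refuse requests while its status is not UP. The status changes to UP only after the Eureka Server has finished starting and finished syncing data, which prevents Eureka Clients from fetching incomplete data.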
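The gating decision itself can be isolated from the servlet API, which makes it easy to check without a running container. The sketch below is a simplified stand-in, not the filter used in this article: `StatusGateSketch`, its `Status` enum and `responseCodeFor` are hypothetical names, and answering with 503 instead of throwing a RuntimeException is an assumed variant of the same idea.

```java
import java.util.concurrent.atomic.AtomicReference;

public class StatusGateSketch {

    /** Simplified stand-in for the instance status used by the filter. */
    public enum Status { STARTING, UP, DOWN, OUT_OF_SERVICE }

    // A node begins in STARTING, mirroring initial-status: STARTING
    private final AtomicReference<Status> status = new AtomicReference<>(Status.STARTING);

    public void setStatus(Status newStatus) {
        status.set(newStatus);
    }

    /** Requests are allowed only once the node has been marked UP. */
    public boolean allowsRequests() {
        return status.get() == Status.UP;
    }

    /** 200 when the request may proceed, 503 (Service Unavailable) otherwise. */
    public int responseCodeFor(String path) {
        return allowsRequests() ? 200 : 503;
    }

    public static void main(String[] args) {
        StatusGateSketch gate = new StatusGateSketch();
        System.out.println("while STARTING: " + gate.responseCodeFor("/eureka/apps/delta"));

        // In the real server this transition happens at the end of openForTraffic
        gate.setStatus(Status.UP);
        System.out.println("once UP: " + gate.responseCodeFor("/eureka/apps/delta"));
    }
}
```

Whether a 503 or an exception is preferable depends on how the clients and any load balancer in front of the cluster react to each, so this is a design choice to validate in your own environment.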


@Bean
public CustomerStatusFilter statusFilter() {
    return new CustomerStatusFilter();
}

@Bean
public FilterRegistrationBean someFilterRegistration() {
    FilterRegistrationBean registration = new FilterRegistrationBean();
    registration.setFilter(statusFilter());
    registration.addUrlPatterns("/*");
    return registration;
}

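Drawback: with this filter in place, if the whole cluster is down and the nodes are started one at a time, by default it takes 150 seconds before a node can serve requests normally.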
