Notes on troubleshooting a concurrent mode failure: the investigation and the fix

阿新 • Published: 2017-07-10

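Background: a backend scheduled-task script runs every day at 05:30 and executes a batch job that scans the database and applies business logic.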

GC error log:

2017-07-05T05:30:54.408+0800: 518534.458: [CMS-concurrent-mark-start]
2017-07-05T05:30:55.279+0800: 518535.329: [GC 518535.329: [ParNew: 838848K->838848K(1118464K), 0.0000270 secs]
[CMS-concurrent-mark: 1.564/1.576 secs] [Times: user=10.88 sys=0.31, real=1.57 secs]
 (concurrent mode failure): 2720535K->2719116K(2796224K), 13.3742340 secs] 3559383K->2719116K(3914688K), [CMS Perm : 38833K->38824K(524288K)], 13.3748020 secs] [Times: user=16.19 sys=0.00, real=13.37 secs]
2017-07-05T05:31:08.659+0800: 518548.710: [GC [1 CMS-initial-mark: 2719116K(2796224K)] 2733442K(3914688K), 0.0065150 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2017-07-05T05:31:08.666+0800: 518548.716: [CMS-concurrent-mark-start]
2017-07-05T05:31:09.528+0800: 518549.578: [GC 518549.578: [ParNew: 838848K->19737K(1118464K), 0.0055800 secs] 3557964K->2738853K(3914688K), 0.0060390 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
[CMS-concurrent-mark: 1.644/1.659 secs] [Times: user=14.15 sys=0.84, real=1.66 secs]
2017-07-05T05:31:10.326+0800: 518550.376: [CMS-concurrent-preclean-start]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-abortable-preclean-start]

Reference: understanding-cms-gc-logs

From that article, the cause of a concurrent mode failure is: there was not enough space in the CMS generation to promote the worst case surviving young generation objects. We name this failure as “full promotion guarantee failure”
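To make the quoted condition concrete, the numbers from the failing collection in the log above can be plugged into a small calculation. This is only a simplified worst-case check written for illustration (the class and method names are hypothetical, and the JVM's real promotion-guarantee heuristics are more involved): it compares the space used in the young generation with the space left in the CMS (old) generation.

```java
public class PromotionHeadroom {
    /** Free space in a generation, in KB. */
    static long freeKb(long capacityKb, long usedKb) {
        return capacityKb - usedKb;
    }

    /** Occupancy of a generation as a percentage of its capacity. */
    static double occupancyPercent(long capacityKb, long usedKb) {
        return 100.0 * usedKb / capacityKb;
    }

    /** Simplified worst case: every live young object would have to fit in the old generation. */
    static boolean worstCasePromotionFits(long youngUsedKb, long oldFreeKb) {
        return youngUsedKb <= oldFreeKb;
    }

    public static void main(String[] args) {
        long youngUsedKb = 838848L;      // from "ParNew: 838848K->838848K(1118464K)"
        long oldUsedKb = 2720535L;       // from "2720535K->2719116K(2796224K)"
        long oldCapacityKb = 2796224L;

        long oldFreeKb = freeKb(oldCapacityKb, oldUsedKb);
        System.out.printf("old gen occupancy: %.2f%%%n", occupancyPercent(oldCapacityKb, oldUsedKb));
        System.out.printf("old gen free: %dK, young gen used: %dK%n", oldFreeKb, youngUsedKb);
        System.out.println("worst-case promotion fits: " + worstCasePromotionFits(youngUsedKb, oldFreeKb));
    }
}
```

With the old generation almost full and the young generation holding far more than the remaining old-generation space, the young collection could not proceed, and the JVM fell back to the 13-second stop-the-world collection seen in the log.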

The suggested fixes are: The concurrent mode failure can either be avoided increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value and setting UseCMSInitiatingOccupancyOnly to true.

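The second option has to be weighed carefully: if CMSInitiatingOccupancyFraction is set too low, CMS cycles may run too often and hurt performance. [See also the advice against using CMS with heaps under 3 GB: why no cms under 3G]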

Troubleshooting:

1 JVM options: -Xmx4096m -Xms2048m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:MaxTenuringThreshold=10 -XX:-UseAdaptiveSizePolicy -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:SurvivorRatio=3 -XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails. Nothing here looks obviously wrong.

2 The alert fires once a day at 05:30, so the scheduled task is the likely culprit.

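This was easy to track down. The service is a script service with almost no online business logic, so finding the scheduled task that runs at that time was enough to analyze the problem.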

Business code:

        int batchNumber = 1;
        int realCount = 0;
        int offset = 0;
        int limit = 999;
        int totalCount = 0;
        // Create a thread pool with 20 threads
        ExecutorService service = Executors.newFixedThreadPool(20);
        while (true) {
            LogUtils.info(logger, "{0},{1}->{2}", batchNumber, offset, (offset + limit));
            try {
                // Paged query
                Set<String> result = query(offset, limit);
                realCount = result.size();
                // Hand the fetched rows to the thread pool
                service.execute(new AAAAAAA(result, batchNumber));
            } catch (Exception e) {
                LogUtils.error(logger, e, "exception,batch:{0},offset:{1},count:{2}", batchNumber, offset, limit);
                break;
            }
            totalCount += realCount;
            if (realCount < limit) {
                break;
            }
            batchNumber++;
            offset += limit;
        }
        service.shutdown();

The code uses a fixed pool of 20 threads and loops, fetching 999 rows from the database on each pass and submitting them to the pool.

Analysis

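newFixedThreadPool is backed by an unbounded LinkedBlockingQueue, and my table holds more than 20 million rows. The loop keeps fetching pages and pushing them into the queue, and every queued task keeps its result set reachable until a worker gets to it. If pages are queued faster than the 20 workers drain them, which is what the log suggests, those objects survive young collections and pile up in the old generation. Honestly, it is lucky this did not blow up the heap outright???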
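The unbounded queue is easy to observe in isolation. The sketch below is illustrative only (the class name, the latch that keeps the workers busy, and the task counts are assumptions, not the production job): it blocks every worker of a fixed pool and then counts how many submitted tasks are left waiting in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class UnboundedQueueDemo {
    /** Submits blocked tasks to a fixed pool and returns how many are left waiting in its queue. */
    static int queuedAfterSubmitting(int threads, int tasks) {
        // Executors.newFixedThreadPool builds a ThreadPoolExecutor over a LinkedBlockingQueue.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(threads);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // stand-in for slow business logic
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The workers are all blocked, so everything else sits in the queue.
            return pool.getQueue().size();
        } finally {
            release.countDown(); // let the workers finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued tasks: " + queuedAfterSubmitting(2, 1000));
    }
}
```

Nothing pushes back on the submitting thread here: execute never blocks and never rejects, so the queue simply grows with the number of submissions. In the real job each queued task also holds its own page of rows.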

In the end I switched to:

BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);
ThreadPoolExecutor service = new ThreadPoolExecutor(20, 20, 1, TimeUnit.HOURS, queue, new ThreadPoolExecutor.CallerRunsPolicy());

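This uses a fixed-length queue with CallerRunsPolicy as the rejection policy: when a task can neither be run by a worker nor added to the queue, the submitting thread (here the main thread) calls its run method directly. While the main thread is busy with that task it stops fetching and submitting, so the backlog stays bounded. The trade-off is lower throughput, because the workers can sit idle until the main thread gets back to submitting.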
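A small, self-contained sketch of that behavior follows. It is a hedged illustration rather than the production code: the class name, the 1 ms sleep standing in for real work, and the pool and queue sizes are all assumptions. It submits more tasks than the pool can absorb and records how many finished, how many ran on the submitting thread, and the largest queue size seen.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerRunsDemo {
    /** Returns {completed tasks, tasks run on the caller thread, largest queue size observed}. */
    static int[] run(int threads, int queueCapacity, int tasks) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 1, TimeUnit.HOURS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
        Thread caller = Thread.currentThread();
        AtomicInteger completed = new AtomicInteger();
        AtomicInteger ranOnCaller = new AtomicInteger();
        int maxQueued = 0;
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                if (Thread.currentThread() == caller) {
                    ranOnCaller.incrementAndGet(); // rejected task, executed by the submitter
                }
                try {
                    Thread.sleep(1); // stand-in for real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
            maxQueued = Math.max(maxQueued, pool.getQueue().size());
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return new int[] { completed.get(), ranOnCaller.get(), maxQueued };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run(2, 4, 200);
        System.out.println("completed=" + r[0] + ", ranOnCaller=" + r[1] + ", maxQueued=" + r[2]);
    }
}
```

No task is dropped, and the queue never holds more than its capacity, which is the property that keeps the job's memory use flat regardless of how many rows the table has.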

I'll check tomorrow to see how it works out.
