Spark 2.2 (38): Investigating the high memory consumption of agg and dropDuplicates in Spark Structured Streaming before 2.4 (Memory issue with spark structured streaming)

阿新 • Published: 2018-12-26

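The article "Memory usage of state in Spark Structured Streaming" explains how Spark allocates memory and discusses the impact of HDFSBackedStateStoreProvider keeping multiple versions of state. On Stack Overflow, other users have run into memory problems with Structured Streaming and analyzed them in "Memory issue with spark structured streaming". In addition, the list of fixed issues on the official Spark site contains the following two items: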

1) Split out the maximum number of state versions retained in memory in HDFSBackedStateStoreProvider (Split out min retain version of state for memory in HDFSBackedStateStoreProvider)

Problem description:

HDFSBackedStateStoreProvider has only one configuration for minimum versions to retain of state which applies to both memory cache and files. As default version of "spark.sql.streaming.minBatchesToRetain" is set to high (100), which doesn't require strictly 100x of memory, but I'm seeing 10x ~ 80x of memory consumption for various workloads. In addition, in some cases, requiring 2x of memory is even unacceptable, so we should split out configuration for memory and let users adjust to trade-off memory usage vs cache miss.

In normal case, default value '2' would cover both cases: success and restoring failure with less than or around 2x of memory usage, and '1' would only cover success case but no longer require more than 1x of memory. In extreme case, user can set the value to '0' to completely disable the map cache to maximize executor memory.
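The trade-off in the description above can be illustrated with a toy model. This is only a sketch and not Spark's implementation: `VersionCache` and `run_batches` are hypothetical names, a plain dict stands in for the files on HDFS, and each state version is a small Python dict. In Spark 2.4 the setting added by this fix is `spark.sql.streaming.maxBatchesToRetainInMemory` (default 2); the `max_versions_in_memory` argument here plays that role.

```python
class VersionCache:
    """Toy stand-in for an in-memory map of loaded state versions (not Spark code)."""

    def __init__(self, max_versions_in_memory):
        self.max_versions = max_versions_in_memory
        self.loaded = {}   # version number -> state dict kept in memory
        self.misses = 0    # how often a version had to be re-read from files

    def put(self, version, state):
        if self.max_versions <= 0:
            return  # cache disabled: nothing is kept in memory
        self.loaded[version] = state
        while len(self.loaded) > self.max_versions:
            del self.loaded[min(self.loaded)]  # evict the oldest version

    def get(self, version, files):
        if version in self.loaded:
            return self.loaded[version]
        self.misses += 1
        state = files[version]  # stands in for reloading from checkpoint files
        self.put(version, state)
        return state


def run_batches(max_versions_in_memory, batches=5):
    """Run `batches` successful micro-batches; each one builds on the previous version."""
    cache = VersionCache(max_versions_in_memory)
    files = {}  # stands in for files on HDFS; every version is always written here
    for version in range(1, batches + 1):
        previous = cache.get(version - 1, files) if version > 1 else {}
        state = dict(previous)
        state[f"key{version}"] = version
        files[version] = state
        cache.put(version, state)
    return cache, files


if __name__ == "__main__":
    for limit in (2, 1, 0):
        cache, files = run_batches(limit)
        normal_misses = cache.misses
        cache.get(4, files)  # a retry of batch 5 needs version 4 again
        retry_misses = cache.misses - normal_misses
        print(limit, sorted(cache.loaded), normal_misses, retry_misses)
```

In this model a limit of 2 serves both the normal path and a retry of the last batch from memory, a limit of 1 serves only the normal path, and a limit of 0 reloads from files on every batch, which mirrors the three cases in the quoted description.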

Fix status:

See the pull request "[SPARK-24717][SS] Split out max retain version of state for memory in HDFSBackedStateStoreProvider #21700" and the corresponding issue "Split out min retain version of state for memory in HDFSBackedStateStoreProvider".

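The directory layout in which HDFSBackedStateStoreProvider stores state is described in that article, which includes a figure of the state storage directory structure. The article is part of a series that is worth reading in full.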

2) Remove redundant key data from value in streaming aggregation

Problem description:

Key/Value of state in streaming aggregation is formatted as below:

key: UnsafeRow containing group-by fields
value: UnsafeRow containing key fields and another fields for aggregation results

which data for key is stored to both key and value.

This is to avoid doing projection row to value while storing, and joining key and value to restore origin row to boost performance, but while doing a simple benchmark test, I found it not much helpful compared to "project and join". (will paste test result in comment)

So I would propose a new option: remove redundant in stateful aggregation. I'm avoiding to modify default behavior of stateful aggregation, because state value will not be compatible between current and option enabled.
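The two state layouts discussed above can be sketched with plain tuples standing in for UnsafeRow. This is an illustration under assumptions, not Spark code: the function names, the sample row, and the convention that group-by fields come first in the row are all hypothetical.

```python
def store_with_key_in_value(row, n_key_fields):
    """Existing layout: the value is the whole row, so key fields are stored twice."""
    key = row[:n_key_fields]
    value = row
    return key, value


def store_without_key_in_value(row, n_key_fields):
    """Proposed option: project the row so the value holds only aggregation results."""
    key = row[:n_key_fields]
    value = row[n_key_fields:]
    return key, value


def restore_row(key, value):
    """Join key and value to rebuild the original row under the proposed layout."""
    return key + value


if __name__ == "__main__":
    # Two group-by fields followed by two aggregation results (made-up data).
    row = ("user-42", "2018-12-26", 17, 3.5)

    key_a, value_a = store_with_key_in_value(row, 2)
    key_b, value_b = store_without_key_in_value(row, 2)

    print(len(key_a) + len(value_a))   # fields stored with the key repeated
    print(len(key_b) + len(value_b))   # fields stored after removing the repeat
    print(restore_row(key_b, value_b) == row)
```

The second layout pays for a projection on write and a join on read, which is the "project and join" cost the issue description benchmarks. It also shows why the two layouts are incompatible: a value written by one function cannot be read back correctly by the other.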

Fix status:

See the pull request "[SPARK-24763][SS] Remove redundant key data from value in streaming aggregation #21733" and the corresponding issue "Remove redundant key data from value in streaming aggregation".
