[Original] Uncle's Experience Sharing (39): Cascading behavior of Spark cache unpersist
Problem: in Spark, suppose there are two DataFrames (or Datasets) where DataFrameA depends on DataFrameB, and both have been cached. After unpersisting DataFrameB, DataFrameA's cache is invalidated as well. The official explanation is as follows:
When invalidating a cache, we invalidate other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table itself has been dropped, all caches that use this table should be invalidated or refreshed.
However, in other cases, like when a user simply wants to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation.
The previous default was the regular mode. In this mode, to guarantee that cached data stays up to date (never stale), unpersisting a cache cascades: every other cache that depends on it, directly or indirectly, is cleared as well, as the sketch below shows.
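A minimal sketch of the cascading behavior; the object name, the local[*] master and the toy data are assumptions for the demo, and Dataset.storageLevel reports StorageLevel.NONE for a plan that is no longer cached:

    import org.apache.spark.sql.SparkSession

    object CacheCascadeDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cache-cascade-demo")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // DataFrameB: the base DataFrame, cached and materialized
        val dfB = Seq(1, 2, 3).toDF("id").cache()
        dfB.count()

        // DataFrameA is derived from (depends on) DataFrameB
        val dfA = dfB.filter($"id" > 1).cache()
        dfA.count()

        // Drop dfB's cache. Under the regular (cascading) mode described
        // above, dfA's cache entry is invalidated as well; with the
        // non-cascading mode of SPARK-24596 (Spark 2.4+), Dataset.unpersist
        // no longer drops dependent caches and dfA stays cached.
        dfB.unpersist()

        // StorageLevel.NONE here means dfA's cache was cascaded away
        println(dfA.storageLevel)

        spark.stop()
      }
    }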
The cache() operation on a DataFrame/Dataset uses the MEMORY_AND_DISK storage level by default; unless MEMORY_ONLY was specified by hand (and memory was confirmed to be sufficient), unpersisting the earlier caches looks unnecessary.
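Where memory-only caching really is wanted, the level can be passed explicitly instead of relying on the cache() default. A small sketch under the same illustrative assumptions as above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object StorageLevelDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("storage-level-demo")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // cache() on a DataFrame/Dataset is shorthand for
        // persist(StorageLevel.MEMORY_AND_DISK); to keep blocks in memory
        // only (no disk spill), pass the level explicitly:
        val df = Seq(1, 2, 3).toDF("id").persist(StorageLevel.MEMORY_ONLY)
        df.count()               // materialize the cache
        println(df.storageLevel) // confirms the memory-only level

        df.unpersist()           // free the memory once the data is unneeded
        spark.stop()
      }
    }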
References:
https://issues.apache.org/jira/browse/SPARK-21478
https://issues.apache.org/jira/browse/SPARK-24596
https://issues.apache.org/jira/browse/SPARK-21579