A pitfall in HDP Hive StorageHandler predicate pushdown
Keywords: hdp, hive, StorageHandler
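Anyone familiar with Hive's StorageHandler knows that it is the extension point for adapting Hive to different storage backends. A handler can also take on the role of HiveStoragePredicateHandler to push predicates down to the underlying storage. The interface is as follows: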
/**
 * HiveStoragePredicateHandler is an optional companion to {@link
 * HiveStorageHandler}; it should only be implemented by handlers which
 * support decomposition of predicates being pushed down into table scans.
 */
public interface HiveStoragePredicateHandler {

  /**
   * Gives the storage handler a chance to decompose a predicate. The storage
   * handler should analyze the predicate and return the portion of it which
   * cannot be evaluated during table access. For example, if the original
   * predicate is <code>x = 2 AND upper(y)='YUM'</code>, the storage handler
   * might be able to handle <code>x = 2</code> but leave the "residual"
   * <code>upper(y)='YUM'</code> for Hive to deal with. The breakdown
   * need not be non-overlapping; for example, given the
   * predicate <code>x LIKE 'a%b'</code>, the storage handler might
   * be able to evaluate the prefix search <code>x LIKE 'a%'</code>, leaving
   * <code>x LIKE '%b'</code> as the residual.
   *
   * @param jobConf contains a job configuration matching the one that
   * will later be passed to getRecordReader and getSplits
   *
   * @param deserializer deserializer which will be used when
   * fetching rows
   *
   * @param predicate predicate to be decomposed
   *
   * @return decomposed form of predicate, or null if no pushdown is
   * possible at all
   */
  public DecomposedPredicate decomposePredicate(
    JobConf jobConf,
    Deserializer deserializer,
    ExprNodeDesc predicate);

  /**
   * Struct class for returning multiple values from decomposePredicate.
   */
  public static class DecomposedPredicate {
    /**
     * Portion of predicate to be evaluated by storage handler. Hive
     * will pass this into the storage handler's input format.
     */
    public ExprNodeGenericFuncDesc pushedPredicate;

    /**
     * Serialized format for filter
     */
    public Serializable pushedPredicateObject;

    /**
     * Portion of predicate to be post-evaluated by Hive for any rows
     * which are returned by storage handler.
     */
    public ExprNodeGenericFuncDesc residualPredicate;
  }
}
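The core method is decomposePredicate, which returns a DecomposedPredicate object. Its Serializable pushedPredicateObject field is very flexible: you can use it to pass any pushdown result, configuration, or even function declarations extracted while parsing the expression tree over to the InputFormat side, which then decides how to read the data. However, in HDP 2.2.6-2800 (Hive 0.14.0.2.2.6-2800) and HDP 2.4.2.0-258 (Hive 1.2.1000.2.4.2.0-258), my tests show that the other two fields of DecomposedPredicate take effect, while pushedPredicateObject never arrives: on the InputFormat side it is always null.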
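To make the idea of an "anything Serializable" payload concrete, here is a minimal sketch using only the JDK (Java 8+ for Base64). It does not use Hive's API and does not claim to reproduce how Hive transports pushedPredicateObject internally. A plain Map stands in for the JobConf, and the ScanHint class, the key name and the helper methods are all hypothetical names made up for this illustration. The reader side checks for null and falls back, which is the case the InputFormat hits on the affected HDP builds.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class PushedObjectSketch {

    // Hypothetical configuration key; NOT a Hive constant.
    public static final String PUSHED_OBJECT_KEY = "example.pushed.predicate.object";

    // Hypothetical payload: whatever the handler learned while decomposing the predicate.
    public static class ScanHint implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String column;
        public final String operator;
        public final String value;

        public ScanHint(String column, String operator, String value) {
            this.column = column;
            this.operator = operator;
            this.value = value;
        }
    }

    // "Handler side": turn the Serializable payload into a string that fits in a config map.
    public static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // "InputFormat side": read the payload back, or return null if nothing was pushed.
    public static ScanHint readHint(Map<String, String> conf)
            throws IOException, ClassNotFoundException {
        String encoded = conf.get(PUSHED_OBJECT_KEY);
        if (encoded == null) {
            return null; // nothing arrived; the caller should fall back to a full scan
        }
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (ScanHint) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> conf = new HashMap<>(); // stand-in for a job configuration
        conf.put(PUSHED_OBJECT_KEY, encode(new ScanHint("x", "=", "2")));

        ScanHint hint = readHint(conf);
        if (hint == null) {
            System.out.println("no pushed object, falling back to a full scan");
        } else {
            System.out.println(hint.column + " " + hint.operator + " " + hint.value);
        }
    }
}
```

Whatever the payload type is, the reader should treat a missing object as "no pushdown" rather than as an error, so that a build that drops the object still returns correct (if slower) results.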
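I single-stepped through the Hive 0.14.0.2.2.6.0 source, and there pushedPredicateObject works. I then built that source locally, uploaded the result to the test server, replaced the original hive-exec jar and restarted HiveServer2. To my surprise, pushedPredicateObject worked there as well. HDP has a great many minor build versions, and I am not sure what the number after the hyphen stands for (a revision?), so for now I cannot locate the source that exactly matches the shipped binaries. All I can say is that the closest source I could find, 2.2.6.0, works fine when compiled and packaged by hand.
For now I can only treat this as an unexplained pitfall in HDP. Anyone doing pushdown optimization on HDP's Hive should watch out for it.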