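[Original] Troubleshooting Notes (17): Spark occasionally throws NullPointerException when querying ORC-format data
阿新 • Published: 2018-12-19
Querying ORC-format data with Spark sometimes fails with this error: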
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more
Tracing into the code:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
static enum SplitStrategyKind {
  HYBRID,
  BI,
  ETL
}
...
Context(Configuration conf) {
  this.conf = conf;
  minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
  maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
  String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
  if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
    splitStrategyKind = SplitStrategyKind.HYBRID;
  } else {
    LOG.info("Enforcing " + ss + " ORC split strategy");
    splitStrategyKind = SplitStrategyKind.valueOf(ss);
  }
...
switch(context.splitStrategyKind) {
  case BI:
    // BI strategy requested through config
    splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas, covered);
    break;
  case ETL:
    // ETL strategy requested through config
    splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas, covered);
    break;
  default:
    // HYBRID strategy
    if (avgFileSize > context.maxSize) {
      splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas, covered);
    } else {
      splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas, covered);
    }
    break;
}
org.apache.hadoop.hive.conf.HiveConf.ConfVars
HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
    "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
    " as opposed to query execution (split generation does not read or cache file footers)." +
    " ETL strategy is used when spending little more time in split generation is acceptable" +
    " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
    " based on heuristics."),
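As shown, hive.exec.orc.split.strategy defaults to HYBRID. Under HYBRID, when the following condition is not met (that is, when the average file size does not exceed maxSize):
if (avgFileSize > context.maxSize) {
the BI strategy is chosen:
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas, covered);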
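To make the selection rule easier to follow, here is a minimal standalone sketch of the same decision in plain Java. It is not the Hive implementation: the class SplitStrategyChooser, the methods parseKind and effectiveStrategy, and the 256 MB maxSize used in main are hypothetical names and values chosen for illustration, and it only models which strategy is picked, not how splits are generated.

```java
// Simplified model of the strategy selection quoted above (illustration only).
class SplitStrategyChooser {
    enum SplitStrategyKind { HYBRID, BI, ETL }

    // Mirrors the Context constructor: null or "HYBRID" means HYBRID,
    // any other value is resolved through valueOf.
    static SplitStrategyKind parseKind(String configured) {
        if (configured == null || configured.equals(SplitStrategyKind.HYBRID.name())) {
            return SplitStrategyKind.HYBRID;
        }
        return SplitStrategyKind.valueOf(configured);
    }

    // Returns the strategy that would actually be instantiated (BI or ETL).
    static SplitStrategyKind effectiveStrategy(String configured, long avgFileSize, long maxSize) {
        switch (parseKind(configured)) {
            case BI:
                return SplitStrategyKind.BI;   // forced through config
            case ETL:
                return SplitStrategyKind.ETL;  // forced through config
            default:
                // HYBRID: large files go to ETL, everything else to BI
                return avgFileSize > maxSize ? SplitStrategyKind.ETL : SplitStrategyKind.BI;
        }
    }

    public static void main(String[] args) {
        long maxSize = 256L * 1024 * 1024;      // hypothetical max split size
        long smallAvg = 10L * 1024 * 1024;      // average file size below maxSize
        long largeAvg = 512L * 1024 * 1024;     // average file size above maxSize

        System.out.println(effectiveStrategy(null, smallAvg, maxSize));   // unset, small files
        System.out.println(effectiveStrategy(null, largeAvg, maxSize));   // unset, large files
        System.out.println(effectiveStrategy("ETL", smallAvg, maxSize));  // forced ETL
    }
}
```

Under this model, a table made of small files with the setting left at its default always ends up on the BI path, which is consistent with the error appearing only for some queries.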
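The NullPointerException is thrown from BISplitStrategy. I have not yet looked closely at why this class fails, but the problem can be avoided by changing the setting so that the ETL strategy is always used: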
set hive.exec.orc.split.strategy=ETL
This works around the problem for now; to be continued.