Building an index on LZO-compressed files in Hive to enable parallel processing
1. First, make sure the index is built:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.10.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/flog
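The DistributedLzoIndexer job writes a small .index file next to each .lzo file, recording the byte offset of every compressed block; the LZO input formats can then align MapReduce splits to those block boundaries instead of handing the whole file to a single mapper. A conceptual sketch of that alignment (plain Python, not the hadoop-lzo implementation; the offsets are made up):

```python
def align_splits(block_offsets, target_split_size):
    """Given sorted compressed-block offsets from an LZO index,
    return split start offsets aligned to block boundaries."""
    splits = [0]
    for off in block_offsets:
        # Start a new split at the first block boundary past the target size.
        if off - splits[-1] >= target_split_size:
            splits.append(off)
    return splits

# Hypothetical index for a ~100 MB .lzo file, target split size 25 MB.
offsets = [30_000_000, 55_000_000, 80_000_000]
print(align_splits(offsets, 25_000_000))
# [0, 30000000, 55000000, 80000000]
```

Without the index no block offsets are known, so the only safe split is the entire file, which is exactly why un-indexed LZO files are processed by a single mapper.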
2. When creating a new external table in Hive, the statement is:
CREATE EXTERNAL TABLE foo (
  columnA string,
  columnB string
)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/hive/tables/foo';
3. For a table that already exists, alter it with:
ALTER TABLE foo
SET FILEFORMAT
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
4. After the ALTER TABLE, data already loaded into the table must be reloaded and re-indexed; otherwise the files still cannot be split.
5. To run a MapReduce job written with Hadoop Streaming:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar \
  -file /home/pyshell/map.py -file /home/pyshell/red.py \
  -mapper /home/pyshell/map.py -reducer /home/pyshell/red.py \
  -input /aojianlog/20120304/gold/gold_38_3.csv.lzo \
  -output /aojianresult/gold38 \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -jobconf mapred.output.compress=true \
  -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
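The contents of map.py and red.py aren't shown here; a minimal word-count-style pair in the Hadoop Streaming convention (read lines from stdin, emit tab-separated key/value pairs on stdout, with keys arriving at the reducer already sorted) might look like this. Purely illustrative, not the actual scripts:

```python
from itertools import groupby

def mapper(lines):
    """map.py logic: emit "word\t1" for each word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """red.py logic: sum counts per key; streaming sorts keys between phases."""
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the map -> sort/shuffle -> reduce pipeline locally.
mapped = sorted(mapper(["hello world", "hello lzo"]))
print(list(reducer(mapped)))
# ['hello\t2', 'lzo\t1', 'world\t1']
```

In a real job each script reads sys.stdin and prints to stdout; the sorted() call above stands in for the framework's shuffle phase.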
Note: without the -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat option, the map phase will not split the input file.
If only -jobconf mapred.output.compress=true is set and -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec is omitted, the reduce output files are written in .lzo_deflate format (a raw LZO stream without the lzop file header), not .lzo.
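The extension difference comes from each codec's default extension (in hadoop-lzo, LzopCodec uses .lzo while LzoCodec uses .lzo_deflate, as I recall it; verify against your hadoop-lzo version). A toy lookup mimicking how the output filename gets its suffix:

```python
# Hedged sketch: default extensions per codec class, values from memory.
DEFAULT_EXTENSION = {
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
    "com.hadoop.compression.lzo.LzoCodec": ".lzo_deflate",
}

def output_name(basename, codec_class):
    """Append the codec's default extension to a reduce output file name."""
    return basename + DEFAULT_EXTENSION.get(codec_class, "")

print(output_name("part-00000", "com.hadoop.compression.lzo.LzopCodec"))
# part-00000.lzo
```

Only the lzop-format .lzo output can later be indexed with DistributedLzoIndexer and fed back in as splittable input.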