Importing a MySQL table with hundreds of millions of rows into HDFS in batches with Sqoop
A Xin • Published: 2019-01-25
Because the table is too large, a single Sqoop job would stall or run out of memory, so a script imports the data into HDFS in batches and then loads each batch into a Hive table.
The shell script is as follows:
#!/bin/bash
source /etc/profile

host=127.0.0.1

# Import 100 batches of 100,000 rows each, keyed on person_id.
for ((i = 1; i <= 100; i++))
do
    start=$(( (i - 1) * 100000 + 1 ))
    end=$(( i * 100000 ))
    sql="select person_id,capture_time,write_time,capture_resource_id,major_capture_image_url,minor_capture_image_url,sex,age,orientation,glasses,knapsack,bag,messenger_bag,shoulder_bag,umbrella,hair,hat,mask,upper_color,upper_type,upper_texture,bottom_color,bottom_type,trolley_case,barrow,baby,feature_type,feature_code from big_data.pedestrian_sm where person_id>=${start} and person_id<=${end} and \$CONDITIONS"

    # Import one person_id window into its own HDFS directory.
    sqoop import --connect jdbc:mysql://${host}:3306/big_data \
        --username root \
        --password 123456 \
        --query "${sql}" \
        --fields-terminated-by '\001' \
        --delete-target-dir \
        --target-dir hdfs://master:9000/tmp/big_data/pedestrian_sm/${start}-${end}/ \
        --split-by person_id \
        -m 8
    echo "Sqoop import from: ${start} to: ${end} success...................................."

    # Load the batch into the Hive table (LOAD DATA INPATH moves the files out of /tmp).
    hive -e "
        use big_data;
        load data inpath 'hdfs://master:9000/tmp/big_data/pedestrian_sm/${start}-${end}' into table big_data.pedestrian_sm;
    "
    echo "Hive load from: ${start}-${end} success...................................."
done
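Note that `load data inpath` expects the Hive table to already exist, with a field delimiter matching the `--fields-terminated-by '\001'` passed to Sqoop. A sketch of such a DDL might look like the following (the column types are assumptions, since the MySQL schema is not shown in the post):

```sql
-- Hypothetical DDL: column types are guesses; what matters is the delimiter.
CREATE TABLE IF NOT EXISTS big_data.pedestrian_sm (
    person_id                BIGINT,
    capture_time             STRING,
    write_time               STRING,
    capture_resource_id      BIGINT,
    major_capture_image_url  STRING,
    minor_capture_image_url  STRING,
    sex                      STRING,
    age                      INT
    -- ... remaining attribute columns, in the same order as the Sqoop query
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
```

Since '\001' (Ctrl-A) is Hive's default field delimiter for text tables, the `ROW FORMAT` clause could even be omitted; it is spelled out here to make the match with the Sqoop flag explicit.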
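The per-batch person_id window used in the loop can be checked in isolation. This small sketch (the helper name `batch_range` is hypothetical, not part of the original script) prints the inclusive start and end IDs for a given batch number and batch size:

```shell
#!/bin/bash
# batch_range BATCH_NO BATCH_SIZE -> prints "start end" for that batch.
# Batch 1 covers IDs 1..BATCH_SIZE, batch 2 covers BATCH_SIZE+1..2*BATCH_SIZE, etc.
batch_range() {
    local i=$1 size=$2
    local start=$(( (i - 1) * size + 1 ))
    local end=$(( i * size ))
    echo "${start} ${end}"
}

batch_range 1 100000    # 1 100000
batch_range 2 100000    # 100001 200000
batch_range 100 100000  # 9900001 10000000
```

Because both bounds are inclusive and the next batch starts at the previous `end + 1`, consecutive windows neither overlap nor leave gaps, so no row is imported twice or skipped.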