
Sqoop: Importing a MySQL Table with Over 100 Million Rows to HDFS in Batches

Because the table holds too much data, a single Sqoop job either stalls or runs out of memory. Instead, a shell script imports the data to HDFS in batches and then loads each batch into the Hive table.
The shell script is as follows:

#!/bin/bash
source /etc/profile

host=127.0.0.1

for((i=1; i<=100; i++))
do   
    start=$(((${i} - 1) * 100000 + 1))
    end=$((${i} * 100000))

    sql="select person_id,capture_time,write_time,capture_resource_id,major_capture_image_url,minor_capture_image_url,sex,age,orientation,glasses,knapsack, bag,messenger_bag,shoulder_bag,umbrella,hair,hat,mask,upper_color,upper_type,upper_texture,bottom_color,bottom_type,trolley_case,barrow,baby,feature_type,feature_code from big_data.pedestrian_sm where person_id>=${start}
and person_id<=${end} and \$CONDITIONS"
; sqoop import --connect jdbc:mysql://${host}:3306/big_data \ --username root \ --password 123456 \ --query "${sql}" \ --fields-terminated-by '\001' \ --delete-target-dir \ --target-dir hdfs://hsmaster:9000/tmp/big_data/pedestrian_sm/${start}
-${end}/ \ --split-by person_id \ -m 8 echo Sqoop import from: ${start} to: ${end} success.................................... hive -e " use big_data; load data inpath 'hdfs://master:9000/tmp/big_data/pedestrian_sm/${start}-${end}' into table big_data.pedestrian_sm; " echo
Hive load from: ${start}-${end} success.................................... done
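
The load data inpath step assumes that big_data.pedestrian_sm already exists in Hive as a text-format table whose field delimiter matches the '\001' passed to --fields-terminated-by. The post does not show that table's definition; the sketch below is a hypothetical one-time setup, wrapped in hive -e the same way the loop does, with an abbreviated column list and assumed column types:

#!/bin/bash
# Hypothetical one-time setup before running the import loop.
# Column types are assumptions, not taken from the original post; '\001' must match
# the Sqoop --fields-terminated-by setting (it is also Hive's default text delimiter).
hive -e "
create database if not exists big_data;
create table if not exists big_data.pedestrian_sm (
    person_id bigint,
    capture_time string,
    write_time string,
    capture_resource_id string,
    sex string,
    age int
    -- ... remaining columns from the select list above ...
)
row format delimited fields terminated by '\001'
stored as textfile;
"

Writing each batch to its own ${start}-${end} directory, combined with --delete-target-dir, also makes the loop easy to restart: a failed range can be re-imported and re-loaded without touching the batches that already succeeded.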