
Incremental extraction from MySQL to Hive with Oozie 4.3.0 + Sqoop 1.4.6


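1. Prepare the data source

The source is the MySQL table bigdata, which holds the following data:

[Image: rows of the MySQL table bigdata]

2. Prepare the target table

The target is the table bigdata in the Hive database dw_stg.

Its storage path is hdfs://localhost:9000/user/hive/warehouse/dw_stg.db/bigdata

The Hive create-table statement is as follows: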

create external table bigdata(
class_id string comment 'course id',
class_name string comment 'course name',
class_month int comment 'course duration',
teacher string comment 'course teacher',
update_time string comment 'update date'
) partitioned by (dt string comment 'year, month, day')
row format delimited
fields terminated by '\001'
lines terminated by '\n'
stored as textfile;

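Notes: the field delimiter is \001, the line delimiter is \n, and the table has an added partition column dt in the format yyyyMMdd.

Create the table bigdata in Hive with the statement above.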
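To make the \001 delimiter concrete, here is a minimal Python sketch of how one line of the imported text file maps back to the table columns. The sample values are made up for illustration; only the column names and the delimiter come from the table definition above.

```python
# \001 (Ctrl-A) is the same character as "\x01" in Python.
FIELD_DELIMITER = "\x01"

# Column order as declared in the Hive table (dt is a partition column
# and lives in the directory name, not in the data file).
COLUMNS = ["class_id", "class_name", "class_month", "teacher", "update_time"]


def parse_row(line: str) -> dict:
    """Split one text line on \\001 and pair the values with column names."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    return dict(zip(COLUMNS, values))


# Hypothetical sample row, not taken from the real table.
sample = FIELD_DELIMITER.join(
    ["c001", "Hadoop basics", "3", "Zhang", "2018-01-23 12:30:00"]
) + "\n"

row = parse_row(sample)
print(row["class_name"], row["update_time"])
```

Because \001 almost never appears in ordinary text, values containing commas or tabs do not break the split.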

3. Write the Oozie script files

3.1 Configure job.properties

# Common variables
timezone=Asia/Shanghai
jobTracker=dwtest-name1:8032
nameNode=hdfs://dwtest-name1:9000
queueName=default
warehouse=/user/hive/warehouse
dw_stg=${warehouse}/dw_stg.db
dw_mdl=${warehouse}/dw_mdl.db
dw_dm=${warehouse}/dw_dm.db
app_home=/user/oozie/app
oozie.use.system.libpath=true
# coordinator
oozie.coord.application.path=${nameNode}${app_home}/bigdata/coordinator.xml
workflow=${nameNode}${app_home}/bigdata
# source
connection=jdbc:mysql://192.168.1.100:3306/test
username=test
password=test
source_table=bigdata
# target
target_path=${dw_stg}/bigdata
# Job start time and end time
start=2018-01-24T10:00+0800
end=2199-01-01T01:00+0800
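Several of these properties reference each other through ${...}, so it helps to see what they expand to. The following Python sketch resolves the references for a subset of the properties. It is only an illustration of the substitution, not Oozie's own resolver, and the resolve helper is a hypothetical name.

```python
import re

# A subset of job.properties, copied from the file above.
properties = {
    "nameNode": "hdfs://dwtest-name1:9000",
    "warehouse": "/user/hive/warehouse",
    "dw_stg": "${warehouse}/dw_stg.db",
    "app_home": "/user/oozie/app",
    "workflow": "${nameNode}${app_home}/bigdata",
    "target_path": "${dw_stg}/bigdata",
}


def resolve(name: str, props: dict) -> str:
    """Recursively expand ${var} references (no cycle detection in this sketch)."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda match: resolve(match.group(1), props),
        props[name],
    )


target_path = resolve("target_path", properties)
workflow = resolve("workflow", properties)
print(target_path)
print(workflow)
```

Note that target_path expands to a path without the nameNode prefix, which is why workflow.xml prepends ${nameNode} in its prepare step.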

3.2 Configure coordinator.xml

<coordinator-app name="coord_bigdata" frequency="${coord:days(1)}" start="${start}" end="${end}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.5">
    <action>
        <workflow>
            <app-path>${workflow}</app-path>
            <configuration>
                <property>
                    <name>startTime</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), 'yyyy-MM-dd 00:00:00')}</value>
                </property>
                <property>
                    <name>endTime</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), 0, 'DAY'), 'yyyy-MM-dd 00:00:00')}</value>
                </property>
                <property>
                    <name>outputPath</name>
                    <value>${target_path}/dt=${coord:formatTime(coord:dateOffset(coord:nominalTime(), 0, 'DAY'), 'yyyyMMdd')}/</value>
                </property>
            </configuration>
        </workflow>

    </action>
</coordinator-app>

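Notes:

The incremental start time startTime is the day before the nominal time. For the run on 2018-01-24 the value is 2018-01-23 00:00:00.

${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), 'yyyy-MM-dd 00:00:00')}

The incremental end time endTime is the nominal day itself. For the same run the value is 2018-01-24 00:00:00.

${coord:formatTime(coord:dateOffset(coord:nominalTime(), 0, 'DAY'), 'yyyy-MM-dd 00:00:00')}

The output path must include the partition field dt. For the same run the value is /user/hive/warehouse/dw_stg.db/bigdata/dt=20180124/

${target_path}/dt=${coord:formatTime(coord:dateOffset(coord:nominalTime(), 0, 'DAY'), 'yyyyMMdd')}/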
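The date arithmetic behind these three expressions can be sketched in plain Python. This is only a rough model of the calculation: incremental_window is a hypothetical helper, and it ignores how Oozie handles time zones when it evaluates coord:nominalTime().

```python
from datetime import datetime, timedelta


def incremental_window(nominal_time: datetime, target_path: str) -> dict:
    """Model the startTime, endTime and outputPath values for one run."""
    # The 'yyyy-MM-dd 00:00:00' pattern discards the time of day,
    # so truncate the nominal time to midnight first.
    day_start = nominal_time.replace(hour=0, minute=0, second=0, microsecond=0)
    previous_day = day_start - timedelta(days=1)
    partition = day_start.strftime("%Y%m%d")
    return {
        "startTime": previous_day.strftime("%Y-%m-%d 00:00:00"),
        "endTime": day_start.strftime("%Y-%m-%d 00:00:00"),
        "outputPath": target_path + "/dt=" + partition + "/",
    }


window = incremental_window(
    datetime(2018, 1, 24, 10, 0),
    "/user/hive/warehouse/dw_stg.db/bigdata",
)
for name, value in window.items():
    print(name, "=", value)
```

The window is half-open: rows with update_time equal to startTime are included and rows equal to endTime are left for the next run, so consecutive daily runs neither overlap nor leave gaps.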

3.3 Configure workflow.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="wf_bigdata">
    <start to="sqoop-node"/>

    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}${outputPath}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>${connection}</arg>
            <arg>--username</arg>
            <arg>${username}</arg>
            <arg>--password</arg>
            <arg>${password}</arg>
            <arg>--verbose</arg>
            <arg>--query</arg>
            <arg>select class_id,class_name,class_month,teacher,update_time from ${source_table} where $CONDITIONS and update_time &gt;= '${startTime}' and update_time &lt; '${endTime}'</arg>
            <arg>--fields-terminated-by</arg>
            <arg>\001</arg>
            <arg>--target-dir</arg>
            <arg>${outputPath}</arg>
            <arg>-m</arg>
            <arg>1</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sqoop free form failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

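4. Upload the scripts

Upload the three files above to the Oozie app directory on HDFS, as shown below:

[Image: the uploaded files in the HDFS app directory]

5. Run the job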

oozie job -config job.properties -run

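[Image: output of the job submission]

6. Check the job status

[Image: job status]

7. Query the Hive table

Run msck repair table bigdata to repair the partitions automatically, then query the table. The results were correct in testing.

[Image: Hive query result]

8. Pitfalls encountered during development: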

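8.1 The field delimiter in workflow.xml must not be wrapped in single quotes. Correct: <arg>\001</arg>. Wrong: <arg>'\001'</arg>.

8.2 Because the Sqoop command is configured inside an XML file, using the less-than sign "<" in a query condition causes an error: the XML file fails validation.

The fix is to write &lt; instead of "<". For consistency, it is best to also write &gt; instead of ">" when using the greater-than sign.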
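The escaping in 8.2 can be checked outside Oozie. This small Python sketch uses only the standard library to show that the raw condition is rejected by an XML parser while the escaped one parses back to the original text. The dates are sample values, and is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample query condition with the window boundaries filled in.
condition = (
    "update_time >= '2018-01-23 00:00:00' "
    "and update_time < '2018-01-24 00:00:00'"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# escape() replaces &, < and > with their XML entities.
escaped = escape(condition)

raw_ok = is_well_formed("<arg>" + condition + "</arg>")
escaped_ok = is_well_formed("<arg>" + escaped + "</arg>")
round_trip = ET.fromstring("<arg>" + escaped + "</arg>").text

print(escaped)
print(raw_ok, escaped_ok, round_trip == condition)
```

The parser turns the entities back into < and > before Sqoop sees the argument, so the SQL that reaches MySQL is unchanged.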
