DataX 3.0 Installation and Basic Usage
By 阿新 · Published: 2019-01-06
Installation steps:
1. Extract the DataX package:
[root@slave1 datax]# tar -xvf datax.tar.gz
2. Grant permissions on the DataX installation path:
[root@slave1 datax]# chmod -R 775 ./datax
3. Run the bundled sample job to verify the installation:
[root@slave1 bin]# python datax.py ../job/job.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2016, Alibaba Group. All Rights Reserved.
2016-12-10 02:55:17.640 [main] INFO VMInfo - VMInfo# operatingSystem class => com.sun.management.UnixOperatingSystem
2016-12-10 02:55:17.654 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.7 24.79-b02
jvmInfo: Linux amd64 2.6.32-431.el6.x86_64
cpu num: 1
totalPhysicalMemory: 1.83G
freePhysicalMemory: 0.09G
maxFileDescriptorCount: 4096
currentOpenFileDescriptorCount: 67
GC Names [MarkSweepCompact, Copy]
MEMORY_NAME    | allocation_size | init_size
Tenured Gen    | 682.69MB        | 682.69MB
Eden Space     | 273.06MB        | 273.06MB
Code Cache     | 48.00MB         | 2.44MB
Perm Gen       | 82.00MB         | 20.75MB
Survivor Space | 34.13MB         | 34.13MB
2016-12-10 02:55:17.693 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"streamreader",
"parameter":{
"column":[
{
"type":"string",
"value":"DataX"
},
{
"type":"long",
"value":19890604
},
{
"type":"date",
"value":"1989-06-04 00:00:00"
},
{
"type":"bool",
"value":true
},
{
"type":"bytes",
"value":"test"
}
],
"sliceRecordCount":100000
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"encoding":"UTF-8",
"print":false
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"byte":10485760
}
}
}
2016-12-10 02:55:17.757 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2016-12-10 02:55:17.764 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2016-12-10 02:55:17.765 [main] INFO JobContainer - DataX jobContainer starts job.
2016-12-10 02:55:17.775 [main] INFO JobContainer - Set jobId = 0
2016-12-10 02:55:17.854 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2016-12-10 02:55:17.859 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2016-12-10 02:55:17.859 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2016-12-10 02:55:17.863 [job-0] INFO JobContainer - jobContainer starts to do split ...
2016-12-10 02:55:17.864 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 10485760 bytes.
2016-12-10 02:55:17.875 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
2016-12-10 02:55:17.875 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
2016-12-10 02:55:17.931 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2016-12-10 02:55:17.960 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2016-12-10 02:55:17.979 [job-0] INFO JobContainer - Running by standalone Mode.
2016-12-10 02:55:18.017 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2016-12-10 02:55:18.040 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2016-12-10 02:55:18.049 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2016-12-10 02:55:18.117 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2016-12-10 02:55:18.628 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[516]ms
2016-12-10 02:55:18.628 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2016-12-10 02:55:28.074 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.055s | All Task WaitReaderTime 0.360s | Percentage 100.00%
2016-12-10 02:55:28.074 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2016-12-10 02:55:28.076 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2016-12-10 02:55:28.077 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2016-12-10 02:55:28.079 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2016-12-10 02:55:28.082 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /root/datax/datax/hook
2016-12-10 02:55:28.087 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
14.42% | 14.42% | 14.42%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
MarkSweepCompact | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
Copy | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2016-12-10 02:55:28.089 [job-0] INFO JobContainer - PerfTrace not enable!
2016-12-10 02:55:28.091 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.055s | All Task WaitReaderTime 0.360s | Percentage 100.00%
2016-12-10 02:55:28.093 [job-0] INFO JobContainer -
Job start time            : 2016-12-10 02:55:17
Job end time              : 2016-12-10 02:55:28
Total elapsed time        : 10s
Average throughput        : 253.91KB/s
Record write speed        : 10000rec/s
Total records read        : 100000
Total read/write failures : 0
The output above shows that DataX is installed and working correctly.
Configuring a sample job:
Next, let's configure a job that syncs data from one MySQL database to another.
Step 1: Create the job configuration file (in JSON format).
You can print a configuration template with: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
[root@slave1 bin]# python datax.py -r mysqlreader -w mysqlwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2016, Alibaba Group. All Rights Reserved.
Please refer to the mysqlreader document:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
Please refer to the mysqlwriter document:
https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": [],
"table": []
}
],
"password": "",
"username": "",
"where": ""
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": "",
"table": []
}
],
"password": "",
"preSql": [],
"session": [],
"username": "",
"writeMode": ""
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
Step 2: Fill in the options according to the template.
The command output above includes the documentation URLs for the corresponding reader and writer, along with a sample JSON config; completing the blanks in that sample finishes the configuration. The job file (mysql2mysql.json) filled in from the template looks like this:
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": ["store_id","store_type","region_id","store_name","store_number","store_street_address","store_city","store_state","store_postal_code","store_country","store_manager","store_phone","store_fax","first_opened_date","last_remodel_date","store_sqft","grocery_sqft","frozen_sqft","meat_sqft","coffee_bar","video_store","salad_bar","prepared_food","florist"],
"connection": [
{
"jdbcUrl": ["jdbc:mysql://192.168.18.149:3306/saiku"],
"table": ["store"]
}
],
"password": "root",
"username": "root"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["store_id","store_type","region_id","store_name","store_number","store_street_address","store_city","store_state","store_postal_code","store_country","store_manager","store_phone","store_fax","first_opened_date","last_remodel_date","store_sqft","grocery_sqft","frozen_sqft","meat_sqft","coffee_bar","video_store","salad_bar","prepared_food","florist"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://192.168.18.147:3306/saiku",
"table": ["store"]
}
],
"password": "root",
"username": "root"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
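Before launching, it can help to sanity-check that the reader and writer column lists line up, since DataX maps columns to each other positionally. A minimal sketch (the validate_job helper and its rules are our own illustration, not part of DataX; the sample dict mirrors mysql2mysql.json above with the column lists truncated):

```python
def validate_job(cfg):
    """Basic structural checks for a DataX job config."""
    for item in cfg["job"]["content"]:
        reader = item["reader"]["parameter"]
        writer = item["writer"]["parameter"]
        # DataX maps reader columns to writer columns by position,
        # so the two lists must be the same length.
        if len(reader["column"]) != len(writer["column"]):
            raise ValueError("reader/writer column counts differ")
        # Every connection needs a jdbcUrl and at least one table.
        for conn in reader["connection"] + writer["connection"]:
            if not conn.get("jdbcUrl") or not conn.get("table"):
                raise ValueError("connection missing jdbcUrl or table")
    return True

# Abbreviated version of the mysql2mysql.json config above.
sample = {
    "job": {
        "content": [{
            "reader": {"name": "mysqlreader", "parameter": {
                "column": ["store_id", "store_type"],
                "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.18.149:3306/saiku"],
                                "table": ["store"]}]}},
            "writer": {"name": "mysqlwriter", "parameter": {
                "column": ["store_id", "store_type"],
                "connection": [{"jdbcUrl": "jdbc:mysql://192.168.18.147:3306/saiku",
                                "table": ["store"]}]}},
        }]
    }
}
print(validate_job(sample))  # True
```

In practice you would json.load the real file and fail fast on a mismatch before invoking datax.py.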
Step 3: Launch DataX:
[root@slave1 bin]# python datax.py ./mysql2mysql.json
When the sync finishes, the log summary reads:
Job start time            : 2016-12-10 07:47:16
Job end time              : 2016-12-10 07:47:27
Total elapsed time        : 11s
Average throughput        : 320B/s
Record write speed        : 2rec/s
Total records read        : 25
Total read/write failures : 0
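If you wrap DataX in automation, the StandAloneJobContainerCommunicator statistics line in the log is convenient to parse, since it carries record, byte, and error totals in one place. A regex sketch over the line shown in the first log above (parse_stats is our own helper name, not a DataX API):

```python
import re

# The statistics line emitted by StandAloneJobContainerCommunicator (see log above).
line = ("Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | "
        "Error 0 records, 0 bytes | All Task WaitWriterTime 0.055s | "
        "All Task WaitReaderTime 0.360s | Percentage 100.00%")

pattern = re.compile(
    r"Total (\d+) records, (\d+) bytes .* Error (\d+) records, (\d+) bytes")

def parse_stats(text):
    """Extract (records, bytes, error_records, error_bytes) from a DataX stats line."""
    m = pattern.search(text)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

print(parse_stats(line))  # (100000, 2600000, 0, 0)
```

A non-zero third field (error records) is a cheap signal to alert on after each run.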
Job configuration
(1) Basic Job configuration
The basic Job configuration defines a job's fundamental, framework-level settings, including:
{
"job": {
"content": [
{
"reader": {
"name": "",
"parameter": {}
},
"writer": {
"name": "",
"parameter": {}
}
}
],
"setting": {
"speed": {},
"errorLimit": {}
}
}
}
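Because this skeleton is plain JSON, it is easy to generate from code when you manage many jobs. A minimal sketch (the make_job helper is our own, not part of DataX):

```python
import json

def make_job(reader_name, writer_name, speed=None, error_limit=None):
    """Build a bare DataX job skeleton; parameters stay empty for the caller to fill."""
    return {
        "job": {
            "content": [{
                "reader": {"name": reader_name, "parameter": {}},
                "writer": {"name": writer_name, "parameter": {}},
            }],
            "setting": {
                "speed": speed or {},
                "errorLimit": error_limit or {},
            },
        }
    }

job = make_job("mysqlreader", "mysqlwriter", speed={"channel": 1})
print(json.dumps(job, indent=2))
```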
(2) Job Setting configuration
{
"job": {
"content": [
{
"reader": {
"name": "",
"parameter": {}
},
"writer": {
"name": "",
"parameter": {}
}
}
],
"setting": {
"speed": {
"channel": 1,
"byte": 104857600
},
"errorLimit": {
"record": 10,
"percentage": 0.05
}
}
}
}
● job.setting.speed (throughput control)
A Job lets you customize the synchronization speed: the channel value controls the number of concurrent channels, and the byte value caps the byte throughput.
● job.setting.errorLimit (dirty-data control)
A Job lets you monitor and set limits on dirty data, via an absolute threshold on dirty records (the record value) and/or a dirty-data ratio threshold (the percentage value). When the dirty data produced during transfer exceeds the configured count or percentage, the DataX job fails and exits with an error.
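The two thresholds act independently: the job aborts as soon as dirty records exceed the absolute record limit, or the dirty-data ratio exceeds percentage. A sketch of that decision logic (the function name and exact tie-breaking are our own reading; DataX performs this check internally):

```python
def exceeds_error_limit(dirty, total, record_limit=None, percentage=None):
    """Return True when dirty-data counts break either configured threshold."""
    if record_limit is not None and dirty > record_limit:
        return True
    if percentage is not None and total > 0 and dirty / float(total) > percentage:
        return True
    return False

# With the sample setting above: record=10, percentage=0.05.
print(exceeds_error_limit(11, 100000, record_limit=10, percentage=0.05))  # True: 11 > 10
print(exceeds_error_limit(5, 100, record_limit=10, percentage=0.05))      # False
```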