Big Data: Data Collection Engines (Sqoop and Flume)
By 阿新 • Published: 2019-01-03
一、Data Collection Engines
1. Prepare the lab environment: an Oracle database
User: SH   Table: SALES, an orders table (~920,000 rows)
2. Sqoop: collects data from relational databases
Used in offline computing applications
Key point: batch
(1) A data exchange engine: RDBMS <---> Sqoop <---> HDFS, HBase, Hive
(2) Built on MapReduce under the hood
(3) Depends on JDBC
(4) Installation: tar -zxvf sqoop-1.4.5.bin__hadoop-0.23.tar.gz -C ~/training/
Set the environment variables:
SQOOP_HOME=/root/training/sqoop-1.4.5.bin__hadoop-0.23
export SQOOP_HOME
PATH=$SQOOP_HOME/bin:$PATH
export PATH
Note: with an Oracle database, user names, table names, and column names must be UPPERCASE.
(*)codegen Generate code to interact with database records
Generates a Java class matching the table's structure:
sqoop codegen --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --outdir /root/sqoop
(*)create-hive-table Import a table definition into Hive
(*)eval Evaluate a SQL statement and display the results
Runs a SQL statement through Sqoop:
sqoop eval --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --query 'select * from emp'
(*)export Export an HDFS directory to a database table
(*)help List available commands
(*)import Import a table from a database to HDFS
Importing data:
(1) Import all rows of the EMP table into HDFS
sqoop import --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --target-dir /sqoop/import/emp1
(2) Import only specific columns
sqoop import --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --columns ENAME,SAL --target-dir /sqoop/import/emp2
(3) Import the orders table (SALES)
sqoop import --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SH --password sh --table SALES --target-dir /sqoop/import/sales -m 1
Without -m 1 the import fails, because SALES has no primary key: ERROR tool.ImportTool: Error during import: No primary key could be found for table SALES. Please specify one with --split-by or perform a sequential import with '-m 1'.
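The error above points at how Sqoop parallelizes an import: it reads the minimum and maximum of the split column (by default the primary key) and divides that range evenly among the mappers. A minimal sketch of that partitioning arithmetic in Python (the function name and the numeric-key assumption are mine, not Sqoop's API):

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the key range [lo, hi] into num_mappers contiguous
    sub-ranges, one per map task. Each pair is (start, end) with
    end exclusive; the last range is stretched to cover hi."""
    size = (hi - lo + 1) / num_mappers
    ranges = []
    start = lo
    for i in range(num_mappers):
        if i == num_mappers - 1:
            end = hi + 1                      # last mapper takes the remainder
        else:
            end = lo + round(size * (i + 1))  # even boundary for mapper i
        ranges.append((start, end))
        start = end
    return ranges

# 4 mappers over the ~920,000-row SALES key space
print(split_ranges(1, 920000, 4))
```

With -m 1 there is only one mapper, so Sqoop skips the split computation entirely, which is why the sequential import succeeds even though SALES has no primary key.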
(*)import-all-tables Import tables from a database to HDFS
Imports every table owned by a user; default HDFS path: /user/root
sqoop import-all-tables --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger
(*)job Work with saved jobs
(*)list-databases List available databases on a server
(*) For MySQL: lists the database names
(*) For Oracle: lists the names of all users in the database
sqoop list-databases --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SYSTEM --password password
(*)list-tables List available tables in a database
(*)merge Merge results of incremental imports
(*)metastore Run a standalone Sqoop metastore
(*)version Display version information
3. Flume: collects log data
Used in real-time (streaming) computing applications
Key point: real time
Start the agent with the configuration below (myagent/a4.conf):
# bin/flume-ng agent -n a4 -f myagent/a4.conf -c conf -Dflume.root.logger=INFO,console
# Define the agent name and the names of its source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1
# Configure the source: watch a spool directory for new files
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /root/training/logs
# Configure the channel: buffer events in memory
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100
# Define an interceptor that adds a timestamp header to each event
# (needed by the %Y%m%d escapes in the HDFS sink path)
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Configure the sink: write events to HDFS
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://192.168.157.11:9000/flume/%Y%m%d
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream
# Do not roll files based on event count
a4.sinks.k1.hdfs.rollCount = 0
# Roll a new file when the current HDFS file reaches 128 MB
a4.sinks.k1.hdfs.rollSize = 134217728
# Roll a new file every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60
# Wire the source and sink to the channel
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1
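The three roll settings above combine with OR semantics: the HDFS sink rolls the current file as soon as any enabled condition is met, and a value of 0 disables that condition. A minimal sketch of the decision in Python (a paraphrase of the config semantics, not Flume's actual code; the function name is mine):

```python
def should_roll(bytes_written, event_count, seconds_open,
                roll_size=134217728, roll_count=0, roll_interval=60):
    """Return True when the HDFS sink should close the current file
    and start a new one. A setting of 0 disables that trigger,
    matching the a4.conf values above (size: 128 MB, interval: 60 s)."""
    if roll_size and bytes_written >= roll_size:
        return True
    if roll_count and event_count >= roll_count:
        return True
    if roll_interval and seconds_open >= roll_interval:
        return True
    return False

# rollCount = 0, so event count alone never triggers a roll
assert not should_roll(1024, 1_000_000, 10)
# 128 MB written -> roll on size
assert should_roll(134217728, 10, 10)
# 60 seconds open -> roll on time
assert should_roll(0, 0, 60)
```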
The Flume architecture
二、HUE
三、ZooKeeper
Differences between Oracle and MySQL