
Big Data: Data Ingestion Engines (Sqoop and Flume)

I. Data Ingestion Engines

    1. Prepare the lab environment: an Oracle database
                      user: sh    table: sales (an orders table with about 920,000 rows)
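       A quick sanity check that the database is reachable before running any Sqoop jobs (a sketch; it assumes sqlplus is available on the client and that the sh user's password is sh, matching the Sqoop import example below):

                      # count the rows in the orders table; expect roughly 920,000
                      echo "select count(*) from sales;" | sqlplus -s sh/sh@//192.168.157.163:1521/orcl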

    2. Sqoop: ingests data from relational databases
              used in offline (batch) computing applications
              key point: batch
              (1) A data exchange engine: RDBMS <---> Sqoop <---> HDFS, HBase, Hive
              (2) Built on MapReduce under the hood
              (3) Depends on JDBC
              (4) Installation:
                      tar -zxvf sqoop-1.4.5.bin__hadoop-0.23.tar.gz -C ~/training/
                  Set the environment variables:
                      SQOOP_HOME=/root/training/sqoop-1.4.5.bin__hadoop-0.23
                      export SQOOP_HOME
                      PATH=$SQOOP_HOME/bin:$PATH
                      export PATH
              Note: with an Oracle database, user names, table names, and column names must be written in UPPERCASE.
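              To verify the installation (a small sketch; it assumes the variables above were added to ~/.bash_profile):

                      # reload the profile and print the Sqoop version
                      source ~/.bash_profile
                      sqoop version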
              The main Sqoop subcommands (as listed by sqoop help):

              (*) codegen: Generate code to interact with database records
                  Generates the corresponding Java class from a table's structure:
                      sqoop codegen --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --outdir /root/sqoop
              (*) create-hive-table: Import a table definition into Hive
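                  A hypothetical call (it assumes Hive is installed alongside Sqoop; the Hive table name emp is made up):
                      # create a Hive table named emp from the structure of Oracle's SCOTT.EMP
                      sqoop create-hive-table --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --hive-table emp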
              (*) eval: Evaluate a SQL statement and display the results
                  Runs SQL from within Sqoop:
                      sqoop eval --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --query 'select * from emp'
              (*) export: Export an HDFS directory to a database table
              (*) help: List available commands
              (*) import: Import a table from a database to HDFS
                  Importing data:
                  (1) Import all rows of the EMP table (onto HDFS):
                      sqoop import --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --target-dir /sqoop/import/emp1
                  (2) Import only the specified columns:
                      sqoop import --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger --table EMP --columns ENAME,SAL --target-dir /sqoop/import/emp2
                  (3) Import the orders table:
                      sqoop import --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SH --password sh --table SALES --target-dir /sqoop/import/sales -m 1
                      Running without -m 1 fails, because SALES has no primary key for Sqoop to split the work on:
                          ERROR tool.ImportTool: Error during import: No primary key could be found for table SALES. Please specify one with --split-by or perform a sequential import with '-m 1'.
                      That is, either name a split column with --split-by so several mappers can share the work, or force a single sequential mapper with -m 1.
              (*) import-all-tables: Import tables from a database to HDFS
                  Imports all tables owned by a given user; default target path: /user/root
                      sqoop import-all-tables --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SCOTT --password tiger
              (*) job: Work with saved jobs
              (*) list-databases: List available databases on a server
                  (*) For MySQL: the database names
                  (*) For Oracle: the names of all users in the database
                      sqoop list-databases --connect jdbc:oracle:thin:@192.168.157.163:1521/orcl --username SYSTEM --password password
              (*) list-tables: List available tables in a database
              (*) merge: Merge results of incremental imports
              (*) metastore: Run a standalone Sqoop metastore
              (*) version: Display version information

    3. Flume: collects log data
              used in real-time (streaming) computing applications
              key point: real time

              Start the agent with:
                  bin/flume-ng agent -n a4 -f myagent/a4.conf -c conf -Dflume.root.logger=INFO,console

              myagent/a4.conf:
                  # define the agent and the names of its source, channel, and sink
                  a4.sources = r1
                  a4.channels = c1
                  a4.sinks = k1

                  # configure the source: watch a spool directory for new files
                  a4.sources.r1.type = spooldir
                  a4.sources.r1.spoolDir = /root/training/logs

                  # configure the channel: buffer events in memory
                  a4.channels.c1.type = memory
                  a4.channels.c1.capacity = 10000
                  a4.channels.c1.transactionCapacity = 100

                  # define an interceptor that adds a timestamp to each event
                  a4.sources.r1.interceptors = i1
                  a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

                  # configure the sink: write to HDFS, one directory per day
                  a4.sinks.k1.type = hdfs
                  a4.sinks.k1.hdfs.path = hdfs://192.168.157.11:9000/flume/%Y%m%d
                  a4.sinks.k1.hdfs.filePrefix = events-
                  a4.sinks.k1.hdfs.fileType = DataStream

                  # do not roll files by event count
                  a4.sinks.k1.hdfs.rollCount = 0
                  # roll a new HDFS file when the current one reaches 128 MB
                  a4.sinks.k1.hdfs.rollSize = 134217728
                  # roll a new HDFS file every 60 seconds
                  a4.sinks.k1.hdfs.rollInterval = 60

                  # wire the source, channel, and sink together
                  a4.sources.r1.channels = c1
                  a4.sinks.k1.channel = c1
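              To exercise the agent end to end (a sketch based on the configuration above; sample.log is a made-up file name):

                  # create the spool directory, then drop a log file into it while the agent runs
                  mkdir -p /root/training/logs
                  cp /root/training/sample.log /root/training/logs/
                  # within about a minute the events should appear under today's directory on HDFS
                  hdfs dfs -ls /flume/$(date +%Y%m%d)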

Flume Architecture

[Figure: a Flume agent passes events from a source, through a channel, to a sink]


II. HUE

III. ZooKeeper

Differences Between Oracle and MySQL

[Figure: differences between Oracle and MySQL]