Big Data Study Notes, Website Traffic Log Analysis Project: The Flume Log Collection System (1)

阿新 • Published: 2018-12-07

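I. Website traffic log project
   -》Project development phases:
       -》Feasibility analysis
       -》Requirements analysis
       -》Detailed design
       -》Implementation
       -》Testing
       -》Release
   -》Big data workflow
       -》Data collection: Sqoop, Flume, Kafka, Logstash
           -》Sources: log files, RDBMS, real-time data streams
           -》Destinations: HDFS, NoSQL stores, Hive
       -》Data storage: loading the collected data into the store
       -》Data computation: Hive, MapReduce, Spark
           -》Data cleaning
           -》Data modeling
           -》Data analysis
       -》Data presentation: Java web applications, visualization tools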


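II. Using Flume
   -》Features
       collecting, aggregating, and moving
       source,     channel,         sink
   -》How Flume works
       -》source: reads from the data source, turns the data into a stream, and wraps each record in an event
           The event is the smallest unit of data that Flume transfers:
               headers: key=value attributes describing the event
               body: the actual data
       -》channel: buffers events temporarily
       -》sink: delivers events to the destination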
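The source, channel, and sink roles described above map directly onto an agent's properties file. The following is a minimal sketch, not a configuration taken from these notes: the agent name a1, the component names s1, c1, k1, and the netcat source on localhost:44444 are all illustrative assumptions.

```properties
# Illustrative agent "a1": one source, one channel, one sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# source: wraps each line received on the socket in an event
# (netcat source, bind address, and port are assumptions for this sketch)
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444

# channel: buffers events between source and sink
a1.channels.c1.type = memory

# sink: delivers events to the destination, here the agent's own log
a1.sinks.k1.type = logger

# wiring: a source may feed several channels, a sink drains exactly one
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: the source property is the plural `channels`, while the sink property is the singular `channel`.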


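III. Deploying Flume
   -》Download and unpack
       tar -zxvf flume-ng-1.6.0-cdh5.7.6.tar.gz -C /opt/cdh-5.7.6/
   -》Edit the configuration
       mv conf/flume-env.sh.template conf/flume-env.sh
       Then set in conf/flume-env.sh:
       export JAVA_HOME=/opt/modules/jdk1.8.0_91

       -》How Flume locates HDFS (options)
           -》Set HADOOP_HOME as a global environment variable
           -》Declare HADOOP_HOME in the configuration file
           -》Give the absolute HDFS address in the agent configuration
               hdfs://hostname:8020/flume
               -》If HDFS runs in HA mode
                   -》Copy core-site.xml and hdfs-site.xml into Flume's configuration directory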
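As a sketch of the third option, the HDFS address goes into the sink's `hdfs.path` property. The agent and sink names (a1, k1) are assumptions, and the nameservice ID ns1 in the commented HA variant is a hypothetical placeholder that would have to match `dfs.nameservices` in the copied hdfs-site.xml.

```properties
# Absolute HDFS address written directly into the agent configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hostname:8020/flume

# HA variant (assumption: the nameservice is called "ns1").
# The logical name only resolves once core-site.xml and hdfs-site.xml
# are present in Flume's conf directory.
# a1.sinks.k1.hdfs.path = hdfs://ns1/flume
```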

   -》Copy core-site.xml and hdfs-site.xml into Flume's configuration directory
       cp ../hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml ../hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml conf/

   -》Copy the jars Flume needs for writing to HDFS into Flume's lib directory
           commons-configuration-1.6.jar
           hadoop-auth-2.6.0-cdh5.7.6.jar
           hadoop-common-2.6.0-cdh5.7.6.jar
           hadoop-hdfs-2.6.0-cdh5.7.6.jar
           htrace-core4-4.0.1-incubating.jar


IV. Running Flume
   -》Launchers:
       flume-og: the old generation
       flume-ng: the new generation
       Usage: bin/flume-ng <command> [options]...
           bin/flume-ng agent --conf $flume_conf_dir --name agent_name --conf-file agent_file_path -Dflume.root.logger=INFO,console

   -》Case 1: read the Hive log and send it to the logger
       agent:
           source: reads the Hive log and passes the data to the channel
           channel: holds the data sent by the source, in memory
           sink: takes data from the channel and writes it to the log

       -》Run
           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-log.properties -Dflume.root.logger=INFO,console
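The notes do not show the contents of case/hive-mem-log.properties. One plausible version, offered as a sketch only, uses an exec source that tails the Hive log, a memory channel, and a logger sink. The log path below is a placeholder, and the channel sizes are borrowed from the spooldir configuration at the end of these notes.

```properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# source: follow the Hive log with tail -F
# (placeholder path; point it at the actual hive.log location)
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /path/to/hive/logs/hive.log

# channel: in-memory buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sink: print events through the agent's logger
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```

With `-Dflume.root.logger=INFO,console` on the command line, the logger sink's output appears directly in the terminal.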

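   -》Case 2: use a file channel
       bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-file-log.properties -Dflume.root.logger=INFO,console

       memory channel: fast reads and writes, but buffered events are lost if the agent stops
       file channel: somewhat slower, but events are persisted to disk, so the data is safer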
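Switching from the memory channel to the file channel only changes the channel section of the agent file. A hedged sketch follows; both directory paths are hypothetical, and the rest of the agent (source, sink, bindings) is assumed to stay as in the memory-channel case.

```properties
# channel: persist events on local disk instead of in memory
a1.channels.c1.type = file
# where the channel keeps its checkpoint (hypothetical path)
a1.channels.c1.checkpointDir = /opt/datas/flume/channel/checkpoint
# where the channel stores event data (hypothetical path)
a1.channels.c1.dataDirs = /opt/datas/flume/channel/data
```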

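   -》Case 3: collect the data into HDFS
       bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-hdfs.properties -Dflume.root.logger=INFO,console

       -》Controlling the size of the output files
           -》Roll a new file by time (seconds; 0 disables this trigger)
               hdfs.rollInterval=0
           -》Roll a new file by size: default 1024 bytes
               hdfs.rollSize=10240 (in production, typically the byte count for about 125 MB)
           -》Roll a new file by event count (0 disables this trigger)
               hdfs.rollCount=0

           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-size.properties -Dflume.root.logger=INFO,console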
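To make the "about 125 MB" guideline concrete: 125 × 1024 × 1024 = 131072000 bytes. The sink section below is a sketch of a size-only roll policy using that value; the sink name k1 and the target path /flume/size are assumptions.

```properties
a1.sinks.k1.type = hdfs
# hypothetical target directory
a1.sinks.k1.hdfs.path = /flume/size

# disable time-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
# roll when the file reaches 125 MB (125 * 1024 * 1024 bytes)
a1.sinks.k1.hdfs.rollSize = 131072000
# disable rolling by event count
a1.sinks.k1.hdfs.rollCount = 0
```

Setting the other two triggers to 0 matters: left at their defaults, the time or event-count trigger would roll files long before the size limit is reached.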


       -》Write into directories named by time
           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/hive-mem-part.properties -Dflume.root.logger=INFO,console
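Time-based directories are usually produced with escape sequences in `hdfs.path`. The sketch below assumes a daily layout under /flume/part, a path not given in the notes. Because an exec source does not add a timestamp header by itself, the example asks the sink to use the agent's local time.

```properties
a1.sinks.k1.type = hdfs
# one directory per day, e.g. /flume/part/20180402 (hypothetical base path)
a1.sinks.k1.hdfs.path = /flume/part/%Y%m%d
# resolve the escape sequences from local time instead of a
# "timestamp" event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```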

       -》Set the file name prefix: hdfs.filePrefix
       -》Set the idle timeout after which an inactive file is closed: hdfs.idleTimeout
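Both properties sit on the HDFS sink. The values below are illustrative assumptions rather than settings from the notes.

```properties
# files are named hive-log.<counter> instead of the default FlumeData.<counter>
a1.sinks.k1.hdfs.filePrefix = hive-log
# close a file that has received no events for 60 seconds
# (the default of 0 never closes idle files)
a1.sinks.k1.hdfs.idleTimeout = 60
```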

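   -》Case 4:
       logs/2018-04-02.log
           2018-04-03.log
           2018-04-04.log

       -》spooling dir source: picks up files dynamically as they appear in a directory
           Run:
           bin/flume-ng agent --conf conf/ --name a1 --conf-file case/dir-mem-size.properties -Dflume.root.logger=INFO,console

       Files that are still being written carry a .tmp suffix and are skipped through ignorePattern; they are renamed once complete:
       logs/2018-04-02.log.tmp -> 2018-04-02.log
           2018-04-03.log.tmp
           2018-04-04.log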

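   -》Case 5:
       logs/2018-04-02.log
           2018-04-03.log
           2018-04-04.log

       -》Use the taildir source
           -》Older Flume releases do not include it, so the taildir source has to be compiled by hand
           -》Get the taildir source code from flume-1.7
           -》Import it into Eclipse
               -》One class file is missing and has to be added:
                   C:\Users\江城子\Desktop\Git\flume\flume-ng-core\src\main\java\org\apache\flume\source\PollableSourceConstants.java
               -》Remove the two @Override annotations
           -》Build with Maven and package as a jar
           -》Put the jar into the lib directory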
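Once the taildir jar is on the classpath, the source section could look like the sketch below. The property names follow the Flume 1.7 taildir source; the position file location and the log directory are hypothetical paths.

```properties
a1.sources.s1.type = TAILDIR
# JSON file where the source records how far it has read in each file,
# so a restarted agent resumes instead of re-reading (hypothetical path)
a1.sources.s1.positionFile = /opt/datas/flume/taildir/position.json
# one file group "f1": the file-name part of the path is a regex
a1.sources.s1.filegroups = f1
a1.sources.s1.filegroups.f1 = /opt/datas/flume/logs/.*log
a1.sources.s1.channels = c1
```

Unlike the spooling directory source, this tails files that are still being appended to, which suits a directory of daily logs where the newest file keeps growing.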


   -》Commonly used Flume component types
       -》source: avro source/sink, kafka source, exec source, spooldir source, taildir source
       -》channel: file, memory, kafka
       -》sink: kafka, hdfs, hive

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent, 
# in this case called 'a1'

# define agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# define source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume/spooling
a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)

# define channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# define sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/spoolingdir
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollCount = 0

# bind source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
