Flume Log Collection with Multi-Level Agents
A Brief Introduction to Flume
Any system produces a large amount of log data while it runs, and we need to analyze those logs. Before the analysis can happen, the logs scattered across the production systems have to be gathered in one place. Flume is exactly such a log collection system.
Flume has three main components:
Source: consumes data from an external data source such as a web system (typically the logs that the web system produces). The external source sends Flume events in a format that the target Flume source can recognize. Common source types include avro, exec, jms, spooling directory, kafka, and netcat.
Channel: when a Flume source reads data from the external source, Flume first stages the data in one or more channels, where it remains until it is consumed by a sink. The main channel types are memory, jdbc, kafka, and file.
Sink: consumes the data in a channel and writes it to an external persistent store such as a file system. The main sink types include HDFS, Hive, Avro, File Roll, Kafka, HBase, and Elasticsearch.
Below is the explanation from the official Apache Flume documentation:
A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.
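As a minimal sketch of how these three components are wired together in an agent's configuration file, the standard quick-start example from the Flume User Guide is shown below: an agent named a1 with a netcat source listening on a local TCP port, a memory channel buffering the events, and a logger sink printing them to the console.
# example.conf: single agent with netcat source, memory channel, logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1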
Multi-Level Chaining of Agents
In some scenarios a large number of servers each produce large volumes of log files. In that case we first deploy a set of first-level Flume collection agents. Each of these agents needs a sink of type avro; once data has been collected, this sink sends it onward, and where exactly it is sent is specified in the configuration file. The next level of Flume agents then uses a source of type avro to receive the log data sent by the level above, and at this level the data is finally sunk into the HDFS file system.
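The key wiring between the two levels is a matching pair of settings: the avro sink of a first-level agent points at the host and port where the avro source of the second-level agent is listening, for example:
# first level (bigdata02 / bigdata04): avro sink pointing at the second level
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata03
a1.sinks.k1.port = 4141
# second level (bigdata03): avro source listening on that port
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
The same settings appear in the full configurations of the case study below.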
Case study: simulating multi-Agent log collection
Suppose there are three hosts, bigdata02, bigdata03 and bigdata04. bigdata02 and bigdata04 serve as the first collection level, and bigdata03 serves as the second level.
First, the Flume configuration file tail-avro.conf on bigdata02 and bigdata04:
# Launch command: bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1
########
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/flumedata/log/test.log
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata03
a1.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
This configuration tails the file /home/hadoop/flumedata/log/test.log: as soon as new lines are written to it, they are collected by Flume and forwarded through the avro sink to bigdata03.
The configuration file avro-file_roll.conf on bigdata03:
# Receive data on an avro port and sink it to local files
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /tmp/test
a1.sinks.k1.sink.rollInterval = 10
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
This configuration receives events over avro. Acting as the server side, it binds to 0.0.0.0 so that it listens on all of its own IP addresses on port 4141, and the collected data is written to the local file system (make sure the directory /tmp/test exists). If you need to write to HDFS instead, change the sink type to hdfs and adjust the corresponding sink properties; see the Apache Flume Documentation on the official website for the details.
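For reference, a minimal sketch of such an HDFS sink is shown below, replacing the file_roll sink in the configuration above; the namenode address and target path are assumptions and need to be adapted to your cluster:
# Describe the sink (HDFS variant)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/events/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
# write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
# roll files every 30 seconds rather than by size or event count
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# needed because the path contains time escapes like %y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1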
Use a shell loop on bigdata02 and bigdata04 to simulate log generation; the first script below runs on bigdata02 and the second on bigdata04.
#!/bin/bash
# Log generator on bigdata02: append a numbered line roughly every 0.3 seconds
i=1
while true
do
  echo "Flume get log to hdfs bigdata02"$i >> /home/hadoop/flumedata/log/test.log
  sleep 0.3
  i=$((i + 1))
done
#!/bin/bash
# Log generator on bigdata04: append a numbered line roughly every 0.3 seconds
i=1
while true
do
  echo "Flume get log to hdfs bigdata04"$i >> /home/hadoop/flumedata/log/test.log
  sleep 0.3
  i=$((i + 1))
done
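Before generating any log data, the three agents need to be running; the second-level agent on bigdata03 should be started first so that the avro sinks on the first level have something to connect to. A minimal sketch of the launch commands, assuming the two configuration files have been placed in Flume's conf directory:
# on bigdata03 (second level)
bin/flume-ng agent -c conf -f conf/avro-file_roll.conf -n a1 -Dflume.root.logger=INFO,console
# on bigdata02 and bigdata04 (first level)
bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1 -Dflume.root.logger=INFO,console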
The result of the run is as follows:
We can see that the output files indeed contain data coming from both bigdata02 and bigdata04.