Storm, Kafka, and ZooKeeper Cluster

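Storm is mainly used for stream computing, and it is very effective at processing a steady, uniform stream of incoming data. In practice, however, most data streams are not uniform: traffic arrives in bursts, sometimes heavy and sometimes light. Batch processing is clearly a poor fit for this situation, and feeding such data directly into Storm for real-time computation can congest the system and bring the servers down. Using Kafka as a message queue is a good answer: Kafka absorbs the uneven input and hands it on as a steadier message stream, so it pairs well with Storm and makes stable stream computing possible.
Combining Storm and Kafka simply joins the computing models discussed earlier: data first enters Kafka through a producer, Storm then consumes it as a consumer, and the processed data is finally output or stored in files, a database, distributed storage, and so on. The overall flow is shown in the following diagram.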

The diagram is taken from the blog post at http://www.cnblogs.com/tovin/p/3974417.html; thanks to its author.

I. Pre-installation preparation

(1) Prepare three machines running CentOS 7.
(2) JDK: jdk-8u191-linux-x64.tar.gz, available from the official site: wget https://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
(3) ZooKeeper: zookeeper-3.4.13: wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
(4) Kafka: kafka_2.11-2.0.0: wget http://mirrors.hust.edu.cn/apache/kafka/2.0.0/kafka_2.11-2.0.0.tgz
(5) Storm: apache-storm-1.2.2.tar.gz: wget http://www.apache.org/dist/storm/apache-storm-1.2.2/apache-storm-1.2.2.tar.gz
(6) Extract the archives and configure the environment variables: vi /etc/profile

# JAVA_HOME
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export ZOOKEEPER_HOME=/usr/local/java/zookeeper-3.4.13
export PATH=$PATH:$ZOOKEEPER_HOME/bin/:$JAVA_HOME/bin
#KAFKA_HOME
export KAFKA_HOME=/usr/local/java/kafka_2.11-2.0.0
export PATH=$PATH:$KAFKA_HOME/bin
# STORM_HOME
export STORM_HOME=/usr/local/java/apache-storm-1.2.2
export PATH=.:${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${STORM_HOME}/bin:$PATH

Reload the profile so that the environment variables take effect: source /etc/profile
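After reloading the profile, it can help to confirm that each home variable really points at an extracted directory before moving on. The following is a minimal sketch of such a check; the function name check_home is hypothetical and not part of any of the tools installed here.

```shell
# check_home NAME
# Prints the value of the variable called NAME and succeeds only if it is
# set and points at an existing directory. check_home is a hypothetical
# helper written for this walkthrough.
check_home() {
  # Look up the variable whose name was passed in $1 (POSIX-compatible).
  eval "_value=\${$1:-}"
  if [ -z "$_value" ]; then
    echo "$1 is not set"
    return 1
  fi
  if [ ! -d "$_value" ]; then
    echo "$1=$_value (directory not found)"
    return 1
  fi
  echo "$1=$_value"
}
```

For example, running check_home JAVA_HOME, check_home ZOOKEEPER_HOME, check_home KAFKA_HOME and check_home STORM_HOME in turn reports any variable that is missing or points at a path that does not exist.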

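II. ZooKeeper cluster installation (required on all three machines)

(1) tar -zxvf zookeeper-3.4.13.tar.gz
(2) cd /usr/local/java/zookeeper-3.4.13/conf to enter the conf directory of the extracted ZooKeeper
(3) mv zoo_sample.cfg zoo.cfg to rename the sample file to zoo.cfg
(4) Configure zoo.cfg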

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/usr/local/java/zookeeper-3.4.13/dateDir
dataLogDir=/usr/local/java/zookeeper-3.4.13/logs
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1 = 0.0.0.0:2888:3888
server.2 = 192.168.164.134:2888:3888
server.3 = 192.168.164.135:2888:3888

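(5) Create the data directory: mkdir -p /usr/local/java/zookeeper-3.4.13/dateDir
(6) Create the log directory: mkdir -p /usr/local/java/zookeeper-3.4.13/logs
(7) Write the server id: echo "1" > /usr/local/java/zookeeper-3.4.13/dateDir/myid
(8) Copy the zookeeper-3.4.13 directory to the other two machines: scp -r zookeeper-3.4.13 [email protected]:/usr/local/java/ and enter the password when prompted.
(9) On the machines that correspond to server.2 and server.3, change /usr/local/java/zookeeper-3.4.13/dateDir/myid to 2 and 3 respectively.
When copying between virtual machines, add the target IP and enter the password:
ssh -o StrictHostKeyChecking=no [email protected]
(10) Start ZooKeeper on all three machines with ./bin/zkServer.sh start. Errors are reported during startup while the other nodes are still down; once all three have started, check the status:
./zkServer.sh status
Note: when checking the status of the ZooKeeper cluster, Mode: follower or Mode: leader indicates success.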

[[email protected] bin]# ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/java/zookeeper-3.4.13/bin/../conf/zoo.cfg
Mode: follower
[[email protected] bin]# 

[[email protected] bin]# ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/java/zookeeper-3.4.13/bin/../conf/zoo.cfg
Mode: leader
[[email protected] bin]# 

[[email protected] bin]# ./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/java/zookeeper-3.4.13/bin/../conf/zoo.cfg
Mode: follower
[[email protected] bin]#
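Steps (7) and (9) set a different myid on each machine by hand. As a minimal sketch, assuming each machine is listed in zoo.cfg under the address you look it up by, the id can be read from the server.N lines instead of being typed. The function name myid_for_host is hypothetical and not part of ZooKeeper.

```shell
# myid_for_host ZOO_CFG HOST
# Prints the N of the "server.N = HOST:port:port" line whose host matches
# HOST exactly, and fails if no such line exists.
# myid_for_host is a hypothetical helper written for this walkthrough.
myid_for_host() {
  awk -F'[=:]' -v host="$2" '
    /^server\.[0-9]+[ \t]*=/ {
      id = $1
      sub(/^server\./, "", id)   # keep only the number after "server."
      gsub(/[ \t]/, "", id)
      addr = $2
      gsub(/[ \t]/, "", addr)    # host part, without surrounding spaces
      if (addr == host) { print id; found = 1; exit }
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

On 192.168.164.134, for instance, the output of myid_for_host conf/zoo.cfg 192.168.164.134 could be redirected into dateDir/myid. With the zoo.cfg shown above, the first machine is listed as 0.0.0.0, so it would have to be looked up under that address.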

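III. Kafka cluster installation (required on all three machines)

(1) tar -zxvf kafka_2.11-2.0.0.tgz
(2) cd /usr/local/java/kafka_2.11-2.0.0/config to enter the config directory of the extracted Kafka
(3) vi server.properties to edit the configuration
(4) server.properties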

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/usr/local/java/kafka_2.11-2.0.0/logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=hadoop1:2181,hadoop2:2181,hadoop3:2181/kafka

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0

(5) Create the log directory: mkdir -p /usr/local/java/kafka_2.11-2.0.0/logs
(6) Copy the kafka_2.11-2.0.0 directory to the other two machines: scp -r kafka_2.11-2.0.0 [email protected]:/usr/local/java/ and enter the password when prompted.
(7) On the other two machines, edit server.properties and set broker.id=2 and broker.id=3 respectively.
ssh -o StrictHostKeyChecking=no [email protected]
(8) Start Kafka

[[email protected] java]# cd kafka_2.11-2.0.0
[[email protected] kafka_2.11-2.0.0]# cd bin/
[[email protected] bin]# ./kafka-server-start.sh -daemon ../config/server.properties
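Step (7) requires a different broker.id in each copied server.properties. A minimal sketch of doing that edit from the command line instead of in vi is shown here; the function name set_broker_id is hypothetical, and it assumes the file already contains a broker.id= line, as the listing above does.

```shell
# set_broker_id FILE ID
# Rewrites the existing "broker.id=" line of FILE to use ID and leaves every
# other line untouched. set_broker_id is a hypothetical helper written for
# this walkthrough.
set_broker_id() {
  # Reject anything that is not a non-negative integer.
  case "$2" in
    ''|*[!0-9]*)
      echo "broker id must be a non-negative integer" >&2
      return 1
      ;;
  esac
  # Write to a temporary file first so a failed sed never truncates FILE.
  _tmp="$1.tmp.$$"
  sed "s/^broker\.id=.*/broker.id=$2/" "$1" > "$_tmp" && mv "$_tmp" "$1"
}
```

On the second machine this would be run as set_broker_id /usr/local/java/kafka_2.11-2.0.0/config/server.properties 2, and with 3 on the third, before starting the brokers.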

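IV. Storm cluster installation (required on all three machines)

(1) tar -zxvf apache-storm-1.2.2.tar.gz
(2) cd /usr/local/java/apache-storm-1.2.2/conf to enter the conf directory of the extracted Storm
(3) vi storm.yaml to edit the configuration
(4) storm.yaml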

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

########### These MUST be filled in for a storm configuration
 storm.zookeeper.servers:
     - "hadoop1"
     - "hadoop2"
     - "hadoop3"
 storm.zookeeper.port: 2181
 nimbus.seeds: ["hadoop1"]
 storm.local.dir: "/usr/local/java/apache-storm-1.2.2/logs"
 supervisor.slots.ports:
     - 6700
     - 6701
     - 6702
     - 6703
# nimbus.seeds: ["host1", "host2", "host3"]
# 
# 
# ##### These may optionally be filled in:
#    
## List of custom serializations
# topology.kryo.register:
#     - org.mycompany.MyType
#     - org.mycompany.MyType2: org.mycompany.MyType2Serializer
#
## List of custom kryo decorators
# topology.kryo.decorators:
#     - org.mycompany.MyDecorator
#
## Locations of the drpc servers
# drpc.servers:
#     - "server1"
#     - "server2"

## Metrics Consumers
## max.retain.metric.tuples
## - task queue will be unbounded when max.retain.metric.tuples is equal or less than 0.
## whitelist / blacklist
## - when none of configuration for metric filter are specified, it'll be treated as 'pass all'.
## - you need to specify either whitelist or blacklist, or none of them. You can't specify both of them.
## - you can specify multiple whitelist / blacklist with regular expression
## expandMapType: expand metric with map type as value to multiple metrics
## - set to true when you would like to apply filter to expanded metrics
## - default value is false which is backward compatible value
## metricNameSeparator: separator between origin metric name and key of entry from map
## - only effective when expandMapType is set to true
# topology.metrics.consumer.register:
#   - class: "org.apache.storm.metric.LoggingMetricsConsumer"
#     max.retain.metric.tuples: 100
#     parallelism.hint: 1
#   - class: "org.mycompany.MyMetricsConsumer"
#     max.retain.metric.tuples: 100
#     whitelist:
#       - "execute.*"
#       - "^__complete-latency$"
#     parallelism.hint: 1
#     argument:
#       - endpoint: "metrics-collector.mycompany.org"
#     expandMapType: true
#     metricNameSeparator: "."

## Cluster Metrics Consumers
# storm.cluster.metrics.consumer.register:
#   - class: "org.apache.storm.metric.LoggingClusterMetricsConsumer"
#   - class: "org.mycompany.MyMetricsConsumer"
#     argument:
#       - endpoint: "metrics-collector.mycompany.org"
#
# storm.cluster.metrics.consumer.publish.interval.secs: 60

# Event Logger
# topology.event.logger.register:
#   - class: "org.apache.storm.metric.FileBasedEventLogger"
#   - class: "org.mycompany.MyEventLogger"
#     arguments:
#       endpoint: "event-logger.mycompany.org"

# Metrics v2 configuration (optional)
#storm.metrics.reporters:
#  # Graphite Reporter
#  - class: "org.apache.storm.metrics2.reporters.GraphiteStormReporter"
#    daemons:
#        - "supervisor"
#        - "nimbus"
#        - "worker"
#    report.period: 60
#    report.period.units: "SECONDS"
#    graphite.host: "localhost"
#    graphite.port: 2003
#
#  # Console Reporter
#  - class: "org.apache.storm.metrics2.reporters.ConsoleStormReporter"
#    daemons:
#        - "worker"
#    report.period: 10
#    report.period.units: "SECONDS"
#    filter:
#        class: "org.apache.storm.metrics2.filters.RegexFilter"
#        expression: ".*my_component.*emitted.*"

(5) Create the local directory: mkdir -p /usr/local/java/apache-storm-1.2.2/logs
(6) Copy the apache-storm-1.2.2 directory to the other two machines: scp -r apache-storm-1.2.2 [email protected]:/usr/local/java/ and enter the password when prompted.
(7) Start Storm

# Start Nimbus and the UI on 192.168.164.133
[[email protected] apache-storm-1.2.2]# cd bin/
[[email protected] bin]# ./storm nimbus >/dev/null 2>&1 &

[[email protected] apache-storm-1.2.2]# cd bin/
[[email protected] bin]# ./storm ui &

Start the supervisors on the other two machines

# Start on 192.168.164.134 and 192.168.164.135
[[email protected] apache-storm-1.2.2]# cd bin/
[[email protected] bin]# ./storm supervisor >/dev/null 2>&1 &

(8) Open http://192.168.164.133:8080/ in a browser to view the Storm UI.

V. Notes for CentOS 7 virtual machines

(1) After editing /etc/hosts, restart the network service: service network restart

127.0.0.1   hadoop1
192.168.164.134 hadoop2
192.168.164.135 hadoop3

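(2) Firewall configuration

1. Check the firewalld status with systemctl status firewalld; a "dead" state means the firewall is not running.
2. Start the firewall with systemctl start firewalld; no output means it started successfully.
3. Check the status again with systemctl status firewalld; "running" means the firewall is now on.
4. systemctl stop firewalld stops the firewall.
5. Open the following ports: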
    firewall-cmd --zone=public --add-port=2888/tcp --permanent
    firewall-cmd --zone=public --add-port=3888/tcp --permanent
    firewall-cmd --zone=public --add-port=2181/tcp --permanent
    firewall-cmd --zone=public --add-port=8080/tcp --permanent
    firewall-cmd --zone=
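The same firewall-cmd line is repeated once per port, so it can be generated from a list. The sketch below only prints the commands so they can be reviewed before being run; the function name firewall_commands is hypothetical. Rules added with --permanent only apply to the running firewall after firewall-cmd --reload, so that command is printed last. Which ports to pass is an assumption based on the configuration files above: 2181, 2888 and 3888 for ZooKeeper, 9092 for the Kafka listener, 8080 for the Storm UI, and 6700-6703 for the supervisor slots.

```shell
# firewall_commands PORT...
# Prints (does not run) one permanent add-port command per argument,
# followed by the reload that activates them. A port range such as
# 6700-6703 may be given as a single argument.
# firewall_commands is a hypothetical helper written for this walkthrough.
firewall_commands() {
  for _port in "$@"; do
    echo "firewall-cmd --zone=public --add-port=${_port}/tcp --permanent"
  done
  echo "firewall-cmd --reload"
}
```

For example, firewall_commands 2181 2888 3888 9092 8080 6700-6703 prints seven lines, which can be inspected and then piped to sh as root.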
              
           
              
              
            