Accumulators and Broadcast in Spark Streaming

1. Understanding the Basics of Accumulators and Broadcast

Shared variables

The purpose of shared variables is to cache a variable on every machine instead of shipping a copy of it with each task. In Spark Core we often broadcast environment-style variables so that, at any given time, every machine in the cluster sees the same up-to-date value. Broadcasting efficiently gives each node a copy of a variable or of a dataset, which reduces communication overhead. So when several tasks need the same data, creating a broadcast variable and combining it with parallel processing speeds things up. Let's analyse Accumulators and Broadcast through the source code.

Broadcast variables (Broadcast)

package org.apache.spark.broadcast

import java.io.Serializable

import scala.reflect.ClassTag

import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.util.Utils

/**
 * A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable
 * cached on each machine rather than shipping a copy of it with tasks. They can be used, for
 * example, to give every node a copy of a large input dataset in an efficient manner. Spark also
 * attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
 * communication cost.
 * In short: a broadcast variable lets the programmer cache a read-only value on each machine
 * instead of shipping a copy with every task, efficiently giving each node a copy of a large
 * dataset; Spark also uses efficient broadcast algorithms to reduce communication cost.
 * Broadcast variables are created from a variable `v` by calling
 * [[org.apache.spark.SparkContext#broadcast]].
 * The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the
 * `value` method. The interpreter session below shows this:
 *
 * {{{
 * scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
 * broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
 *
 * scala> broadcastVar.value
 * res0: Array[Int] = Array(1, 2, 3)
 * }}}
 *
 * After the broadcast variable is created, it should be used instead of the value `v` in any
 * functions run on the cluster so that `v` is not shipped to the nodes more than once.
 * In addition, the object `v` should not be modified after it is broadcast in order to ensure
 * that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped
 * to a new node later).
 * In short: once a variable is broadcast it can be used in any function run on the cluster,
 * and its value must not be modified, so that every node sees the same broadcast value.
 * @param id A unique identifier for the broadcast variable.
 * @tparam T Type of the data contained in the broadcast variable.
 */
abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {

  /**
   * Flag signifying whether the broadcast variable is valid
   * (that is, not already destroyed) or not.
   */
  @volatile private var _isValid = true

  private var _destroySite = ""

  /** Get the broadcasted value. */
  def value: T = {
    assertValid()
    getValue()
  }

  /**
   * Asynchronously delete cached copies of this broadcast on the executors.
   * If the broadcast is used after this is called, it will need to be re-sent to each executor.
   */
  def unpersist() {
    unpersist(blocking = false)
  }

  /**
   * Delete cached copies of this broadcast on the executors. If the broadcast is used after
   * this is called, it will need to be re-sent to each executor.
   * @param blocking Whether to block until unpersisting has completed
   */
  def unpersist(blocking: Boolean) {
    assertValid()
    doUnpersist(blocking)
  }


  /**
   * Destroy all data and metadata related to this broadcast variable. Use this with caution;
   * once a broadcast variable has been destroyed, it cannot be used again.
   * This method blocks until destroy has completed
   */
  def destroy() {
    destroy(blocking = true)
  }

  /**
   * Destroy all data and metadata related to this broadcast variable. Use this with caution;
   * once a broadcast variable has been destroyed, it cannot be used again.
   * @param blocking Whether to block until destroy has completed
   */
  private[spark] def destroy(blocking: Boolean) {
    assertValid()
    _isValid = false
    _destroySite = Utils.getCallSite().shortForm
    logInfo("Destroying %s (from %s)".format(toString, _destroySite))
    doDestroy(blocking)
  }

  /**
   * Whether this Broadcast is actually usable. This should be false once persisted state is
   * removed from the driver.
   */
  private[spark] def isValid: Boolean = {
    _isValid
  }

  /**
   * Actually get the broadcasted value. Concrete implementations of Broadcast class must
   * define their own way to get the value.
   */
  protected def getValue(): T

  /**
   * Actually unpersist the broadcasted value on the executors. Concrete implementations of
   * Broadcast class must define their own logic to unpersist their own data.
   */
  protected def doUnpersist(blocking: Boolean)

  /**
   * Actually destroy all data and metadata related to this broadcast variable.
   * Implementation of Broadcast class must define their own logic to destroy their own
   * state.
   */
  protected def doDestroy(blocking: Boolean)

  /** Check if this broadcast is valid. If not valid, exception is thrown. */
  protected def assertValid() {
    if (!_isValid) {
      throw new SparkException(
        "Attempted to use %s after it was destroyed (%s) ".format(toString, _destroySite))
    }
  }

  override def toString: String = "Broadcast(" + id + ")"
}
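
The abstract class above only defines the contract. As a minimal, hedged sketch of the typical lifecycle (create, read from tasks, unpersist, destroy), assuming an existing SparkContext named sc and illustrative variable names:

// Create a broadcast variable from a driver-side value (here a small lookup Map).
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))

// Tasks read the cached copy through .value; pass the Broadcast handle into the closure,
// not the unwrapped value, so the data is shipped at most once per executor.
val mapped = sc.parallelize(Seq("a", "b", "d")).map(w => lookup.value.getOrElse(w, 0))
println(mapped.collect().mkString(","))   // 1,2,0

// Asynchronously drop the cached copies on the executors; the value is re-sent if used again.
lookup.unpersist()

// Irreversibly remove all data and metadata; the broadcast cannot be used after this.
lookup.destroy()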

Accumulators

package org.apache.spark

/**
 * A simpler value of [[Accumulable]] where the result type being accumulated is the same
 * as the types of elements being merged, i.e. variables that are only "added" to through an
 * associative and commutative operation and can therefore be efficiently supported in parallel.
 * They can be used to implement counters (as in MapReduce) or sums. Spark natively supports
 * accumulators of numeric value types, and programmers can add support for new types.
 * 1. Accumulators only support an "add" operation, so they can be supported efficiently in parallel.
 * 2. They can be used for counters and sums; Spark natively supports accumulators of numeric types,
 *    and programmers can also add support for their own types.
 * An accumulator is created from an initial value `v` by calling
 * [[SparkContext#accumulator SparkContext.accumulator]].
 * Tasks running on the cluster can then add to it using the [[Accumulable#+= +=]] operator.
 * However, they cannot read its value. Only the driver program can read the accumulator's value,
 * using its [[#value]] method.
 *
 * The interpreter session below shows an accumulator being used to add up the elements of an array:
 *
 * {{{
 * scala> val accum = sc.accumulator(0)
 * accum: org.apache.spark.Accumulator[Int] = 0
 *
 * scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
 * ...
 * 10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
 *
 * scala> accum.value
 * res2: Int = 10
 * }}}
 *
 * @param initialValue initial value of accumulator
 * @param param helper object defining how to add elements of type `T`
 * @param name human-readable name associated with this accumulator
 * @param countFailedValues whether to accumulate values from failed tasks
 * @tparam T result type
*/
@deprecated("use AccumulatorV2", "2.0.0")
class Accumulator[T] private[spark] (
    // SI-8813: This must explicitly be a private val, or else scala 2.11 doesn't compile
    @transient private val initialValue: T,
    param: AccumulatorParam[T],
    name: Option[String] = None,
    countFailedValues: Boolean = false)
  extends Accumulable[T, T](initialValue, param, name, countFailedValues)


/**
 * A simpler version of [[org.apache.spark.AccumulableParam]] where the only data type you can add
 * in is the same type as the accumulated value. An implicit AccumulatorParam object needs to be
 * available when you create Accumulators of a specific type.
 *
 * @tparam T type of value to accumulate
 */
@deprecated("use AccumulatorV2", "2.0.0")
trait AccumulatorParam[T] extends AccumulableParam[T, T] {
  def addAccumulator(t1: T, t2: T): T = {
    addInPlace(t1, t2)
  }
}


@deprecated("use AccumulatorV2", "2.0.0")
object AccumulatorParam {

  // The following implicit objects were in SparkContext before 1.2 and users had to
  // `import SparkContext._` to enable them. Now we move them here to make the compiler find
  // them automatically. However, as there are duplicate codes in SparkContext for backward
  // compatibility, please update them accordingly if you modify the following implicit objects.

  @deprecated("use AccumulatorV2", "2.0.0")
  implicit object DoubleAccumulatorParam extends AccumulatorParam[Double] {
    def addInPlace(t1: Double, t2: Double): Double = t1 + t2
    def zero(initialValue: Double): Double = 0.0
  }

  @deprecated("use AccumulatorV2", "2.0.0")
  implicit object IntAccumulatorParam extends AccumulatorParam[Int] {
    def addInPlace(t1: Int, t2: Int): Int = t1 + t2
    def zero(initialValue: Int): Int = 0
  }

  @deprecated("use AccumulatorV2", "2.0.0")
  implicit object LongAccumulatorParam extends AccumulatorParam[Long] {
    def addInPlace(t1: Long, t2: Long): Long = t1 + t2
    def zero(initialValue: Long): Long = 0L
  }

  @deprecated("use AccumulatorV2", "2.0.0")
  implicit object FloatAccumulatorParam extends AccumulatorParam[Float] {
    def addInPlace(t1: Float, t2: Float): Float = t1 + t2
    def zero(initialValue: Float): Float = 0f
  }

  // Note: when merging values, this param just adopts the newer value. This is used only
  // internally for things that shouldn't really be accumulated across tasks, like input
  // read method, which should be the same across all tasks in the same stage.
  @deprecated("use AccumulatorV2", "2.0.0")
  private[spark] object StringAccumulatorParam extends AccumulatorParam[String] {
    def addInPlace(t1: String, t2: String): String = t2
    def zero(initialValue: String): String = ""
  }
}
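
Note that Accumulator and AccumulatorParam have been deprecated since Spark 2.0 in favour of AccumulatorV2. As a minimal sketch of the same counting pattern with the newer API (the object name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorV2Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accV2").setMaster("local"))

    // Built-in numeric accumulator; longAccumulator both creates and registers it.
    val counter = sc.longAccumulator("myCounter")

    // Tasks only add to the accumulator; the driver reads the merged value after the action.
    sc.parallelize(1 to 4).foreach(x => counter.add(x))
    println("counter = " + counter.value)   // 10

    sc.stop()
  }
}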

With a rough understanding of the source, let's look at the following test program.
import org.apache.spark.{SparkConf, SparkContext}

object broadCastTest {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("broadCastTest").setMaster("local")
    val sc = new SparkContext(conf)


    val RDD = sc.parallelize(List(1,2,3))

    //broadcast
    val broadValue1 = sc.broadcast(2)
    val data1 = RDD.map(x => x*broadValue1.value)
    data1.foreach(x => println("broadcast value:"+x))


    //accumulator
    var accumulator = sc.accumulator(2)
    // Incorrect: the accumulator's value is read inside the task below
    val RDD2 = sc.parallelize(List(1,1,1)).map{ x=>
      if(x<3){
        accumulator+=1
      }
      x*accumulator.value
    }//(x => x*accumulator.value)
    // No error yet: map is a lazy transformation, so nothing has executed
    println(RDD2)
    // The error only appears once an action runs; uncommenting the next line triggers it
    //RDD2.foreach(println)
    //   error: Can't read accumulator value in task

    // This works: the tasks only add to the accumulator, they never read its value
    RDD.foreach{x =>
      if(x<3){
        accumulator+=1
      }
    }
    println("accumulator is "+accumulator.value)
    // The accumulator demonstrates two points:
    // (1) accumulators are only updated when an action is executed
    // (2) tasks cannot read an accumulator's value; only the driver program can

    sc.stop()
  }

}
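
If a task needs both to update a counter and to read a shared value, the read side should go through a broadcast variable rather than the accumulator. A minimal sketch of that corrected pattern, assuming a live SparkContext sc as above (the names factor, counted and RDD3 are illustrative):

// Tasks write to the accumulator but read shared data only through a broadcast.
val factor = sc.broadcast(2)
val counted = sc.accumulator(0)

val RDD3 = sc.parallelize(List(1, 1, 1)).map { x =>
  if (x < 3) {
    counted += 1          // write-only from the task
  }
  x * factor.value        // read-only broadcast value
}
RDD3.foreach(println)     // the action runs the tasks without the error above
println("counted is " + counted.value)   // read the accumulator only on the driver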

As the Accumulator source shows, we can implement our own accumulator through the AccumulatorParam trait.
It has two methods (shown here with the Double implementations from the source):
def addInPlace(t1: T, t2: T): T    // e.g. t1 + t2
def zero(initialValue: T): T       // e.g. 0.0
Below we write one for a type of our own.

import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

object listAccumulatorParam extends AccumulatorParam[List[Double]] {
  def zero(initialValue: List[Double]): List[Double] = {
    Nil
  }
  def addInPlace(v1: List[Double], v2: List[Double]): List[Double] = {
    v1 ::: v2
  }
}

object broadCastTest {


  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("broadCastTest").setMaster("local")
    val sc = new SparkContext(conf)

    val myAccumulator = sc.accumulator[List[Double]](List(0.1,0.2,0.3))(listAccumulatorParam)


    println("my accumulator is "+myAccumulator.value)
    //my accumulator is List(0.1, 0.2, 0.3)

    sc.stop()
  }

}
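
The example above only prints the initial value and never adds to the accumulator. As a small follow-up sketch (reusing listAccumulatorParam and a SparkContext sc; the element values are made up), tasks append to the list and the driver reads the merged result:

val myAccumulator = sc.accumulator[List[Double]](Nil)(listAccumulatorParam)

// Each task adds a one-element list; addInPlace concatenates the partial lists.
sc.parallelize(List(0.5, 1.0, 1.5)).foreach(x => myAccumulator += List(x * 2))

println("my accumulator is " + myAccumulator.value)
// e.g. my accumulator is List(1.0, 2.0, 3.0)  (ordering depends on partitioning)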

Using Accumulators and Broadcast in Spark Streaming

By broadcasting a set of specific strings and then filtering on them, we can, for example, filter out certain people's names, i.e. implement a blacklist filter. Below we filter out the three strings a, b, c. One letter per second is produced from the data below (a small data-generator sketch follows the list):

a
b
c
d
e
f
g
h
i
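
The streaming program below reads from a socket on port 9999 of a host named master. For testing, running `nc -lk 9999` and typing the letters by hand works; as a hedged alternative, a tiny Scala generator (the object name is illustrative) that emits one letter per second to the first connected client:

import java.io.PrintWriter
import java.net.ServerSocket

object LetterServer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    val socket = server.accept()                              // wait for the receiver to connect
    val out = new PrintWriter(socket.getOutputStream, true)   // autoflush each line
    val letters = Seq("a", "b", "c", "d", "e", "f", "g", "h", "i")
    var i = 0
    while (true) {
      out.println(letters(i % letters.length))                // one letter...
      Thread.sleep(1000)                                      // ...per second
      i += 1
    }
  }
}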

The Spark Streaming filtering program is as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{Accumulator, SparkConf}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

object broadCastTest {
  @volatile private var broadcastValue: Broadcast[Seq[String]] = null
  @volatile private var accumulatorValue:Accumulator[Int] = null

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("broadCastTest").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    broadcastValue = ssc.sparkContext.broadcast(Seq("a","b","c"))
    accumulatorValue = ssc.sparkContext.accumulator(0, "OnlineBlacklistCounter")

    val linesData = ssc.socketTextStream("master",9999)
    val wordCount = linesData.map(x =>(x,1)).reduceByKey(_+_)

 
    val counts = wordCount.filter{ case (word,count) =>
        if(broadcastValue.value.contains(word)){
          accumulatorValue += count
          //println("have blocked "+accumulatorValue+" times")
          false
        }else{
          //println("have blocked "+accumulatorValue+" times")
          true
        }
    }
    //println("broadcastValue:"+broadcastValue.value)
    counts.print()
    //wordCount.print()

    ssc.start()
    ssc.awaitTermination()


  }

}

The results obtained are as follows: