[Repost] [Must-Know] Window Operations in Spark Streaming, with Hands-On Examples

阿新 • Published: 2021-07-21

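Window operations let you set a window length and a slide interval so that you can track the current state of a stream over a moving time range. A window-based operation covers a time range longer than the StreamingContext's batchDuration (batch interval): it combines the results of several batches to compute a result for the whole window.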

The figure below illustrates window operations in Spark Streaming.

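A window-based operation takes two parameters:

Window length (windowDuration): how much recent data, measured in batches, each computation covers;
Slide interval (slideDuration): how often the computation runs and the new DStream produces a result.

Both must be integer multiples of the StreamingContext's batch interval (batchDuration).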
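
To make the relationship between the three durations concrete, here is a small plain-Scala sketch with no Spark dependency. It is a simplified model, not Spark's implementation: the object `WindowSketch` and its methods are hypothetical names, and it assumes that a window fires at every multiple of the slide interval and covers the batches that ended within the last windowDuration seconds.

```scala
// Simplified model of window selection; names are illustrative, not Spark APIs.
object WindowSketch {
  /** Both durations must be whole multiples of the batch interval. */
  def isValid(batchSec: Int, windowSec: Int, slideSec: Int): Boolean =
    windowSec % batchSec == 0 && slideSec % batchSec == 0

  /** For each window firing time up to `untilSec`, list the end times (in seconds)
    * of the batches that the window covers. */
  def windows(batchSec: Int, windowSec: Int, slideSec: Int, untilSec: Int): Seq[(Int, Seq[Int])] = {
    require(isValid(batchSec, windowSec, slideSec),
      "window length and slide interval must be multiples of the batch interval")
    (slideSec to untilSec by slideSec).map { fireAt =>
      // A window fired at time t covers batches ending in (t - windowSec, t].
      // Early windows are partial because no batches exist before time 0.
      val covered = ((fireAt - windowSec + batchSec) to fireAt by batchSec).filter(_ > 0)
      fireAt -> covered
    }
  }

  def main(args: Array[String]): Unit =
    windows(batchSec = 5, windowSec = 20, slideSec = 10, untilSec = 40).foreach {
      case (t, bs) => println(s"window at ${t}s covers batches ending at ${bs.mkString(", ")}")
    }
}
```

With a 5-second batch, a 20-second window and a 10-second slide, each full window spans four batches and consecutive windows share two of them.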

The rest of this post puts window operations into practice.

First, a data source that sends one number per second:

package cn.lagou.streaming

import java.io.PrintWriter
import java.net.{ServerSocket, Socket}

// A minimal socket server: accepts one client and sends it one number per second
object SocketLikeNCWithWindow {
    def main(args: Array[String]): Unit = {
        val port = 1521
        val ss = new ServerSocket(port)
        // Blocks until a client (the streaming job) connects
        val socket: Socket = ss.accept()
        println("connect to host : " + socket.getInetAddress)
        // Create the writer once instead of on every iteration
        val out = new PrintWriter(socket.getOutputStream)
        var i = 0
        // Send one number per second
        while (true) {
            i += 1
            out.println(i)
            out.flush()
            Thread.sleep(1000)
        }
    }
}

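Example 1

Inspect the data in each window, observe how batchDuration, windowDuration and slideDuration relate to one another, and try out the window-related operations. The code is as follows: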

package cn.lagou.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowDemo {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName(this.getClass.getCanonicalName)
        // Generate one RDD (mini-batch) every 5 seconds
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.sparkContext.setLogLevel("ERROR")

        // Connect to the number sender above
        val lines: ReceiverInputDStream[String] =
            ssc.socketTextStream("localhost", 1521)

        // Print the contents of every batch
        lines.foreachRDD { (rdd, time) =>
            println(s"rdd = ${rdd.id}; time = $time")
            rdd.foreach(value => println(value))
        }

        // Concatenate all elements in the window: 20s window, evaluated every 10s
        val res1: DStream[String] =
            lines.reduceByWindow(_ + " " + _, Seconds(20), Seconds(10))
        res1.print()

        // The raw elements of the same window
        val res2: DStream[String] = lines.window(Seconds(20), Seconds(10))
        res2.print()

        // Sum of the elements in the window
        val res3: DStream[Int] =
            lines.map(_.toInt).reduceByWindow(_ + _, Seconds(20), Seconds(10))
        res3.print()

        // The same sum, computed by reducing the windowed DStream
        val res4 = res2.map(_.toInt).reduce(_ + _)
        res4.print()

        ssc.start()
        ssc.awaitTermination()
    }
}
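
To see what Example 1 should print, the following plain-Scala sketch replays the same settings on an idealized input. It assumes exactly one number arrives per second and that every 5-second batch holds exactly five numbers; a real run depends on receiver timing, so actual batch boundaries can shift. `WindowDemoSketch` and its methods are hypothetical names, not Spark APIs.

```scala
// Idealized replay of Example 1 using plain collections (no Spark dependency).
object WindowDemoSketch {
  /** Numbers 1..n, one per second, grouped into batches of `batchSec` elements. */
  def batches(n: Int, batchSec: Int): Vector[Vector[Int]] =
    (1 to n).toVector.grouped(batchSec).toVector

  /** A window spans `windowBatches` batches and is evaluated every `slideBatches`
    * batches. The first windows are partial because earlier batches do not exist. */
  def windowed(bs: Vector[Vector[Int]], windowBatches: Int, slideBatches: Int): Vector[Vector[Int]] =
    (slideBatches to bs.length by slideBatches).toVector.map { end =>
      bs.slice(math.max(0, end - windowBatches), end).flatten
    }

  def main(args: Array[String]): Unit = {
    // 5s batches, 20s window (4 batches), 10s slide (2 batches), 30s of input
    val ws = windowed(batches(n = 30, batchSec = 5), windowBatches = 4, slideBatches = 2)
    ws.foreach { w =>
      println(w.mkString(" ")) // corresponds to the concatenation in res1
      println(w.sum)           // corresponds to the sums in res3 and res4
    }
  }
}
```

Under these assumptions the three windows contain 1 to 10, 1 to 20 and 11 to 30, so the sums are 55, 210 and 410. Reducing inside the window (res3) and reducing the windowed stream afterwards (res4) give the same value.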

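Example 2

Real-time statistics for trending search terms: every 10 seconds, count how often each word occurred during the last 20 seconds. The code is as follows: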

package cn.lagou.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HotWordStats {
    def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf()
            .setMaster("local[2]")
            .setAppName(this.getClass.getCanonicalName)
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.sparkContext.setLogLevel("ERROR")
        // Set the checkpoint directory, which provides fault tolerance.
        // In production this should point to HDFS.
        ssc.checkpoint("data/checkpoint/")

        // Reads text lines from a socket source listening on port 9999
        val lines: ReceiverInputDStream[String] =
            ssc.socketTextStream("localhost", 9999)
        val words: DStream[String] = lines.flatMap(_.split("\\s+"))
        val pairs: DStream[(String, Int)] = words.map(x => (x, 1))

        // reduceByKeyAndWindow: every 10s, count the words seen in the last 20s.
        // Last three arguments: window length, slide interval, number of partitions
        val wordCounts1: DStream[(String, Int)] = pairs.reduceByKeyAndWindow(
            (a: Int, b: Int) => a + b, Seconds(20), Seconds(10), 2)
        wordCounts1.print

        // This variant requires checkpointing: it updates the previous window's
        // result by adding the batches that enter the window (_ + _) and
        // subtracting the batches that leave it (_ - _)
        val wordCounts2: DStream[(String, Int)] = pairs.reduceByKeyAndWindow(
            _ + _, _ - _, Seconds(20), Seconds(10), 2)
        wordCounts2.print

        ssc.start()
        ssc.awaitTermination()
    }
}
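
The difference between the two reduceByKeyAndWindow variants in Example 2 can be sketched in plain Scala. This is an illustration of the idea only, not Spark's internal code: `IncrementalWindowSketch`, `full` and `incremental` are hypothetical names, and each batch is modeled as a map from word to count.

```scala
// Full recomputation versus incremental update of a windowed word count.
object IncrementalWindowSketch {
  type Counts = Map[String, Int]

  /** Combine two count maps key by key with `op`, treating a missing key as 0. */
  def merge(a: Counts, b: Counts, op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).map(k => k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0))).toMap

  /** First variant: reduce every batch in the window from scratch. */
  def full(window: Seq[Counts]): Counts =
    window.foldLeft(Map.empty[String, Int])((acc, b) => merge(acc, b, (x, y) => x + y))

  /** Second variant: previous result + entering batches - leaving batches.
    * Keys whose count drops to 0 are filtered out at the end. */
  def incremental(prev: Counts, entering: Seq[Counts], leaving: Seq[Counts]): Counts = {
    val added = entering.foldLeft(prev)((acc, b) => merge(acc, b, (x, y) => x + y))
    val subtracted = leaving.foldLeft(added)((acc, b) => merge(acc, b, (x, y) => x - y))
    subtracted.filter { case (_, count) => count > 0 }
  }

  def main(args: Array[String]): Unit = {
    val b1 = Map("spark" -> 2, "flink" -> 1)
    val b2 = Map("spark" -> 1)
    val b3 = Map("kafka" -> 3)
    val prev = full(Seq(b1, b2))                      // window over b1, b2
    val next = incremental(prev, Seq(b3), Seq(b1))    // slide: b3 enters, b1 leaves
    println(next == full(Seq(b2, b3)))                // both approaches agree
  }
}
```

The incremental form touches only the batches that enter or leave the window, which pays off when the window is much longer than the slide interval. Because it depends on the previous window's result, that state has to be kept, which is why the second variant needs a checkpoint directory. Without the final filter, words that have left the window would remain in the result with a count of 0.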
