Spark Learning (6)
阿新 · Published: 2018-12-09
Spark Streaming
How it works
- After the ssc (StreamingContext) starts, the driver runs a long-running task.
- Executors acting as Receivers accept the incoming data and split it into blocks held in memory.
- These blocks are also replicated to other executors so the data is not lost if a receiver fails (see the sketch after this list).
- At every batch interval (commonly 1 second), the driver launches Spark tasks to process the blocks; the results can then be persisted to any number of targets.
- When Spark Streaming is wired to a TCP source, it requests the connection and pulls data once per second rather than holding one long-lived connection.
- Storage targets include cloud storage (e.g. S3, WASB), relational stores (e.g. MySQL, PostgreSQL), and NoSQL stores.
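As a rough sketch of this flow (the host, port, and app name below are illustrative, not from the original post), the receiver's storage level is what controls the in-memory replication mentioned above:

```python
# Minimal sketch of the receiver flow described above.
# Host, port, and app name are illustrative.
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "ReceiverSketch")  # >= 2 threads: one receiver, one for processing
ssc = StreamingContext(sc, 1)                    # batch interval of 1 second

# MEMORY_AND_DISK_2 asks Spark to keep a second replica of each received block
# (on a real cluster the replica lands on another executor).
lines = ssc.socketTextStream("localhost", 9999,
                             storageLevel=StorageLevel.MEMORY_AND_DISK_2)

# Each batch interval the driver launches tasks over the stored blocks.
lines.count().pprint()

ssc.start()
ssc.awaitTermination()
```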
DStreams
- A sequence of data arriving over time, represented as a series of RDDs, one per time interval (sketched below).
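A quick way to see the "sequence of RDDs" idea is `foreachRDD`, which hands you the RDD for one batch at a time. A sketch, assuming a StreamingContext `ssc` and a socket source like the ones used later in this post:

```python
# Sketch: each batch interval yields one RDD; foreachRDD exposes it.
# Assumes an existing StreamingContext `ssc`, as in the later examples.
lines = ssc.socketTextStream("localhost", 9999)

def show_batch(time, rdd):
    # `rdd` holds only the records that arrived during this batch interval
    print("Batch at %s holds %d records" % (time, rdd.count()))

lines.foreachRDD(show_batch)
```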
Windowed transformations
- The window of data over which each computation runs (see the example below).
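For instance, a word count over the last 30 seconds, recomputed every 10 seconds, could look like the following sketch (it assumes an `ssc` whose batch interval divides both durations; the host and port are illustrative):

```python
# Sketch: word counts over a 30-second window, sliding every 10 seconds.
# Assumes a StreamingContext `ssc` whose batch interval divides 10 and 30.
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1))

# Window length 30s, slide interval 10s; no inverse reduce function is given,
# so the whole window is recomputed on each slide.
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10)
windowed_counts.pprint()
```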
Checkpointing
- The purpose is to save data to a reliable file system, such as HDFS:
ssc.checkpoint("hdfs://...")
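A common companion pattern is `StreamingContext.getOrCreate`, so a restarted driver can rebuild the context from the checkpoint directory. A sketch, where the HDFS path and setup function are illustrative:

```python
# Sketch: recoverable StreamingContext backed by a checkpoint directory.
# The HDFS path and setup function are illustrative.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///tmp/streaming-checkpoint"

def create_context():
    sc = SparkContext("local[2]", "CheckpointedApp")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint(checkpoint_dir)   # state and metadata are written to HDFS
    # ... define the DStream pipeline here ...
    return ssc

# Reuse the checkpointed context if it exists, otherwise build a new one.
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()
```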
Load
SparkContext
```python
# Create a local SparkContext and Streaming Context
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create sc with two working threads (app name is illustrative)
sc = SparkContext("local[2]", "NetworkWordCount")

# Create a local StreamingContext with a batch interval of 1 second
ssc = StreamingContext(sc, 1)
```
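Building on that context, the classic DStream word count (the `pprint()` version that the SparkSession script below replaces) might look like this sketch, assuming a netcat-style text source on localhost:9999:

```python
# Sketch: DStream word count over a socket source (host/port are illustrative).
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.pprint()   # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()
```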
SparkSession
""" 與之前的指令碼不同,現在使用的是更熟悉的指令碼 先建立一個session 這裡不需要再去建立一個SparkStreaming """ # Import the necessary classes and create a local SparkSession from pyspark.sql import SparkSession from pyspark.sql.functions import explode from pyspark.sql.functions import split spark = SparkSession \ .builder \ .appName("StructuredNetworkWordCount") \ .getOrCreate() """ SparkStream 通過在第4行呼叫readStream來發起的 """ # Create DataFrame representing the stream of input lines # from connection to localhost:9999 lines = spark\ .readStream\ .format('socket')\ .option('host', 'localhost')\ .option('port', 9999)\ .load() """ 在這裡就不需要使用RDD的複雜操作, 直接使用SQL便可以 """ # Split the lines into words words = lines.select( explode( split(lines.value, ' ') ).alias('word') ) # Generate running word count wordCounts = words.groupBy('word').count() """ 沒有使用pprint(),而是顯式地呼叫writeStream來編寫 流,並定義格式和輸出模式。雖然寫的時間要長一些, 這些方法和屬性在語法上與其他DataFrame呼叫相似 你只需要改變outputMode和格式屬性來儲存它 對於資料庫、檔案系統、控制檯等等。 """ # Start running the query that prints the # running counts to the console query = wordCounts\ .writeStream\ .outputMode('complete')\ .format('console')\ .start() """ 最後,執行等待取消這個流媒體工作。 """ # Await Spark Streaming termination query.awaitTermination()
Save to text
ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")
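For context, `ipAddressRequestCount` would be a DStream of (ip, count) pairs. A sketch of how such a DStream might be produced and saved, where the log format and field position are assumptions:

```python
# Sketch: build an (ip, count) DStream and save each batch as text files.
# Assumes log lines whose first whitespace-separated field is the client IP.
access_logs = ssc.socketTextStream("localhost", 9999)
ipAddressRequestCount = access_logs \
    .map(lambda line: (line.split(" ")[0], 1)) \
    .reduceByKey(lambda a, b: a + b)

# Each batch is written to its own directory named outputDir-<batch time>.txt
ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")
```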
Example
```python
# Create a local SparkContext and Streaming Contexts
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
import datetime
# from pyspark.sql import SQLContext
# from pyspark.sql.types import *
import numpy as np

# Create sc with four working threads
sconf = SparkConf()
sconf.setMaster("local[4]")
sc = SparkContext(appName="act_analysis", conf=sconf)

# Create local StreamingContext with batch interval of 3 seconds
ssc = StreamingContext(sc, 3)

# Create DStream that connects to 192.168.14.2:1234
lines = ssc.socketTextStream("192.168.14.2", 1234)

def get_rows(x):
    # Each line is "<role_id> <timestamp>"
    res = x.split(" ")
    return (res[0], res[1])

rows = lines.map(get_rows)
roles = rows.groupByKey()

def get_speed_std(action):
    # Standard deviation of the gaps between consecutive action timestamps
    temp_list = []
    temp = 0
    for i in action:
        if temp == 0:
            temp_list.append(0)
            temp = i
            continue
        temp_list.append((i - temp))
        temp = i
    res = np.std(np.array(temp_list), ddof=1)
    return res

def get_role_action_feature(action):
    try:
        action = sorted(action)
    except Exception as e:
        print(e)
        return None
    count = len(action)
    stay_time = int(action[-1] - action[0])
    try:
        ave_speed = count / stay_time
        std_speed = get_speed_std(action)
    except:
        ave_speed = 0
        std_speed = 0
    return ave_speed, std_speed

def feature(x):
    role = x[0]
    action = [float(i) for i in list(x[1])]
    res = get_role_action_feature(action)
    return (role, res[0], res[1], datetime.datetime.now())

res = roles.map(feature)
# res.pprint()

from pymongo import MongoClient

MONGODB = {
    'MONGO_HOST': '',
    'MONGO_PORT': '27017',
    'MONGO_USERNAME': '',
    'MONGO_PASSWORD': ''
}

mongo_uri = 'mongodb://{account}{host}:{port}/'.format(
    account='{username}:{password}@'.format(
        username=MONGODB['MONGO_USERNAME'],
        password=MONGODB['MONGO_PASSWORD']) if MONGODB['MONGO_USERNAME'] else '',
    host=MONGODB['MONGO_HOST'],
    port=MONGODB['MONGO_PORT'])

conn = MongoClient(mongo_uri)
db = conn['104']

def save(x):
    # Runs on the driver for each batch RDD
    col = db['act_live_res']
    res = x.collect()
    for i in res:
        item = {
            'role_id': i[0],
            'ave': i[1],
            'std': i[2],
            'date': i[3]
        }
        col.insert_one(item)
    # conn.close()

res.foreachRDD(save)

ssc.start()
# Wait for the computation to terminate
ssc.awaitTermination()
```
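One design note on the `save` function above: `x.collect()` pulls every batch back to the driver before inserting into MongoDB, which only works while batches stay small. A common alternative is to write from the executors with `foreachPartition`, creating the client inside each partition. A sketch reusing the same (hypothetical) database and collection names, and assuming pymongo is installed on the executors:

```python
# Sketch: insert each partition into MongoDB on the executors instead of
# collecting the whole RDD to the driver. Reuses `mongo_uri` from above.
def save_partition(records):
    conn = MongoClient(mongo_uri)          # one client per partition, on the executor
    col = conn['104']['act_live_res']
    for role_id, ave, std, date in records:
        col.insert_one({'role_id': role_id, 'ave': ave, 'std': std, 'date': date})
    conn.close()

def save_rdd(rdd):
    rdd.foreachPartition(save_partition)

res.foreachRDD(save_rdd)
```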