Spark Development: Reading and Writing Data with SparkSQL

阿新 • Published: 2021-01-08

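Reading and Writing Data with SparkSQL

 DataFrameReader
 DataFrameWriter
   A DataFrameReader call is built from three components: SCHEMA, OPTION, and FORMAT.
   DataFrameReader can be used in two ways:
      1. Call load, with format specifying the source format.
      2. Call a shorthand method such as csv, json, or jdbc.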

        // 1. First form: read + format + load
        spark.read
          .format("csv")
          .option("header", value = true)
          .option("inferSchema", value = true)
          .load("dataset/1231.csv")

        // 2. Second form: read + a format-specific method
        spark.read
          .option("header", value = true)
          .option("inferSchema", value = true)
          .csv("dataset/1231.csv")
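Both forms above rely on inferSchema, so the SCHEMA component is never shown explicitly. The following is a minimal sketch of supplying a schema by hand instead. The column names `id` and `name` are assumptions for illustration only; the real columns of dataset/1231.csv are not given in this post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ReadWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-with-schema")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical columns: replace with the real layout of the file.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))

    // SCHEMA + OPTION + FORMAT, then load.
    val df = spark.read
      .format("csv")
      .schema(schema)                  // explicit schema, so no inference pass over the data
      .option("header", value = true)  // first line holds column names
      .load("dataset/1231.csv")

    df.printSchema()
    spark.stop()
  }
}
```

Supplying the schema avoids the extra pass that inferSchema needs to guess column types, and it makes the expected types explicit in the code.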
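   DataFrameWriter can also be used in two ways:
      1. Call save together with format. When no format is given, the default is Parquet.
      2. Call a shorthand method such as csv, json, or saveAsTable.
   In the Spark source, the file-format shorthands simply delegate to format(...).save(...):
        def parquet(path: String): Unit = { format("parquet").save(path) }
        def csv(path: String): Unit = { format("csv").save(path) }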
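As a rough illustration of the two write styles, the sketch below writes a tiny in-memory DataFrame three ways. The output paths under `output/` and the sample rows are made up for the example.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteTwoWays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-two-ways")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data, invented for this example.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // Form 1: format + save.
    df.write
      .format("csv")
      .option("header", value = true)
      .mode(SaveMode.Overwrite)
      .save("output/demo_csv")

    // Form 2: shorthand method, equivalent to format("json").save(...).
    df.write
      .mode(SaveMode.Overwrite)
      .json("output/demo_json")

    // No format at all: Spark falls back to spark.sql.sources.default,
    // which is parquet unless the configuration has been changed.
    df.write
      .mode(SaveMode.Overwrite)
      .save("output/demo_default")

    spark.stop()
  }
}
```

SaveMode.Overwrite is used here only so the example can be rerun; without an explicit mode, the writer fails if the target path already exists.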

DataFrameWriter

Incremental writes

For incremental (append) writes with Spark, there are two options:
   insertInto and mode(SaveMode.Append).saveAsTable()

insertInto
    insertInto resolves columns by position. From the DataFrameWriter documentation:
     It requires that the schema of the `DataFrame` is the same as the schema of the table.
     Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based resolution.
    
saveAsTable
    saveAsTable resolves columns by name, and its behavior depends on the save mode. From the DataFrameWriter documentation:
     With SaveMode.Overwrite:
       The schema of the `DataFrame` does not need to be the same as that of the existing table.
     With `Append`:
       If there is an existing table, we will use the format and options of the existing table.
       The column order in the schema of the `DataFrame` doesn't need to be same as that of the existing table.
       Unlike `insertInto`, `saveAsTable` will use the column names to find the correct column positions.
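The difference between name-based and position-based resolution is easiest to see when the DataFrame's columns are in a different order from the table's. The sketch below is one way to try it locally; the table names `demo_by_name` and `demo_by_position` and the sample rows are hypothetical, and both columns are strings so that no type casts get in the way.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object AppendByNameOrPosition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("append-by-name-or-position")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two identical target tables with columns (id, name).
    val initial = Seq(("1", "alice")).toDF("id", "name")
    initial.write.mode(SaveMode.Overwrite).saveAsTable("demo_by_name")
    initial.write.mode(SaveMode.Overwrite).saveAsTable("demo_by_position")

    // New rows whose columns are in the opposite order: (name, id).
    val swapped = Seq(("bob", "2")).toDF("name", "id")

    // saveAsTable + Append matches columns by name.
    swapped.write.mode(SaveMode.Append).saveAsTable("demo_by_name")

    // insertInto ignores names and matches by position,
    // so "bob" is written into the table's first column (id).
    swapped.write.mode(SaveMode.Append).insertInto("demo_by_position")

    spark.table("demo_by_name").show()
    spark.table("demo_by_position").show()

    spark.stop()
  }
}
```

Going by the documentation quoted above, the first table should keep `bob` under `name`, while the second should end up with `bob` under `id`. When using insertInto, a select that puts the columns in the table's order before writing avoids this kind of silent mix-up.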
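  #### Overwrite
   In PySpark, insertInto takes an overwrite flag:
   report_info.write.insertInto(tableName="dwd.t_dwd_report_info_d", overwrite=True)
   The Scala equivalent is report_info.write.mode(SaveMode.Overwrite).insertInto("dwd.t_dwd_report_info_d").
   DataFrameWriter can save data into Hive tables, so partitioning and bucketing information can also be specified.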
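A minimal sketch of specifying partitioning and bucketing on a write follows. The table name `demo_partitioned_bucketed`, the columns `id`, `name`, and `dt`, and the bucket count of 4 are all assumptions chosen for illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionAndBucket {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-and-bucket")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data, invented for this example.
    val df = Seq(
      (1, "a", "2021-01-07"),
      (2, "b", "2021-01-08")
    ).toDF("id", "name", "dt")

    df.write
      .partitionBy("dt")      // one directory per distinct dt value
      .bucketBy(4, "id")      // hash rows into 4 buckets by id
      .sortBy("id")           // sort rows within each bucket
      .mode(SaveMode.Overwrite)
      .saveAsTable("demo_partitioned_bucketed")

    spark.sql("DESCRIBE FORMATTED demo_partitioned_bucketed").show(100, truncate = false)
    spark.stop()
  }
}
```

Note that bucketBy only works together with saveAsTable; a plain path-based save rejects it, because the bucketing metadata has to be recorded in the table catalog.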
	 
For data that is already stored, consider:
    the schema
    whether existing files are overwritten or appended to

Reading and writing in Spark

  From the RDD Programming Guide:
  Spark can create distributed datasets from any storage source supported by Hadoop,
  including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
  Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Writing
  saveAsTextFile, saveAsSequenceFile, saveAsObjectFile
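The RDD save actions can be exercised with a short sketch like the one below. The output directories under `output/` and the sample pairs are made up, and each save fails if its target directory already exists.

```scala
import org.apache.spark.sql.SparkSession

object RddSaveActions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-save-actions")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A small key-value RDD split into two partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 2)

    // Plain text: one line per element, one part file per partition.
    pairs.map { case (k, v) => s"$k,$v" }.saveAsTextFile("output/demo_text")

    // Hadoop SequenceFile: available on key-value RDDs.
    pairs.saveAsSequenceFile("output/demo_seq")

    // Java-serialized objects.
    pairs.saveAsObjectFile("output/demo_obj")

    // Read the text output back as an RDD of lines.
    val lines = sc.textFile("output/demo_text")
    lines.collect().foreach(println)

    spark.stop()
  }
}
```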

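Storage considerations

  The storage medium
  The data format
  The command or function used to write
  Problems encountered when writing
    The small-files problem
     Each partition is typically written out as its own file, so a high partition count at write time produces many small files.
     spark.default.parallelism      only takes effect for RDD operations; it has no effect on Spark SQL.
     spark.sql.shuffle.partitions   is the Spark SQL setting; it controls how many partitions a shuffle (join or aggregation) produces.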
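One common way to limit the number of output files is to lower the shuffle partition count and, if needed, reduce partitions again just before the write. The sketch below assumes a toy aggregation; the value 8, the output path, and the choice of coalesce(1) are illustrative, not recommendations.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object FewerOutputFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fewer-output-files")
      .master("local[*]")
      // Spark SQL shuffles use this setting, not spark.default.parallelism.
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()
    import spark.implicits._

    // The groupBy triggers a shuffle, so its result is laid out
    // according to spark.sql.shuffle.partitions.
    val counts = spark.range(0, 1000)
      .groupBy(($"id" % 10).as("bucket"))
      .count()

    // Adaptive query execution, where enabled, may merge small shuffle partitions.
    println(s"partitions after shuffle: ${counts.rdd.getNumPartitions}")

    // Merge partitions before writing so that fewer files are produced.
    counts.coalesce(1)
      .write
      .mode(SaveMode.Overwrite)
      .parquet("output/demo_few_files")

    spark.stop()
  }
}
```

coalesce avoids a full shuffle but can concentrate work on few tasks; repartition(n) spreads data evenly at the cost of an extra shuffle. Which one fits depends on the data volume.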

Source code

org.apache.spark.sql (DataFrameWriter)
  Interface used to write a [[Dataset]] to external storage systems (e.g. file systems, key-value stores, etc).
  Use `Dataset.write` to access this.
  The class follows the builder pattern; the source marks that section with the comment `// Builder pattern config options`.

References

	https://github.com/apache/spark/pull/13013
	The difference between saveAsTable and insertInto when writing to Hive tables with Spark https://blog.csdn.net/huihuixia123/article/details/107658299
	Spark (3): SparkSQL extensions (data reading and writing): reading and writing Parquet and JSON files (2) https://blog.csdn.net/qq_18800463/article/details/101421490
	Spark SQL, DataFrames and Datasets Guide http://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
	RDD Programming Guide http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-programming-guide
