11.spark sql之RDD轉換DataSet

阿新 • • 發佈：2018-09-09

Once lds nco ldd 方法 att context gin statement

簡介

??Spark SQL提供了兩種方式用於將RDD轉換為Dataset。

使用反射機制推斷RDD的數據結構

??當spark應用可以推斷RDD數據結構時，可使用這種方式。這種基於反射的方法可以使代碼更簡潔有效。

通過編程接口構造一個數據結構，然後映射到RDD上

??當spark應用無法推斷RDD數據結構時，可使用這種方式。

反射方式

scala

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))

java

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");

// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    stringEncoder);
teenagerNamesByIndexDF.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"),
    stringEncoder);
teenagerNamesByFieldDF.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

python

from pyspark.sql import Row

sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are Dataframe objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin

編程方式

scala

import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

java

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    Encoders.STRING());
namesDS.show();
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

python

# Import data types
from pyspark.sql.types import *

sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)

# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql("SELECT name FROM people")

results.show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

11.spark sql之RDD轉換DataSet

Once lds nco ldd 方法 att context gin statement 簡介 ??Spark SQL提供了兩種方式用於將RDD轉換為Dataset。使用反射機制推斷RDD的數據結構 ??當spark應用可以推斷RDD數據結構時，可使用這種方式。這種

Spark SQL中 RDD 轉換到 DataFrame

pre ase replace 推斷 expr context 利用反射轉換 port 1.people.txtsoyo8, 35小周, 30小華, 19soyo,882./** * Created by soyo on 17-10-10. * 利用反射機制推斷RDD

Spark SQL將rdd轉換為資料集-以程式設計方式指定模式（Programmatically Specifying the Schema）

一：解釋官網：https://spark.apache.org/docs/latest/sql-getting-started.html 這種場景是生活中的常態 When case classes cannot be defined ahead of time (for example

《深入理解Spark》之RDD轉換DataFrame的兩種方式的比較

package com.lyzx.day19 import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.{SparkConf, Spark

Spark 系列（八）—— Spark SQL 之 DataFrame 和 Dataset

## 一、Spark SQL簡介 Spark SQL 是 Spark 中的一個子模組，主要用於操作結構化資料。它具有以下特點： + 能夠將 SQL 查詢與 Spark 程式無縫混合，允許您使用 SQL 或 DataFrame API 對結構化資料進行查詢； + 支援多種開發語言； + 支援

《深入理解Spark》之RDD和DataFrame的相互轉換

package com.lyzx.day18 import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.s

spark基礎之RDD和DataFrame的轉換方式

一通過定義Case Class,使用反射推斷Schema 定義Case Class，在RDD的轉換過程中使用Case Class可以隱式轉換成SchemaRDD,然後再註冊成表，然後就可以利用sql

Spark-Sql之DataFrame實戰詳解

集合 case 編程方式優化所表 register 操作數 print ava 1、DataFrame簡介：在Spark中，DataFrame是一種以RDD為基礎的分布式數據據集，類似於傳統數據庫聽二維表格，DataFrame帶有Schema元信息，即DataFram

Spark SQL 之 Join 實現

結構很多找到過濾 sql查詢優化 ade read 轉換成分析原文地址：Spark SQL 之 Join 實現 Spark SQL 之 Join 實現塗小剛 2017-07-19 217標簽： spark ，數據庫 Join作為SQL中

spark筆記之RDD的緩存

process color RoCE 就是發現 mark 其他動作 blog Spark速度非常快的原因之一，就是在不同操作中可以在內存中持久化或者緩存數據集。當持久化某個RDD後，每一個節點都將把計算分區結果保存在內存中，對此RDD或衍生出的RDD進行的其他動作中重用

spark core之RDD編程

緩存 code 會有核心 hdf 機器 end action rdd spark提供了對數據的核心抽象——彈性分布式數據集（Resilient Distributed Dataset，簡稱RDD）。RDD是一個分布式的數據集合，數據可以跨越集群中的

spark筆記之RDD容錯機制之checkpoint

原理 chain for 機制方式方法相對例如 contex 10.checkpoint是什麽（1）、Spark 在生產環境下經常會面臨transformation的RDD非常多（例如一個Job中包含1萬個RDD）或者具體transformation的RDD本身計算

12.spark sql之讀寫數據

rcfile serializa fig jdbc連接 nco .sh nat 字段 jdb 簡介 ??Spark SQL支持多種結構化數據源，輕松從各種數據源中讀取Row對象。這些數據源包括Parquet、JSON、Hive表及關系型數據庫等。 ??當只使用一部分字段時，

10.spark sql之快速入門

繼續 ssi org 特性 splay ssa select 開始 pytho 前世今生 Hive&Shark ??隨著大數據時代的來臨，Hadoop風靡一時。為了使熟悉RDBMS但又不理解MapReduce的技術人員快速進行大數據開發，Hive應運而生。Hive是

Spark Streaming整合Spark SQL之wordcount案例

完整原始碼地址： https://github.com/apache/spark/blob/v2.3.2/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala 案例原

十三.Spark SQL之通過Zeppelin進行統計資料的展示

Spark SQL學習有一段時間了,因此花了一些時間寫了一個日誌清洗的專案,專案已經上傳到github上了, 專案地址感興趣的可以拉下來看看。在這裡我不講關於專案的實現過程,清洗之後的結果進行資料展示的時候,除了echarts框架,還發

十六.Spark SQL之讀取複雜的json資料

第一步.準備json資料 test.json {"name":"liguohui","nums":[1,2,3,4,5]} {"name":"zhangsan","nums":[6,7,8,9,10]} test2.json {"name":"Yin","ad

Spark學習之RDD

1、由圖可知每一個RDD由一系列partition組成。 2、例如將flatMap作用在每一個分割槽上，即父RDD作為flatMap的輸入，子RDD作為flatMap的輸出。 3、當一個partition內丟失，由於子RDD知道父RDD是誰，所以子RDD可以將函式再次作用在父RDD的partition上，重新

Spark-SQL之DataFrame操作大全

　　Spark SQL中的DataFrame類似於一張關係型資料表。在關係型資料庫中對單表或進行的查詢操作，在DataFrame中都可以通過呼叫其API介面來實現。可以參考，Scala提供的DataFrame AP

Spark修煉之道（高階篇）——Spark原始碼閱讀：第十三節 Spark SQL之SQLContext（一)

作者：周志湖 1. SQLContext的建立 SQLContext是Spark SQL進行結構化資料處理的入口，可以通過它進行DataFrame的建立及SQL的執行，其建立方式如下： //sc為SparkContext val sqlContext

11.spark sql之RDD轉換DataSet

反射方式

編程方式

相關推薦