
Pitfalls of createDataFrame from an RDD in Spark


Scala:

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // needed outside spark-shell for the tuple Encoder used by createDataset

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

Python:

from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)
], ["id", "features", "clicked"])
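In PySpark, createDataFrame also accepts an RDD of tuples directly, which is the case the title refers to. A minimal sketch (the rows, rdd and df_from_rdd names are illustrative, not from the original post):

    rows = [(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
            (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
            (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)]
    rdd = spark.sparkContext.parallelize(rows)   # RDD of plain tuples
    df_from_rdd = spark.createDataFrame(rdd, ["id", "features", "clicked"])
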
If the input is a pair RDD, then:

    stratified_CV_data = training_data.union(test_data)  # pair RDD
    # schema = StructType([
    #     StructField("label", IntegerType(), True),
    #     StructField("features", VectorUDT(), True)])
    vectorized_CV_data = sqlContext.createDataFrame(stratified_CV_data, ["label", "features"])  # , schema)

All of this is because Spark's cross-validation requires the dataset to be a DataFrame rather than an RDD. Exasperating!
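
For context, a minimal cross-validation sketch over the df built above; the LogisticRegression estimator and the parameter grid are illustrative choices, not from the original post:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="clicked")
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="clicked"),
                        numFolds=3)
    cv_model = cv.fit(df)   # fit() expects a DataFrame; passing an RDD raises an error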
