
Spark Scala: dropping rows where all columns are null

Drop rows where every column is null or NaN:

df.na.drop("all")

Drop rows where any column is null or NaN:

df.na.drop("any")

Example:
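The session below assumes df was loaded from the small_zipcode.csv dataset linked in the references at the end of this post; a minimal sketch of building it in spark-shell (the path comes from that example repository):

// read the sample CSV; empty fields become null by default
val df = spark.read
  .option("header", "true")
  .csv("src/main/resources/small_zipcode.csv")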

scala> df.show
+----+-------+--------+-------------------+-----+----------+
|  id|zipcode|    type|               city|state|population|
+----+-------+--------+-------------------+-----+----------+
|   1|    704|STANDARD|               null|   PR|     30100|
|   2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|   3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|   4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|   5|  76177|STANDARD|               null|   TX|      null|
|null|   null|    null|               null| null|      null|
|   7|  76179|STANDARD|               null|   TX|      null|
+----+-------+--------+-------------------+-----+----------+

scala> df.na.drop("all").show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
|  7|  76179|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+

scala> df.na.drop().show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+

scala> df.na.drop("any").show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+

Drop rows that are null in the given columns:

// read the column names to check (one per line) from a text file
val nameArray = sparkEnv.sc.textFile("/master/abc.txt").collect()
// collect() already returns Array[String], so it can be passed directly;
// assigning back to a new val avoids the self-referencing "val df = df..." error
val dfCleaned = df.na.drop("all", nameArray)

df.na.drop(Seq("population","type"))
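With the example data above, this keeps only the rows where both population and type are non-null (ids 1 and 4). A quick sanity check comparing row counts before and after (the variable name is just for illustration):

val kept = df.na.drop(Seq("population", "type"))
println(s"before: ${df.count()}, after: ${kept.count()}")   // 7 -> 2 on the example data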

Method signatures:

def drop(): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values.

def drop(how: String): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values.
If how is "any", then drop rows containing any null or NaN values. If how is "all", then drop rows only if every column is null or NaN for that row.

def drop(how: String, cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.

def drop(how: String, cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.

def drop(cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.

def drop(cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
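For completeness, here is a small self-contained sketch (the toy DataFrame and its column names are made up for illustration) that exercises several of these overloads:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-drop-demo").getOrCreate()
import spark.implicits._

// three rows: one fully populated, one partially null, one entirely null
val toy = Seq(
  (Some(1), Some("a")),
  (Some(2), None),
  (None: Option[Int], None: Option[String])
).toDF("id", "name")

toy.na.drop().count()                    // any null in any column  -> keeps 1 row
toy.na.drop("all").count()               // only the all-null row   -> keeps 2 rows
toy.na.drop("any", Seq("name")).count()  // null in "name"          -> keeps 1 row
toy.na.drop(Seq("id")).count()           // null in "id"            -> keeps 2 rows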

Full list of method signatures:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions


References:
Many more Spark usage examples: https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
Example code and dataset: https://github.com/spark-examples/spark-scala-examples (CSV path: src/main/resources/small_zipcode.csv)
https://www.jianshu.com/p/39852729736a