How to avoid duplicate columns after a Spark DataFrame JOIN (solving the "Reference '***' is ambiguous" error)
Spark DataFrame provides a powerful JOIN operation, but in practice it often produces duplicate columns. For example, create two DataFrames as follows (the snippets are spark-shell style, where sc and the toDF implicits are already in scope):
val df = sc.parallelize(Array(
("one", "A", 1), ("one", "B", 2), ("two", "A", 3), ("two", "B", 4)
)).toDF("key1", "key2", "value")
df.show()
+----+----+-----+
|key1|key2|value|
+----+----+-----+
| one| A| 1|
| one| B| 2|
| two| A| 3|
| two| B| 4|
+----+----+-----+
val df2 = sc.parallelize(Array(
("one", "A", 5), ("two", "A", 6)
)).toDF("key1", "key2", "value2")
df2.show()
+----+----+------+
|key1|key2|value2|
+----+----+------+
| one| A| 5|
| two| A| 6|
+----+----+------+
Joining them with an explicit column-equality expression produces duplicated key1 and key2 columns:
val joined = df.join(df2, df("key1") === df2("key1") && df("key2") === df2("key2"), "left_outer")
joined.show()
+----+----+-----+----+----+------+
|key1|key2|value|key1|key2|value2|
+----+----+-----+----+----+------+
| two| A| 3| two| A| 6|
| two| B| 4|null|null| null|
| one| A| 1| one| A| 5|
| one| B| 2|null|null| null|
+----+----+-----+----+----+------+
With both copies of a column present, referencing one of them by its bare name raises an error: org.apache.spark.sql.AnalysisException: Reference 'key2' is ambiguous.
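To see this concretely, selecting key2 by name on the joined result fails, while qualifying each column through the DataFrame it came from still resolves; a minimal sketch against the joined DataFrame above:

// joined.select("key2")  // throws AnalysisException: Reference 'key2' is ambiguous
// Qualifying each column through its source DataFrame avoids the ambiguity:
joined.select(df("key1"), df("key2"), df("value"), df2("value2")).show()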
Because of this, there are many posts online about how to drop columns after a JOIN. After some digging, it turns out the problem can be avoided entirely, and very simply, by changing the join expression itself: pass the join keys as a Seq, and Spark keeps a single copy of each key column.
df.join(df2, Seq("key1", "key2"), "left_outer").show()
+----+----+-----+------+
|key1|key2|value|value2|
+----+----+-----+------+
| two| A| 3| 6|
| two| B| 4| null|
| one| A| 1| 5|
| one| B| 2| null|
+----+----+-----+------+
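For comparison, the drop-columns-after-JOIN approach mentioned earlier also works; a sketch, assuming Spark 2.0+ where drop accepts a Column reference (dropping by bare name would itself be ambiguous here):

// Drop the duplicated join keys coming from df2 by Column reference:
val deduped = joined.drop(df2("key1")).drop(df2("key2"))
deduped.show()  // same four columns as the Seq-based join above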
Tested in practice: the Seq-based join works perfectly and is the simplest fix.
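Finally, if you ever need to keep both copies of a key column, aliasing the two DataFrames is another common way to sidestep the ambiguous-reference error; a sketch, assuming a spark-shell session where the $ column syntax is in scope:

val a = df.as("a")
val b = df2.as("b")
// Columns are now addressed as "a.key1", "b.key1", etc., so nothing is ambiguous:
a.join(b, $"a.key1" === $"b.key1" && $"a.key2" === $"b.key2", "left_outer")
  .select($"a.key1", $"a.key2", $"a.value", $"b.value2")
  .show()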