[sparkSQL][union] Notes on using UNION, plus one more odd way to deduplicate

阿新 • Published: 2021-12-01

Notes on UNION in SQL

Conclusions

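UNION and UNION ALL:

Purpose: concatenates the result sets of two SQL queries.

Requirements: the number of columns must match (mandatory). Column types should match, but this is not enforced: mismatched types are widened in the order int→double→string.

Output: the result takes its column names from the first table.

Difference: UNION removes duplicate rows from the combined result; UNION ALL keeps every row.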
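A minimal sketch of the last two points, assuming the standard SQL behavior carries over to Spark. It uses Python's built-in sqlite3 instead of a Spark session so that it runs anywhere; the second table's column names (name, s1, s2) are hypothetical, chosen only to show which names survive.

```python
import sqlite3

# In-memory SQLite database standing in for Spark (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT, score1 INTEGER, score2 INTEGER);
    CREATE TABLE table2 (name TEXT, s1 INTEGER, s2 INTEGER);  -- hypothetical column names
    INSERT INTO table1 VALUES ('a', 1, 2), ('a', 1, 2), ('b', 1, 2), ('b', 2, 3);
    INSERT INTO table2 VALUES ('a', 1, 2), ('b', 2, 3);
""")

# UNION: duplicates are removed from the combined result.
union_cur = conn.execute("SELECT * FROM table1 UNION SELECT * FROM table2")
union_cols = [d[0] for d in union_cur.description]
union_rows = sorted(union_cur.fetchall())

# UNION ALL: every row from both inputs is kept.
union_all_rows = conn.execute(
    "SELECT * FROM table1 UNION ALL SELECT * FROM table2"
).fetchall()

print(union_cols)           # ['id', 'score1', 'score2']  (names from the first table)
print(union_rows)           # [('a', 1, 2), ('b', 1, 2), ('b', 2, 3)]
print(len(union_all_rows))  # 6
```

The three distinct rows match what Spark returns for the same data in the transcript further down.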

>>> # check the number of columns
********select * from table1********
+---+------+------+
| id|score1|score2|
+---+------+------+
|  a|     1|     2|
|  a|     1|     2|
|  b|     1|     2|
|  b|     2|     3|
+---+------+------+
********select * from table2********
+---+------+------+------+
| id|score1|score2|score3|
+---+------+------+------+
|  a|     1|     2|     3|
|  b|     2|     3|     4|
+---+------+------+------+
>>> df3 = spark.sql(
    """
    select *
    from table1
    union
    select *
    from table2
    """
    )
>>> # Raises AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 3 columns and the second table has 4 columns

Note: the demonstration below covers int→double only; the string case is not shown.

>>> # check column types
>>> # widening order: int→double→string
>>> df1.show()
+---+------+------+
| id|score1|score2|
+---+------+------+
|  a|     1|     2|
|  a|     1|     2|
|  b|     1|     2|
|  b|     2|     3|
+---+------+------+
>>> df2.show()
+---+------+------+
| id|score1|score2|
+---+------+------+
|  a|   1.0|   2.0|
|  b|   2.0|   3.0|
+---+------+------+
>>> df1.createOrReplaceTempView('table1')
>>> df2.createOrReplaceTempView('table2')
>>> df3 = spark.sql(
    """
    select *
    from table1
    union
    select *
    from table2
    """
    )
>>> df3.show()
+---+------+------+
| id|score1|score2|
+---+------+------+
|  a|   1.0|   2.0|
|  b|   2.0|   3.0|
|  b|   1.0|   2.0|
+---+------+------+
>>> print(df1)
DataFrame[id: string, score1: bigint, score2: bigint]
>>> print(df2)
DataFrame[id: string, score1: double, score2: double]
>>> print(df3)
DataFrame[id: string, score1: double, score2: double]

Self-UNION: one more odd way to deduplicate

Start with this SQL:

select *
from table1
union
select *
from table1

Does this query actually do anything?

It does.

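UNION does not track which input a row came from, and the combined result has no guaranteed order. When table1 UNION table2 removes duplicates:

it does not "take table1 as the base, check whether each row of table2 already appears in table1, drop it if it does, keep it if it does not, and append what remains";

instead it "pools all rows of table1 and table2, then deduplicates the pooled rows".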
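The two mental models can be compared directly. The sketch below is an illustration under assumptions: it runs on Python's built-in sqlite3 rather than Spark, and table2 holds a hypothetical row that does not overlap with table1, so the only duplicates are the ones already inside table1.

```python
import sqlite3

table1 = [("a", 1, 2), ("a", 1, 2), ("b", 1, 2), ("b", 2, 3)]
table2 = [("c", 9, 9)]  # hypothetical row, no overlap with table1

conn = sqlite3.connect(":memory:")
for name, rows in (("table1", table1), ("table2", table2)):
    conn.execute(f"CREATE TABLE {name} (id TEXT, score1 INTEGER, score2 INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

actual = sorted(
    conn.execute("SELECT * FROM table1 UNION SELECT * FROM table2").fetchall()
)

# Model A (not what happens): keep table1 untouched, append unseen table2 rows.
model_a = sorted(table1 + [row for row in table2 if row not in table1])

# Model B (what happens): pool both inputs, then deduplicate the pool.
model_b = sorted(set(table1 + table2))

print(actual == model_a)  # False: model A would keep both copies of ('a', 1, 2)
print(actual == model_b)  # True
```

Because the duplicate inside table1 disappears even though table2 never mentions it, the deduplication must run over the pooled rows.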

So the two queries below return the same result (their performance probably differs, though I have not verified that).

select distinct *
from table1

--------------------------

select *
from table1
union
select *
from table1
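A quick check of the equivalence, sketched with Python's built-in sqlite3 as a stand-in for Spark (an assumption; only the result sets are compared, not the performance).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT, score1 INTEGER, score2 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("a", 1, 2), ("a", 1, 2), ("b", 1, 2), ("b", 2, 3)],
)

# Sort both results because neither query guarantees an output order.
distinct_rows = sorted(conn.execute("SELECT DISTINCT * FROM table1").fetchall())
self_union_rows = sorted(
    conn.execute("SELECT * FROM table1 UNION SELECT * FROM table1").fetchall()
)

print(distinct_rows == self_union_rows)  # True
print(distinct_rows)  # [('a', 1, 2), ('b', 1, 2), ('b', 2, 3)]
```

For the efficiency question in Spark, calling `explain()` on the two DataFrames would show whether their physical plans differ; I have not run that comparison here.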

Why I started paying attention to UNION

select distinct *
from sample1
union
select distinct *
from sample2
union
select distinct *
from sample3
-- The above is how I wrote it
----------------------------------------
-- Below is what someone else suggested (to make the SQL run more efficiently)
select *
from sample1
union
select *
from sample2
union
select *
from sample3
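Since UNION already deduplicates the pooled rows, the inner DISTINCT should be redundant as far as the result is concerned. The sketch below checks that on made-up data; the table contents and the two-column layout are hypothetical, and it runs on Python's built-in sqlite3 rather than Spark, so it says nothing about execution speed.

```python
import sqlite3

# Hypothetical sample data with duplicates inside and across tables.
samples = {
    "sample1": [("a", 1), ("a", 1), ("b", 2)],
    "sample2": [("b", 2), ("c", 3), ("c", 3)],
    "sample3": [("a", 1), ("d", 4)],
}

conn = sqlite3.connect(":memory:")
for name, rows in samples.items():
    conn.execute(f"CREATE TABLE {name} (id TEXT, score INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)

# My version: DISTINCT on every branch, then UNION.
with_distinct = " UNION ".join(f"SELECT DISTINCT * FROM {t}" for t in samples)
# Suggested version: plain SELECT on every branch, then UNION.
without_distinct = " UNION ".join(f"SELECT * FROM {t}" for t in samples)

rows_with = sorted(conn.execute(with_distinct).fetchall())
rows_without = sorted(conn.execute(without_distinct).fetchall())

print(rows_with == rows_without)  # True
print(rows_without)  # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```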