Common join operations in Spark
The RDD API provides join operations that behave much like the joins you know from SQL.
(1) join
If you know SQL, join needs little introduction: it behaves much like SQL's inner join, returning only the pairs whose keys appear in both datasets and filtering out records with no match.
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.
A concrete example:
val a = sc.parallelize(Array(("1", 4.0), ("2", 8.0), ("3", 9.0)))
val b = sc.parallelize(Array(("1", 2.0), ("2", 8.0)))
val c = a.join(b)
c.foreach(println)
// printed output:
// (2,(8.0,8.0))
// (1,(4.0,2.0))
// Key "3" has no match in b, so it is filtered out; the matching keys are merged.
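The API doc above mentions a hash join across the cluster. If the default parallelism is not what you want, join also takes an explicit partition count; a minimal sketch (the count 4 here is an arbitrary choice for illustration):
val d = a.join(b, 4) // same inner join, but the result RDD has 4 partitions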
(2) leftOuterJoin
leftOuterJoin works like SQL's left outer join: every record from the first RDD appears in the result, and records with no match in the second RDD come back with None.
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
Perform a left outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k. Hash-partitions the output using the existing partitioner/parallelism level.
A concrete example:
val a = sc.parallelize(Array(("1", 4.0), ("2", 8.0), ("3", 9.0)))
val b = sc.parallelize(Array(("1", 2.0), ("2", 8.0)))
val c = a.leftOuterJoin(b)
c.foreach(println)
// printed output:
// (2,(8.0,Some(8.0)))
// (3,(9.0,None))
// (1,(4.0,Some(2.0)))
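Because the right-hand values come back as Option[W], downstream code usually has to unwrap them. A minimal sketch of one common pattern, using getOrElse with an arbitrary default of 0.0:
// turn the (Double, Option[Double]) values into plain (Double, Double)
val filled = c.mapValues { case (v, wOpt) => (v, wOpt.getOrElse(0.0)) }
filled.foreach(println)
// key 3 now prints as (3,(9.0,0.0)); matched keys keep their real values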
(3) rightOuterJoin
rightOuterJoin works like SQL's right outer join: every record from the second RDD (the argument) appears in the result, and records with no match in the first RDD come back with None.
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k. Hash-partitions the resulting RDD using the existing partitioner/parallelism level.
A concrete example:
val a = sc.parallelize(Array(("1", 4.0), ("2", 8.0), ("3", 9.0)))
val b = sc.parallelize(Array(("1", 2.0), ("2", 8.0)))
val c = a.rightOuterJoin(b)
c.foreach(println)
// printed output:
// (2,(Some(8.0),8.0))
// (1,(Some(4.0),2.0))
// Key "3" exists only in a, so it is dropped; every key in b has a match, so no None appears.
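One caveat that applies to all of the examples above: they print as shown when Spark runs in local mode, but on a cluster rdd.foreach(println) runs on the executors, so the output lands in the executor logs rather than the driver console. For small results, collect to the driver first:
c.collect().foreach(println) // bring the joined pairs back to the driver, then print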