spark sql 不等值 join
阿新 • • 發佈:2018-09-06
匹配 rod 日期 變更 star 牛奶 發生 spark art products一個商品價格變化的表,orders商品訂單,記錄每次購買商品和日期
基於Spark SQL中的不等值join實現orders和products的匹配,統計每個訂單中商品對應當時的價格
基於Spark SQL中的不等值join實現orders和products的匹配,統計每個訂單中商品對應當時的價格
緩慢變化的商品價格表
旺仔牛奶,發生過一次價格變更
scala> val products = sc.parallelize(Array( | ("旺仔牛奶", "2017-01-01", "2018-01-01", 4), | ("旺仔牛奶", "2018-01-02", "2020-01-01", 5), | ("王老吉", "2017-01-02", "2019-01-01", 5), | ("衛龍辣條", "2010-01-01", "2020-01-01", 2) | )).toDF("name", "startDate", "endDate", "price") products: org.apache.spark.sql.DataFrame = [name: string, startDate: string ... 2 more fields] scala> products.show(); +----+----------+----------+-----+ |name| startDate| endDate|price| +----+----------+----------+-----+ |旺仔牛奶|2017-01-01|2018-01-01| 4| |旺仔牛奶|2018-01-02|2020-01-01| 5| | 王老吉|2017-01-02|2019-01-01| 5| |衛龍辣條|2010-01-01|2020-01-01| 2| +----+----------+----------+-----+
訂單表(商品名稱,訂單日期)
旺仔牛奶在不同價格時段分別發生了一次訂單
scala> val orders = sc.parallelize(Array( | ("2017-06-01", "旺仔牛奶"), | ("2017-07-01", "王老吉"), | ("2018-03-01", "旺仔牛奶") | )).toDF("date", "product") orders: org.apache.spark.sql.DataFrame = [date: string, product: string] scala> orders.show +----------+-------+ | date|product| +----------+-------+ |2017-06-01|旺仔牛奶| |2017-07-01| 王老吉| |2018-03-01|旺仔牛奶| +----------+-------+
通過不等值連接,計算每個訂單當時的商品價格
查看出旺仔牛奶,兩個訂單在不同時間段上對應的價格
scala> orders.join(products, $"product" === $"name" && $"date" >= $"startDate" && $"date" <= $"endDate").show() +-----------+------------+----------+------------+-------------+-----+ | date | product | name | startDate | endDate | price| +-----------+------------+----------+------------+-------------+-----+ |2017-07-01| 王老吉 | 王老吉 |2017-01-02|2019-01-01 | 5 | |2017-06-01| 旺仔牛奶 |旺仔牛奶|2017-01-01|2018-01-01 | 4 | |2018-03-01| 旺仔牛奶 |旺仔牛奶|2018-01-02|2020-01-01 | 5 | +-----------+------------+----------+------------+-------------+-----+
spark sql 不等值 join