spark 計算前後兩條記錄之間的差(diff),時間差等
阿新 • • 發佈:2021-07-17
有時候會遇到這樣的場景:有一個datafram,我們需要計算同一組物件中,前後兩條記錄之間的差值,此處並不僅限於時間,還可以是其他的資料型別
需要用到兩個工具:spark視窗函式Window對物件分組以及lag函式
val df = Seq( ("notebook","2019-01-01 00:00:00"), ("notebook", "2019-01-10 13:02:00"), ("notebook", "2019-01-10 13:15:22"), ("small_phone", "2019-01-30 09:30:00"), ("small_phone", "2019-01-15 12:00:00"), ("small_phone", "2019-01-30 09:50:00"), ("small_phone", "2019-01-30 09:32:00"), ("big_phone", "2019-01-2 09:30:00") ).toDF("device", "purchase_time").sort("device","purchase_time")
val sessionWindow = Window.partitionBy("device").orderBy("purchase_time")
val diffDf = df.withColumn("pre_time",
functions.lag($"purchase_time",1).over(sessionWindow))
diffDf.show()
val minitesDf = diffDf.withColumn("purchase_time", functions.to_timestamp(col("purchase_time"),"yyyy-mm-dd HH:mm:ss")) .withColumn("pre_time", functions.to_timestamp(col("pre_time"),"yyyy-mm-dd HH:mm:ss")) .withColumn("minitues_diff", round((col("purchase_time").cast(LongType)-col("pre_time").cast(LongType))/60)) minitesDf.show()