1. 程式人生 > 其它 >spark 計算前後兩條記錄之間的差(diff),時間差等

spark 計算前後兩條記錄之間的差(diff),時間差等

有時候會遇到這樣的場景:有一個datafram,我們需要計算同一組物件中,前後兩條記錄之間的差值,此處並不僅限於時間,還可以是其他的資料型別
需要用到兩個工具:spark視窗函式Window對物件分組以及lag函式

val df = Seq(
    ("notebook","2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:02:00"),
    ("notebook", "2019-01-10 13:15:22"),
    ("small_phone", "2019-01-30 09:30:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:50:00"),
    ("small_phone", "2019-01-30 09:32:00"),
    ("big_phone", "2019-01-2 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")
val sessionWindow = Window.partitionBy("device").orderBy("purchase_time")
val diffDf = df.withColumn("pre_time",
                          functions.lag($"purchase_time",1).over(sessionWindow))
diffDf.show()
val minitesDf = diffDf.withColumn("purchase_time",
                                  functions.to_timestamp(col("purchase_time"),"yyyy-mm-dd HH:mm:ss"))
                       .withColumn("pre_time",
                                 functions.to_timestamp(col("pre_time"),"yyyy-mm-dd HH:mm:ss"))
                       .withColumn("minitues_diff",
                                  round((col("purchase_time").cast(LongType)-col("pre_time").cast(LongType))/60))
minitesDf.show()