[解決辦法] Invalid PythonUDF (), requires attributes from more than one child.
阿新 • • 發佈:2018-12-15
[解決辦法] Invalid PythonUDF (), requires attributes from more than one child.
報題中的錯誤,解決辦法:在過濾過程前 加 df.cache()
(這裡的 df 為過濾的 DataFrame)
The sequence of steps that causes this are:
join two dataframes A and B > make a udf that uses one column from A and another from B > filter on column produced by udf > java.lang.RuntimeException: Invalid PythonUDF <lambda>(b#1L, c#6L), requires attributes from more than one child.
Here are some minimum steps to reproduce this issue in pyspark
from pyspark.sql import types from pyspark.sql import functions as F df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)]) df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)]) joined = df1.join(df2, df1['a'] == df2['a']) extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, types.IntegerType())(joined['b'], joined['c'])) filtered = extra.where(extra['sum'] < F.lit(10)).collect()
doing extra.cache() before the filtering will fix the issue but obviously isn’t a solution.