1. 程式人生 > >[解決辦法] Invalid PythonUDF (), requires attributes from more than one child.

[解決辦法] Invalid PythonUDF (), requires attributes from more than one child.

[解決辦法] Invalid PythonUDF (), requires attributes from more than one child.

報題中的錯誤,解決辦法:在過濾過程前 加 df.cache() (這裡的 df 為過濾的 DataFrame)

The sequence of steps that causes this are:

join two dataframes A and B > make a udf that uses one column from A and another from B > filter on column produced by udf > java.lang.RuntimeException: Invalid PythonUDF <lambda>(b#1L, c#6L), requires attributes from more than one child.

Here are some minimum steps to reproduce this issue in pyspark

from pyspark.sql import types
from pyspark.sql import functions as F
df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)])
df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)])
joined = df1.join(df2, df1['a'] == df2['a'])
extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, types.IntegerType())(joined['b'], joined['c']))
filtered = extra.where(extra['sum'] < F.lit(10)).collect()

doing extra.cache() before the filtering will fix the issue but obviously isn’t a solution.