Calling class methods in Spark
阿新 · Published: 2019-02-04
Calling a class method inside a PySpark transformation raises the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Cause:
Spark does not allow SparkContext to be accessed inside an action or transformation. If your action or transformation references self, Spark serializes the entire object and ships it to the worker nodes, and the SparkContext goes along with it: even if you never access it explicitly, it is still captured in the closure, which causes the error.
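For reference, a minimal sketch of the kind of code that triggers the error; the constructor parameter and the self.sc attribute are illustrative additions, and 'some.csv' follows the fixed example further below:

class model(object):
    def __init__(self, sc):
        self.sc = sc                             # SparkContext stored on the instance
        self.data = sc.textFile('some.csv')

    def transformation_function(self, row):
        row = row.split(',')
        return row[0] + row[1]

    def run_model(self):
        # self.transformation_function is a bound method, so the map closure
        # captures self; pickling self pulls in self.sc and raises the
        # SPARK-5063 exception shown above
        self.data = self.data.map(self.transformation_function)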
Solution:
Define the method that the transformation calls as a static method with @staticmethod, and reference it through the class rather than through self:
class model(object):
    @staticmethod
    def transformation_function(row):
        row = row.split(',')
        return row[0] + row[1]

    def __init__(self):
        # sc is the driver-side SparkContext (e.g. the one predefined in the pyspark shell)
        self.data = sc.textFile('some.csv')

    def run_model(self):
        # referencing the function through the class keeps self out of the closure,
        # so nothing holding a SparkContext is serialized to the workers
        self.data = self.data.map(model.transformation_function)
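A hypothetical driver-side usage of the fixed class (the local master, the app name, and the existence of some.csv are assumptions, not part of the original post):

from pyspark import SparkContext

sc = SparkContext('local[*]', 'static-method-demo')   # assumed master and app name

m = model()
m.run_model()
print(m.data.take(5))    # the mapped RDD now evaluates without the SPARK-5063 error
sc.stop()

Because run_model references model.transformation_function through the class, the map closure contains only a plain function and self is never pickled.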
Reference:
https://stackoverflow.com/questions/32505426/how-to-process-rdds-using-a-python-class