
Calling class methods in Spark


Calling a class method inside a PySpark transformation raises the error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Cause:

Spark does not allow the SparkContext to be referenced inside an action or transformation. If the function you pass to an action or transformation references self, Spark serializes the entire object and ships it to the worker nodes, and that object holds the SparkContext. Even if the method never accesses the SparkContext explicitly, it is captured in the closure, so the error is raised.
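The mechanism can be demonstrated without Spark at all, using the standard pickle module (a sketch: the threading.Lock attribute stands in for the unpicklable SparkContext, and Spark actually serializes closures with cloudpickle, but the self-capture problem is the same):

```python
import pickle
import threading

class Model:
    def __init__(self):
        # Stand-in for SparkContext: a lock is likewise unpicklable
        self.ctx = threading.Lock()

    def bound_transform(self, row):
        return row.split(',')[0]

    @staticmethod
    def static_transform(row):
        return row.split(',')[0]

m = Model()

# Pickling the bound method pickles the whole instance, ctx included,
# so it fails -- this is the SPARK-5063 situation in miniature
try:
    pickle.dumps(m.bound_transform)
    bound_ok = True
except TypeError:
    bound_ok = False

# The static method carries no reference to the instance, so it
# serializes without dragging the object (or its context) along
static_ok = isinstance(pickle.dumps(Model.static_transform), bytes)
```

Here bound_ok ends up False and static_ok True: only the bound method forces the whole object through serialization.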

Solution:
Define the method being called as a static method with @staticmethod:

class model(object):
    @staticmethod
    def transformation_function(row):
        # Static method: no reference to self, so only the function
        # itself is serialized and shipped to the workers
        fields = row.split(',')
        return fields[0] + fields[1]

    def __init__(self):
        # sc is assumed to be an existing SparkContext on the driver
        self.data = sc.textFile('some.csv')

    def run_model(self):
        self.data = self.data.map(model.transformation_function)
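A plain module-level function works just as well, since it also carries no reference to the instance (a sketch; the map call is shown as a comment because it assumes the self.data RDD from the class above):

```python
def transformation_function(row):
    # Top-level function: pickled by reference, no enclosing object
    fields = row.split(',')
    return fields[0] + fields[1]

# Inside run_model this would become:
#   self.data = self.data.map(transformation_function)
result = transformation_function('a,b,c')  # local sanity check
```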


        
Reference:
https://stackoverflow.com/questions/32505426/how-to-process-rdds-using-a-python-class