Modifying Scrapy's Deduplication (Dupefilter) Strategy
阿新 · Published 2018-11-11
1. First, create a custom duplication.py file. The numbers in the docstrings below indicate the order in which Scrapy invokes each method:
```python
class RepeatFilter(object):

    def __init__(self):
        """
        2. Initialize the object
        """
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):
        """
        1. Create the filter instance
        :param settings:
        :return:
        """
        print('......')
        return cls()

    def request_seen(self, request):
        """
        4. Check whether this request has already been seen
        :param request:
        :return:
        """
        if request.url in self.visited_set:
            return True
        self.visited_set.add(request.url)
        return False

    def open(self):  # can return a deferred
        """
        3. Called when crawling starts
        :return:
        """
        print('open')

    def close(self, reason):  # can return a deferred
        """
        5. Called when crawling stops
        :param reason:
        :return:
        """
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
```
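Note that this filter deduplicates on the raw URL string, so two URLs that differ only in query-parameter order count as distinct requests, and the visited set lives only in memory. Scrapy's built-in default, `RFPDupeFilter`, instead hashes a normalized request fingerprint. Below is a minimal sketch of the same idea, assuming a Scrapy version contemporary with this post (1.x) that ships `scrapy.utils.request.request_fingerprint`; the class name is my own, not from the original post:

```python
from scrapy.utils.request import request_fingerprint


class FingerprintRepeatFilter(object):
    """Variant of RepeatFilter that deduplicates on request fingerprints
    (method + canonicalized URL + body) instead of the raw URL string."""

    def __init__(self):
        self.visited_fingerprints = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # request_fingerprint() returns a stable SHA1 hex digest computed
        # over the canonicalized URL, so URLs differing only in
        # query-parameter order collapse into one entry.
        fp = request_fingerprint(request)
        if fp in self.visited_fingerprints:
            return True
        self.visited_fingerprints.add(fp)
        return False

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        pass
```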
2. Then register the filter by adding the following line to the project's settings file:
```python
DUPEFILTER_CLASS = 'day96.duplication.RepeatFilter'
```
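With this setting in place, running any spider will print 'open' and '......' on startup, drop repeated URLs silently, and print 'close' on shutdown. A small hypothetical spider to exercise the filter (the spider name and URLs are placeholders, not from the original post):

```python
import scrapy


class DedupDemoSpider(scrapy.Spider):
    name = 'dedup_demo'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Both yields target the same URL; RepeatFilter.request_seen()
        # returns True for the second one, so it is silently dropped.
        yield scrapy.Request('http://quotes.toscrape.com/page/2/',
                             callback=self.parse_page)
        yield scrapy.Request('http://quotes.toscrape.com/page/2/',
                             callback=self.parse_page)
        # dont_filter=True bypasses the dupefilter entirely, so this
        # request is scheduled even though its URL was already seen.
        yield scrapy.Request(response.url, callback=self.parse_page,
                             dont_filter=True)

    def parse_page(self, response):
        self.logger.info('Crawled %s', response.url)
```

One caveat: Scrapy issues the requests built from start_urls with dont_filter=True, so duplicates listed there are never filtered; only requests yielded during the crawl go through request_seen().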