Deduplicating crawled URLs in Scrapy with a custom class
阿新 · Published 2018-11-05
Previously we handled URL deduplication by keeping a set inside the parse method.
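For reference, here is a minimal sketch of that earlier approach; the spider name, selector, and URLs are illustrative assumptions, not code from the original project:

import scrapy

class ChoutiSpider(scrapy.Spider):
    # Hypothetical spider doing manual deduplication inside parse
    name = "chouti"
    start_urls = ["https://dig.chouti.com/"]

    visited_urls = set()  # URLs that have already been yielded

    def parse(self, response):
        print(response.url)
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            if url in self.visited_urls:  # skip anything seen before
                continue
            self.visited_urls.add(url)
            yield scrapy.Request(url=url, callback=self.parse)

This works, but the deduplication logic is tangled up with the parsing logic; moving it into a dedicated dupe filter class keeps the spider clean.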
First, create a new file duplication.py in the project root, add from scrapy.dupefilter import RFPDupeFilter, open the RFPDupeFilter source, and copy the BaseDupeFilter class into the new duplication.py as a starting point.
class RepeatFilter(object):
    def __init__(self):
        self.visited_set = set()

    @classmethod
    def from_settings(cls, settings):
        # class method Scrapy uses to build the filter; returns a RepeatFilter instance
        return cls()

    def request_seen(self, request):
        # URL filtering: True means "already seen", so the request is dropped
        if request.url in self.visited_set:
            return True
        else:
            self.visited_set.add(request.url)
            return False

    def open(self):
        # called when crawling starts
        print("--- crawl started ---")

    def close(self, reason):
        # called when crawling ends
        print("--- crawl finished ---")

    def log(self, request, spider):
        # log filtered requests
        pass
The URL-filtering logic goes in the request_seen method; the spider sketch after the list below shows requests flowing through it.
The methods are called in this order:
1. from_settings
2. __init__
3. open
4. log
5. close
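To see the filter in action end to end, here is a hedged sketch of the chouti spider; the pagination selector and URL pattern are assumptions. Every yielded Request is checked by RepeatFilter.request_seen before download, so revisited pages are dropped (a Request created with dont_filter=True would bypass the filter entirely):

import scrapy

class ChoutiSpider(scrapy.Spider):
    # Hypothetical spider; the pagination XPath is an assumption
    name = "chouti"
    start_urls = ["https://dig.chouti.com/"]

    def parse(self, response):
        print(response.url)  # only pages that passed request_seen reach parse
        page_links = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for link in page_links:
            url = "https://dig.chouti.com%s" % link
            # duplicate URLs are filtered out by RepeatFilter before download
            yield scrapy.Request(url=url, callback=self.parse)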
Finally, don't forget to add DUPEFILTER_CLASS = "shan.duplication.RepeatFilter" to settings.py.
The default is DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter".
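Put together, the relevant settings.py lines look roughly like this (the package name shan comes from the example project above):

# settings.py of the "shan" project
# Use the custom filter from duplication.py instead of the built-in one
DUPEFILTER_CLASS = "shan.duplication.RepeatFilter"

# Scrapy's built-in default, shown for comparison:
# DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"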
Running the crawler now, each page is fetched exactly once:

(venv) D:\shan>scrapy crawl chouti --nolog
D:\shan\shan\spiders\chouti.py:9: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
--- crawl started ---
https://dig.chouti.com/
https://dig.chouti.com/all/hot/recent/2
https://dig.chouti.com/all/hot/recent/3
https://dig.chouti.com/all/hot/recent/8
https://dig.chouti.com/all/hot/recent/5
https://dig.chouti.com/all/hot/recent/7
https://dig.chouti.com/all/hot/recent/6
https://dig.chouti.com/all/hot/recent/10
https://dig.chouti.com/all/hot/recent/9
https://dig.chouti.com/all/hot/recent/4
--- crawl finished ---