Scrapy Learning - 18 - How Deduplication Works
阿新 • Published: 2018-05-23
Scrapy ships with built-in request deduplication.
In the Scrapy source, the duplicate filter lives in dupefilters.py (the default RFPDupeFilter class); the scheduler consults it before enqueuing each request, and drops any request whose fingerprint it has already seen (unless the request sets dont_filter=True).
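The core idea of the filter is simple: keep a set of fingerprints of requests already scheduled, and refuse to schedule a request whose fingerprint is in the set. The following is a minimal sketch of that idea only, not Scrapy's actual RFPDupeFilter class (the class name and method here are illustrative; the real filter also supports persisting fingerprints to disk between runs):

```python
class SimpleDupeFilter:
    # Minimal sketch of the idea behind Scrapy's RFPDupeFilter
    # (illustrative, not the real class): remember every fingerprint
    # in a set and report whether it was seen before.
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, fp):
        # Return True if this fingerprint was already scheduled,
        # otherwise record it and return False.
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

df = SimpleDupeFilter()
print(df.request_seen("abc"))  # False (first time seen)
print(df.request_seen("abc"))  # True  (duplicate)
```

Because membership tests on a set are O(1) on average, this check stays cheap even with millions of scheduled requests.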
The fingerprinting algorithm in the source
# The returned fingerprint is stored in a set, which implements the deduplication
def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
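In other words, the fingerprint is a SHA1 hash of the request method, the canonicalized URL, and the body, so two requests that differ only in superficial URL details (such as query-parameter order) hash to the same value. The self-contained sketch below reproduces that recipe using only the standard library; `simple_canonicalize` is a deliberately simplified stand-in for w3lib's `canonicalize_url` (it only sorts query parameters, while the real function also handles percent-encoding, fragments, and more):

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def simple_canonicalize(url):
    # Simplified stand-in for w3lib's canonicalize_url:
    # sort the query parameters so equivalent URLs compare equal.
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse(parts._replace(query=query))

def fingerprint(method, url, body=b""):
    # Same recipe as Scrapy: SHA1 over method + canonical URL + body.
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(simple_canonicalize(url).encode())
    fp.update(body or b"")
    return fp.hexdigest()

# Two URLs that differ only in query-parameter order
# produce the same fingerprint:
a = fingerprint("GET", "http://example.com/page?a=1&b=2")
b = fingerprint("GET", "http://example.com/page?b=2&a=1")
print(a == b)  # True
```

Note that headers are excluded from the hash by default; they only participate when `include_headers` is passed explicitly, which is why requests differing only in cookies or User-Agent are normally treated as duplicates.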