
Scrapy Learning 18: How Deduplication Works


How Scrapy deduplicates requests: Scrapy ships with a built-in duplicate filter, which you can find in dupefilters.py in the Scrapy source tree. The fingerprinting algorithm it relies on, taken from the source:
# The computed fingerprint is stored in a set, which is what implements the deduplication.

import hashlib
import weakref

from w3lib.url import canonicalize_url

from scrapy.utils.python import to_bytes

_fingerprint_cache = weakref.WeakKeyDictionary()


def request_fingerprint(request, include_headers=None):
    # Normalize header names to sorted, lowercase bytes so the same
    # header set always hashes identically.
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    # Fingerprints are cached per request object, keyed by the header tuple.
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        # canonicalize_url sorts query arguments and normalizes escaping,
        # so equivalent URLs hash to the same value.
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
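A quick usage sketch (assuming a Scrapy version where scrapy.utils.request.request_fingerprint is still exposed; newer releases replace it with fingerprinter classes): because canonicalize_url sorts query parameters, two requests that differ only in parameter order produce the same fingerprint.

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request("http://example.com/page?a=1&b=2")
r2 = Request("http://example.com/page?b=2&a=1")

# Same fingerprint, so the second request would be filtered as a duplicate.
print(request_fingerprint(r1) == request_fingerprint(r2))  # True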

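To see how that fingerprint drives the actual filtering, here is a simplified sketch modeled on (but not identical to) the RFPDupeFilter class in dupefilters.py; SimpleDupeFilter is a made-up name for illustration. Every fingerprint seen so far is kept in a set, and a request is dropped when its fingerprint is already in the set.

from scrapy.utils.request import request_fingerprint


class SimpleDupeFilter:
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, request):
        # Hash the request and check it against everything seen so far.
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True   # duplicate: the scheduler discards it
        self.fingerprints.add(fp)
        return False      # new request: let it through

The real RFPDupeFilter works the same way, and can additionally persist the fingerprint set to a requests.seen file so that deduplication survives restarts when a job directory (JOBDIR) is configured.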