(5). Deduplicating URLs: separating crawling from deduplication
By 阿新 · Published 2018-07-03
# New file: duplication.py
# We create a separate file dedicated to deduplication. Scrapy's source
# already lays out the structure; we only need to copy it over.
from scrapy.dupefilter import BaseDupeFilter
'''
class BaseDupeFilter(object):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass
'''
# The above is Scrapy's BaseDupeFilter class. The framework has built the
# skeleton for us, so we only need to customize it:

class DupeFilter(object):
    # Use the constructor, keeping the same filtering approach as before
    def __init__(self):
        self.urls = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # request.url is the URL being crawled.
        # If it is already in the set, return True, meaning it has been
        # handled and should not be crawled again.
        if request.url in self.urls:
            return True
        # Not in the set yet: record it and return False, meaning the
        # spider has not crawled this URL.
        self.urls.add(request.url)
        return False

    def open(self):  # called when the spider starts
        pass

    def close(self, reason):  # called when the spider finishes
        pass

    def log(self, request, spider):  # log filtered requests
        pass

# Note the @classmethod that simply returns cls() -- this pattern is very
# common in Scrapy. We never instantiate the class ourselves; Scrapy calls
# this method automatically to create the instance, so we only need to
# write the matching structure.
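The behavior of `request_seen` can be checked outside the Scrapy runtime. Below is a minimal, self-contained sketch: the `Request` stub is a stand-in for `scrapy.http.Request` (only the `url` attribute is assumed), and the filter body is the same set logic as above.

```python
# Stand-in for scrapy.http.Request: only the .url attribute is needed here.
class Request:
    def __init__(self, url):
        self.url = url

class DupeFilter:
    def __init__(self):
        self.urls = set()

    def request_seen(self, request):
        # True means "already seen, skip it"; False means "new, crawl it"
        # and remember the URL for next time.
        if request.url in self.urls:
            return True
        self.urls.add(request.url)
        return False

df = DupeFilter()
first = df.request_seen(Request("https://dig.chouti.com/"))
second = df.request_seen(Request("https://dig.chouti.com/"))
print(first, second)  # False True
```

The first call returns False (new URL, crawl it); the second returns True (duplicate, skip).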
Main program:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # # parse runs repeatedly during recursive crawling, so md5_urls must
    # # not be defined inside the parse function
    # md5_urls = set()
    # Adding URLs to a set was our hand-rolled approach; Scrapy actually
    # provides a better deduplication mechanism.
    def parse(self, response):
        # The output below shows that Scrapy did deduplicate for us
        print(response.url)
        '''
        https://dig.chouti.com/
        https://dig.chouti.com/all/hot/recent/2
        https://dig.chouti.com/all/hot/recent/10
        https://dig.chouti.com/all/hot/recent/8
        https://dig.chouti.com/all/hot/recent/6
        https://dig.chouti.com/all/hot/recent/9
        https://dig.chouti.com/all/hot/recent/4
        https://dig.chouti.com/all/hot/recent/5
        https://dig.chouti.com/all/hot/recent/7
        https://dig.chouti.com/all/hot/recent/3
        https://dig.chouti.com/all/hot/recent/1
        https://dig.chouti.com/all/hot/recent/11
        https://dig.chouti.com/all/hot/recent/12
        https://dig.chouti.com/all/hot/recent/14
        https://dig.chouti.com/all/hot/recent/13
        https://dig.chouti.com/all/hot/recent/18
        https://dig.chouti.com/all/hot/recent/16
        https://dig.chouti.com/all/hot/recent/17
        https://dig.chouti.com/all/hot/recent/15
        https://dig.chouti.com/all/hot/recent/19
        https://dig.chouti.com/all/hot/recent/20
        https://dig.chouti.com/all/hot/recent/21
        https://dig.chouti.com/all/hot/recent/23
        https://dig.chouti.com/all/hot/recent/25
        https://dig.chouti.com/all/hot/recent/24
        https://dig.chouti.com/all/hot/recent/27
        https://dig.chouti.com/all/hot/recent/29
        https://dig.chouti.com/all/hot/recent/26
        https://dig.chouti.com/all/hot/recent/28
        https://dig.chouti.com/all/hot/recent/22
        https://dig.chouti.com/all/hot/recent/30
        https://dig.chouti.com/all/hot/recent/33
        https://dig.chouti.com/all/hot/recent/31
        https://dig.chouti.com/all/hot/recent/32
        https://dig.chouti.com/all/hot/recent/34
        https://dig.chouti.com/all/hot/recent/37
        https://dig.chouti.com/all/hot/recent/36
        https://dig.chouti.com/all/hot/recent/41
        https://dig.chouti.com/all/hot/recent/38
        https://dig.chouti.com/all/hot/recent/40
        https://dig.chouti.com/all/hot/recent/39
        https://dig.chouti.com/all/hot/recent/45
        https://dig.chouti.com/all/hot/recent/42
        https://dig.chouti.com/all/hot/recent/44
        https://dig.chouti.com/all/hot/recent/43
        https://dig.chouti.com/all/hot/recent/49
        https://dig.chouti.com/all/hot/recent/47
        https://dig.chouti.com/all/hot/recent/46
        https://dig.chouti.com/all/hot/recent/48
        https://dig.chouti.com/all/hot/recent/50
        https://dig.chouti.com/all/hot/recent/53
        https://dig.chouti.com/all/hot/recent/51
        https://dig.chouti.com/all/hot/recent/52
        https://dig.chouti.com/all/hot/recent/56
        https://dig.chouti.com/all/hot/recent/57
        https://dig.chouti.com/all/hot/recent/55
        https://dig.chouti.com/all/hot/recent/35
        https://dig.chouti.com/all/hot/recent/54
        https://dig.chouti.com/all/hot/recent/59
        https://dig.chouti.com/all/hot/recent/60
        https://dig.chouti.com/all/hot/recent/61
        https://dig.chouti.com/all/hot/recent/58
        https://dig.chouti.com/all/hot/recent/62
        https://dig.chouti.com/all/hot/recent/63
        https://dig.chouti.com/all/hot/recent/64
        https://dig.chouti.com/all/hot/recent/65
        https://dig.chouti.com/all/hot/recent/66
        https://dig.chouti.com/all/hot/recent/67
        https://dig.chouti.com/all/hot/recent/68
        https://dig.chouti.com/all/hot/recent/69
        https://dig.chouti.com/all/hot/recent/70
        https://dig.chouti.com/all/hot/recent/71
        https://dig.chouti.com/all/hot/recent/73
        https://dig.chouti.com/all/hot/recent/72
        https://dig.chouti.com/all/hot/recent/74
        https://dig.chouti.com/all/hot/recent/76
        https://dig.chouti.com/all/hot/recent/75
        https://dig.chouti.com/all/hot/recent/77
        https://dig.chouti.com/all/hot/recent/78
        https://dig.chouti.com/all/hot/recent/79
        https://dig.chouti.com/all/hot/recent/80
        https://dig.chouti.com/all/hot/recent/81
        https://dig.chouti.com/all/hot/recent/82
        https://dig.chouti.com/all/hot/recent/83
        https://dig.chouti.com/all/hot/recent/84
        https://dig.chouti.com/all/hot/recent/85
        https://dig.chouti.com/all/hot/recent/86
        https://dig.chouti.com/all/hot/recent/87
        https://dig.chouti.com/all/hot/recent/88
        https://dig.chouti.com/all/hot/recent/89
        https://dig.chouti.com/all/hot/recent/90
        https://dig.chouti.com/all/hot/recent/92
        https://dig.chouti.com/all/hot/recent/91
        https://dig.chouti.com/all/hot/recent/93
        https://dig.chouti.com/all/hot/recent/94
        https://dig.chouti.com/all/hot/recent/97
        https://dig.chouti.com/all/hot/recent/95
        https://dig.chouti.com/all/hot/recent/96
        https://dig.chouti.com/all/hot/recent/98
        https://dig.chouti.com/all/hot/recent/99
        https://dig.chouti.com/all/hot/recent/100
        https://dig.chouti.com/all/hot/recent/101
        https://dig.chouti.com/all/hot/recent/102
        https://dig.chouti.com/all/hot/recent/103
        https://dig.chouti.com/all/hot/recent/104
        https://dig.chouti.com/all/hot/recent/105
        https://dig.chouti.com/all/hot/recent/108
        https://dig.chouti.com/all/hot/recent/106
        https://dig.chouti.com/all/hot/recent/107
        https://dig.chouti.com/all/hot/recent/109
        https://dig.chouti.com/all/hot/recent/111
        https://dig.chouti.com/all/hot/recent/110
        https://dig.chouti.com/all/hot/recent/112
        https://dig.chouti.com/all/hot/recent/113
        https://dig.chouti.com/all/hot/recent/114
        https://dig.chouti.com/all/hot/recent/115
        https://dig.chouti.com/all/hot/recent/116
        https://dig.chouti.com/all/hot/recent/117
        https://dig.chouti.com/all/hot/recent/120
        https://dig.chouti.com/all/hot/recent/118
        https://dig.chouti.com/all/hot/recent/119
        '''
        # So how do we plug in our own dedup? Create a new file and define
        # a class there (see duplication.py above). The old md5 bookkeeping
        # in between is no longer needed.
        res2 = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res2:
            url = "https://dig.chouti.com%s" % url
            yield Request(url=url, callback=self.parse)
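Storing full URL strings in a set works, but Scrapy's default filter (RFPDupeFilter) instead stores fixed-size request fingerprints. A minimal sketch of the idea, hashing only the URL with `hashlib` (the real fingerprint also canonicalizes the URL and includes method and body):

```python
import hashlib

def url_fingerprint(url):
    # Hash the URL so the seen-set stores fixed-size digests instead of
    # arbitrarily long URL strings. This mirrors the idea behind Scrapy's
    # request fingerprints, in simplified form.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

seen = set()
for url in ["https://dig.chouti.com/all/hot/recent/1",
            "https://dig.chouti.com/all/hot/recent/2",
            "https://dig.chouti.com/all/hot/recent/1"]:
    fp = url_fingerprint(url)
    if fp not in seen:
        seen.add(fp)

print(len(seen))  # 2 -- the repeated URL was filtered out
```

Each digest is a constant 40 hex characters, so memory use grows predictably no matter how long the URLs are.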
Configuration file:
DEPTH_LIMIT = 0
# The configuration file must point Scrapy at the class used for
# filtering; only then will our own class perform the deduplication
DUPEFILTER_CLASS = 'chouti.duplication.DupeFilter'
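The `DUPEFILTER_CLASS` value is a dotted path that Scrapy resolves to a class at startup (via its `scrapy.utils.misc.load_object` helper) and then instantiates through `from_settings`. A simplified sketch of that resolution step, demonstrated on a stdlib path since `chouti.duplication` only exists inside the project:

```python
from importlib import import_module

def load_object(path):
    # Split 'package.module.Name' into a module path and an attribute
    # name, import the module, and return the attribute. This is a
    # simplified version of what Scrapy does with DUPEFILTER_CLASS.
    module_path, name = path.rsplit(".", 1)
    return getattr(import_module(module_path), name)

# Demonstrated on a stdlib class; in the project Scrapy would resolve
# 'chouti.duplication.DupeFilter' the same way, then call from_settings().
cls = load_object("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

This is why the `@classmethod from_settings` hook matters: Scrapy, not our code, constructs the filter instance from the resolved class.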