Setting up scrapy-redis on Python 2, with a simple example: crawling Douban movie reviews
The difference between Scrapy and scrapy-redis
Scrapy is a general-purpose crawling framework, but it has no built-in support for distributed crawling. scrapy-redis provides a set of Redis-based components (components only, not a new framework) that make it easier to run Scrapy crawls in a distributed fashion.
scrapy-redis provides the following four components (which means these four parts of Scrapy are replaced or adapted accordingly):
- Scheduler
- Duplication Filter
- Item Pipeline
- Base Spider
Install scrapy-redis
pip install scrapy-redis
The scrapy-redis architecture
As shown in the architecture diagram above, scrapy-redis adds Redis to the standard Scrapy architecture and, building on Redis features, extends the following components:
Scheduler:
Scrapy replaced Python's built-in collections.deque (a double-ended queue) with its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share this queue of pending requests; in other words, Scrapy by itself does not support distributed crawling. The scrapy-redis solution is to swap the Scrapy queue for a Redis database (used as a Redis queue): the requests to crawl are stored on a single redis-server, so multiple spiders can read them from the same database.
In Scrapy, the component most directly tied to the queue of pending requests is the Scheduler, which keeps one queue per priority in a dict, roughly:
{
    priority 0 : queue 0
    priority 1 : queue 1
    priority 2 : queue 2
}
A request is then placed into the queue that matches its priority, and requests from the queue with the smaller priority value are popped first. Managing this rather elaborate dict of queues requires the Scheduler to provide a whole set of methods, and because the original Scheduler cannot share its queues across processes or machines, it is replaced by the scheduler component from scrapy-redis.
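A toy, self-contained sketch of this priority-keyed queue idea (plain Python, not Scrapy's actual code):

from collections import deque

class PriorityQueues(object):
    """Toy model of a scheduler keeping one FIFO queue per priority."""

    def __init__(self):
        self.queues = {}  # {priority: deque of requests}

    def push(self, request, priority=0):
        self.queues.setdefault(priority, deque()).append(request)

    def pop(self):
        # the queue with the smallest priority value is served first
        for priority in sorted(self.queues):
            if self.queues[priority]:
                return self.queues[priority].popleft()
        return None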
Duplication Filter
Scrapy implements request de-duplication with a Python set: the fingerprint of every request already sent is stored in the set, and the fingerprint of each new request is checked against it. If the fingerprint is already in the set, the request has been sent before; otherwise processing continues. The core of this check looks like this:
def request_seen(self, request):
    # self.fingerprints is the set of fingerprints already seen
    fp = self.request_fingerprint(request)
    # the core de-duplication check
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
In scrapy-redis, de-duplication is handled by the Duplication Filter component, which makes clever use of the fact that a Redis set stores no duplicates. The scrapy-redis scheduler receives requests from the engine, adds each request's fingerprint to a Redis set to check for duplicates, and pushes only the requests that are not duplicates into the Redis request queue.
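A minimal sketch of that idea with redis-py (the key name follows the '<spider>:dupefilter' pattern scrapy-redis uses together with its scheduler; this is illustrative, not the library's actual code):

import redis

server = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

def request_seen(fp, key='douban:dupefilter'):
    # SADD returns 1 if the fingerprint was newly added, 0 if it already existed
    added = server.sadd(key, fp)
    return added == 0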
When the engine asks for a request (on behalf of the spider), the scheduler pops a request from the Redis request queue according to priority and returns it to the engine, which passes it on to the spider for processing.
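A simplified sketch of how such a priority queue can be modelled with a Redis sorted set (redis-py >= 3.0 zadd signature assumed; the key name mirrors the default '<spider>:requests' pattern, and this is not scrapy-redis's exact implementation):

import redis

server = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
QUEUE_KEY = 'douban:requests'

def push(serialized_request, priority=0):
    # smaller score = smaller priority value = popped first, matching the behaviour described above
    server.zadd(QUEUE_KEY, {serialized_request: priority})

def pop():
    # atomically take the entry with the smallest score
    pipe = server.pipeline()
    pipe.zrange(QUEUE_KEY, 0, 0)
    pipe.zremrangebyrank(QUEUE_KEY, 0, 0)
    results, _ = pipe.execute()
    return results[0] if results else None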
Item Pipeline:
The engine hands the scraped items (returned by the spider) to the Item Pipeline; the scrapy-redis Item Pipeline stores them in the Redis items queue.
Because the modified Item Pipeline makes it easy to pull items out of the items queue by key, a separate cluster of item-processing workers can consume them, as in the sketch below.
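A minimal consumer sketch, assuming the default scrapy-redis key pattern '<spider>:items' and JSON-serialized items (both are assumptions about the default configuration):

import json
import redis

server = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

while True:
    # BLPOP blocks until an item appears on the 'douban:items' list
    _, raw = server.blpop('douban:items')
    item = json.loads(raw)
    print(item)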
Base Spider
Scrapy's original Spider class is no longer used directly. The rewritten RedisSpider inherits from both Spider and RedisMixin, where RedisMixin is the class that reads URLs from Redis.
When we define a spider that inherits from RedisSpider, the setup_redis function is called; it connects to the Redis database and then registers two signals:
One fires when the spider is idle: spider_idle is called, which calls schedule_next_request to keep the spider alive and then raises a DontCloseSpider exception.
The other fires when an item is scraped: item_scraped is called, which also calls schedule_next_request to fetch the next request.
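A simplified sketch of that signal wiring, following the description above (the real RedisMixin in scrapy-redis also handles the Redis connection and request batching; this is illustrative only):

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class RedisMixinSketch(object):
    def setup_redis(self, crawler):
        # connect to Redis here (omitted), then register the two signals
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    def schedule_next_request(self):
        # pop the next URL from Redis and feed it to the crawler (omitted)
        pass

    def spider_idle(self):
        # keep pulling work from Redis and prevent the spider from closing
        self.schedule_next_request()
        raise DontCloseSpider

    def item_scraped(self, *args, **kwargs):
        # after every scraped item, ask for the next request
        self.schedule_next_request()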
Crawling Douban reviews
Douban's content is static and therefore easy to crawl, so it serves as the example for this article. Thanks to Douban for the data.
Create the Scrapy project
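Assuming the project name scrapy_redis_test and the spider name douban used throughout this article, the project can be generated with Scrapy's standard commands:
scrapy startproject scrapy_redis_test
cd scrapy_redis_test
scrapy genspider douban movie.douban.com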
RedisPipeline
Enable the Redis pipeline in settings.py; scraped data is then written to the Redis database automatically. If no connection is configured explicitly, it is stored in db 0 on the local machine.
ITEM_PIPELINES = {
    'scrapy_redis_test.pipelines.spider1JsonPipeline': 401,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
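If Redis is not running on the default localhost:6379, db 0, the connection can be set explicitly with the standard scrapy-redis settings (example values):
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# or a single connection URL instead:
# REDIS_URL = 'redis://127.0.0.1:6379/0'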
Start Redis
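Assuming a local Redis installation, start the server and check that it responds:
redis-server
redis-cli ping    # should reply with PONG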
The spider file
# -*- coding: utf-8 -*-
from scrapy.spiders import Rule                    # imported but unused in this plain-Spider example
from scrapy_redis.spiders import RedisCrawlSpider  # imported but unused; see the RedisSpider sketch below
import scrapy
from ..items import DoubanspiderItem
import sys

# Python 2 only: make utf-8 the default encoding so Chinese text round-trips cleanly
reload(sys)
sys.setdefaultencoding('utf-8')


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start = 0
    url = 'https://movie.douban.com/top250?start='
    end = '&filter='
    start_urls = [url + str(start) + end]

    def parse(self, response):
        movies = response.xpath("//div[@class='info']")
        for each in movies:
            # build a fresh item for every movie entry
            item = DoubanspiderItem()
            title = each.xpath('div[@class="hd"]/a/span[@class="title"]/text()').extract()
            content = each.xpath('div[@class="bd"]/p/text()').extract()
            score = each.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
            info = each.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            item['title'] = title[0]
            # join the content list into a single string, separated by ';'
            item['content'] = ';'.join(content)
            item['score'] = score[0]
            # some entries have no quote, so guard against an empty list
            item['info'] = info[0] if info else ''
            # hand the item over to the pipelines
            yield item
        # the Top 250 list runs from start=0 to start=225 in steps of 25
        if self.start < 225:
            self.start += 25
            yield scrapy.Request(self.url + str(self.start) + self.end, callback=self.parse)
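To make the spider itself read its start URLs from Redis rather than hard-coding start_urls, it can be rewritten as a RedisSpider. The sketch below is hypothetical (the class name, spider name and redis_key value are illustrative), but redis_key is the standard scrapy-redis mechanism:

from scrapy_redis.spiders import RedisSpider

class DoubanRedisSpider(RedisSpider):
    name = "douban_redis"                    # hypothetical spider name
    allowed_domains = ["movie.douban.com"]
    redis_key = "douban_redis:start_urls"    # Redis list the workers pop start URLs from

    def parse(self, response):
        # same parsing logic as DoubanSpider.parse above
        pass

A start URL can then be pushed from any machine, for example with redis-cli lpush douban_redis:start_urls https://movie.douban.com/top250, and every running worker picks up work from the shared queue.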
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyRedisTestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanspiderItem(scrapy.Item):
    # movie title
    title = scrapy.Field()
    # movie rating
    score = scrapy.Field()
    # movie details
    content = scrapy.Field()
    # short quote / introduction
    info = scrapy.Field()
middlewares.py (auto-generated)
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ScrapyRedisTestSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyRedisTestDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.conf import settings  # old-style settings access, kept as in the original project
import pymongo
import json


class ScrapyRedisTestPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanspiderPipeline(object):
    def __init__(self):
        # read host, port and database name from settings
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        # pymongo.MongoClient(host, port) creates the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mdb = client[dbname]
        # select the collection that the data is written to
        self.post = mdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        data = dict(item)
        # insert the item into the collection
        self.post.insert(data)
        return item


class spider1JsonPipeline(object):
    def __init__(self):
        self.file = open('tiezi.json', 'wb')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # encode explicitly so the text can be written to the file opened in 'wb' mode
        self.file.write(content.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for scrapy_redis_test project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'scrapy_redis_test'
SPIDER_MODULES = ['scrapy_redis_test.spiders']
NEWSPIDER_MODULE = 'scrapy_redis_test.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_redis_test (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'scrapy_redis_test.middlewares.ScrapyRedisTestSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'scrapy_redis_test.middlewares.ScrapyRedisTestDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'scrapy_redis_test.pipelines.ScrapyRedisTestPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Use the de-duplication component from scrapy-redis instead of Scrapy's default dupefilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler component from scrapy-redis instead of Scrapy's default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Queue class for the requests (left commented out to keep the default)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Allow pausing and resuming: the request records kept in Redis are not cleared
SCHEDULER_PERSIST = True
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'newdongguan (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
ITEM_PIPELINES = {
    'scrapy_redis_test.pipelines.spider1JsonPipeline': 401,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
MONGODB_DBNAME = 'db0'
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017  # MongoDB's default port
MONGODB_DOCNAME = 'douban'  # collection name read by DoubanspiderPipeline (example value)
FEED_EXPORT_ENCODING = 'utf-8'
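With everything in place, start one or more crawler processes and inspect Redis. The key names below follow scrapy-redis defaults ('<spider>:requests' for the queue, '<spider>:dupefilter' for fingerprints, '<spider>:items' for the RedisPipeline) and are shown as an assumed check, not output from an actual run:
scrapy crawl douban
redis-cli
> KEYS 'douban:*'
> LRANGE douban:items 0 1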