Setting up scrapy-redis on Python 2, with a simple example: crawling Douban movie reviews
The difference between Scrapy and scrapy-redis
Scrapy is a general-purpose crawling framework, but it has no built-in support for distributed crawling. scrapy-redis provides a set of Redis-based components (components only, not a new framework) that make it easier to run Scrapy crawls in a distributed fashion.
scrapy-redis provides the following four components (which means these four parts of Scrapy are replaced or adapted accordingly):
- Scheduler
- Duplication Filter
- Item Pipeline
- Base Spider
Install scrapy-redis
pip install scrapy-redis
The scrapy-redis architecture
As shown in the architecture diagram above, scrapy-redis adds Redis to the standard Scrapy architecture and, building on Redis features, extends the following components:
Scheduler:
Scrapy replaced Python's built-in collections.deque (a double-ended queue) with its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share this queue of pending requests; in other words, Scrapy by itself does not support distributed crawling. The scrapy-redis solution is to swap the Scrapy queue for a Redis database (used as a Redis queue): the requests to crawl are stored on a single redis-server, so multiple spiders can read them from the same database.
In Scrapy, the component most directly tied to the queue of pending requests is the Scheduler, which keeps one queue per priority in a dict, roughly:
{
    priority 0 : queue 0
    priority 1 : queue 1
    priority 2 : queue 2
}
A request is then placed into the queue that matches its priority, and requests from the queue with the smaller priority value are popped first. Managing this rather elaborate dict of queues requires the Scheduler to provide a whole set of methods, and because the original Scheduler cannot share its queues across processes or machines, it is replaced by the scheduler component from scrapy-redis.
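A toy, self-contained sketch of this priority-keyed queue idea (plain Python, not Scrapy's actual code):

from collections import deque

class PriorityQueues(object):
    """Toy model of a scheduler keeping one FIFO queue per priority."""

    def __init__(self):
        self.queues = {}  # {priority: deque of requests}

    def push(self, request, priority=0):
        self.queues.setdefault(priority, deque()).append(request)

    def pop(self):
        # the queue with the smallest priority value is served first
        for priority in sorted(self.queues):
            if self.queues[priority]:
                return self.queues[priority].popleft()
        return None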
Duplication Filter
Scrapy implements request de-duplication with a Python set: the fingerprint of every request already sent is stored in the set, and the fingerprint of each new request is checked against it. If the fingerprint is already in the set, the request has been sent before; otherwise processing continues. The core of this check looks like this:
def request_seen(self, request):
    # self.fingerprints is the set of fingerprints already seen
    fp = self.request_fingerprint(request)
    # the core de-duplication check
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
In scrapy-redis, de-duplication is handled by the Duplication Filter component, which makes clever use of the fact that a Redis set stores no duplicates. The scrapy-redis scheduler receives requests from the engine, adds each request's fingerprint to a Redis set to check for duplicates, and pushes only the requests that are not duplicates into the Redis request queue.
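A minimal sketch of that idea with redis-py (the key name follows the '<spider>:dupefilter' pattern scrapy-redis uses together with its scheduler; this is illustrative, not the library's actual code):

import redis

server = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

def request_seen(fp, key='douban:dupefilter'):
    # SADD returns 1 if the fingerprint was newly added, 0 if it already existed
    added = server.sadd(key, fp)
    return added == 0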
When the engine asks for a request (on behalf of the spider), the scheduler pops a request from the Redis request queue according to priority and returns it to the engine, which passes it on to the spider for processing.
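A simplified sketch of how such a priority queue can be modelled with a Redis sorted set (redis-py >= 3.0 zadd signature assumed; the key name mirrors the default '<spider>:requests' pattern, and this is not scrapy-redis's exact implementation):

import redis

server = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
QUEUE_KEY = 'douban:requests'

def push(serialized_request, priority=0):
    # smaller score = smaller priority value = popped first, matching the behaviour described above
    server.zadd(QUEUE_KEY, {serialized_request: priority})

def pop():
    # atomically take the entry with the smallest score
    pipe = server.pipeline()
    pipe.zrange(QUEUE_KEY, 0, 0)
    pipe.zremrangebyrank(QUEUE_KEY, 0, 0)
    results, _ = pipe.execute()
    return results[0] if results else None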
Item Pipeline:
The engine hands the scraped items (returned by the spider) to the Item Pipeline; the scrapy-redis Item Pipeline stores them in the Redis items queue.
Because the modified Item Pipeline makes it easy to pull items out of the items queue by key, a separate cluster of item-processing workers can consume them, as in the sketch below.
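A minimal consumer sketch, assuming the default scrapy-redis key pattern '<spider>:items' and JSON-serialized items (both are assumptions about the default configuration):

import json
import redis

server = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

while True:
    # BLPOP blocks until an item appears on the 'douban:items' list
    _, raw = server.blpop('douban:items')
    item = json.loads(raw)
    print(item)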
Base Spider
Scrapy's original Spider class is no longer used directly. The rewritten RedisSpider inherits from both Spider and RedisMixin, where RedisMixin is the class that reads URLs from Redis.
When we define a spider that inherits from RedisSpider, the setup_redis function is called; it connects to the Redis database and then registers two signals:
One fires when the spider is idle: spider_idle is called, which calls schedule_next_request to keep the spider alive and then raises a DontCloseSpider exception.
The other fires when an item is scraped: item_scraped is called, which also calls schedule_next_request to fetch the next request.
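A simplified sketch of that signal wiring, following the description above (the real RedisMixin in scrapy-redis also handles the Redis connection and request batching; this is illustrative only):

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class RedisMixinSketch(object):
    def setup_redis(self, crawler):
        # connect to Redis here (omitted), then register the two signals
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    def schedule_next_request(self):
        # pop the next URL from Redis and feed it to the crawler (omitted)
        pass

    def spider_idle(self):
        # keep pulling work from Redis and prevent the spider from closing
        self.schedule_next_request()
        raise DontCloseSpider

    def item_scraped(self, *args, **kwargs):
        # after every scraped item, ask for the next request
        self.schedule_next_request()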
Crawling Douban reviews
Douban's content is static and therefore easy to crawl, so it serves as the example for this article. Thanks to Douban for the data.
Create the Scrapy project
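Assuming the project name scrapy_redis_test and the spider name douban used throughout this article, the project can be generated with Scrapy's standard commands:
scrapy startproject scrapy_redis_test
cd scrapy_redis_test
scrapy genspider douban movie.douban.com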
RedisPipeline
Enable the Redis pipeline in settings.py; scraped data is then written to the Redis database automatically. If no connection is configured explicitly, it is stored in db 0 on the local machine.
ITEM_PIPELINES = {
    'scrapy_redis_test.pipelines.spider1JsonPipeline': 401,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
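If Redis is not running on the default localhost:6379, db 0, the connection can be set explicitly with the standard scrapy-redis settings (example values):
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# or a single connection URL instead:
# REDIS_URL = 'redis://127.0.0.1:6379/0'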
Start Redis
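Assuming a local Redis installation, start the server and check that it responds:
redis-server
redis-cli ping    # should reply with PONG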
The spider file
# -*- coding: utf-8 -*-
from scrapy.spiders import Rule                    # imported but unused in this plain-Spider example
from scrapy_redis.spiders import RedisCrawlSpider  # imported but unused; see the RedisSpider sketch below
import scrapy
from ..items import DoubanspiderItem
import sys

# Python 2 only: make utf-8 the default encoding so Chinese text round-trips cleanly
reload(sys)
sys.setdefaultencoding('utf-8')


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start = 0
    url = 'https://movie.douban.com/top250?start='
    end = '&filter='
    start_urls = [url + str(start) + end]

    def parse(self, response):
        movies = response.xpath("//div[@class='info']")
        for each in movies:
            # build a fresh item for every movie entry
            item = DoubanspiderItem()
            title = each.xpath('div[@class="hd"]/a/span[@class="title"]/text()').extract()
            content = each.xpath('div[@class="bd"]/p/text()').extract()
            score = each.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
            info = each.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            item['title'] = title[0]
            # join the content list into a single string, separated by ';'
            item['content'] = ';'.join(content)
            item['score'] = score[0]
            # some entries have no quote, so guard against an empty list
            item['info'] = info[0] if info else ''
            # hand the item over to the pipelines
            yield item
        # the Top 250 list runs from start=0 to start=225 in steps of 25
        if self.start < 225:
            self.start += 25
            yield scrapy.Request(self.url + str(self.start) + self.end, callback=self.parse)
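To make the spider itself read its start URLs from Redis rather than hard-coding start_urls, it can be rewritten as a RedisSpider. The sketch below is hypothetical (the class name, spider name and redis_key value are illustrative), but redis_key is the standard scrapy-redis mechanism:

from scrapy_redis.spiders import RedisSpider

class DoubanRedisSpider(RedisSpider):
    name = "douban_redis"                    # hypothetical spider name
    allowed_domains = ["movie.douban.com"]
    redis_key = "douban_redis:start_urls"    # Redis list the workers pop start URLs from

    def parse(self, response):
        # same parsing logic as DoubanSpider.parse above
        pass

A start URL can then be pushed from any machine, for example with redis-cli lpush douban_redis:start_urls https://movie.douban.com/top250, and every running worker picks up work from the shared queue.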
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyRedisTestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanspiderItem(scrapy.Item):
    # movie title
    title = scrapy.Field()
    # movie rating
    score = scrapy.Field()
    # movie details
    content = scrapy.Field()
    # short quote / introduction
    info = scrapy.Field()
middlewares.py (auto-generated)
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class ScrapyRedisTestSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyRedisTestDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.conf import settings  # old-style settings access, kept as in the original project
import pymongo
import json


class ScrapyRedisTestPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanspiderPipeline(object):
    def __init__(self):
        # read host, port and database name from settings
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        # pymongo.MongoClient(host, port) creates the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mdb = client[dbname]
        # select the collection that the data is written to
        self.post = mdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        data = dict(item)
        # insert the item into the collection
        self.post.insert(data)
        return item


class spider1JsonPipeline(object):
    def __init__(self):
        self.file = open('tiezi.json', 'wb')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # encode explicitly so the text can be written to the file opened in 'wb' mode
        self.file.write(content.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for scrapy_redis_test project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'scrapy_redis_test'
SPIDER_MODULES = ['scrapy_redis_test.spiders']
NEWSPIDER_MODULE = 'scrapy_redis_test.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_redis_test (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'scrapy_redis_test.middlewares.ScrapyRedisTestSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'scrapy_redis_test.middlewares.ScrapyRedisTestDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'scrapy_redis_test.pipelines.ScrapyRedisTestPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Use the de-duplication component from scrapy-redis instead of Scrapy's default dupefilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler component from scrapy-redis instead of Scrapy's default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Queue class for the requests (left commented out to keep the default)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Allow pausing and resuming: the request records kept in Redis are not cleared
SCHEDULER_PERSIST = True
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'newdongguan (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
ITEM_PIPELINES = {
    'scrapy_redis_test.pipelines.spider1JsonPipeline': 401,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
MONGODB_DBNAME = 'db0'
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017  # MongoDB's default port
MONGODB_DOCNAME = 'douban'  # collection name read by DoubanspiderPipeline (example value)
FEED_EXPORT_ENCODING = 'utf-8'
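With everything in place, start one or more crawler processes and inspect Redis. The key names below follow scrapy-redis defaults ('<spider>:requests' for the queue, '<spider>:dupefilter' for fingerprints, '<spider>:items' for the RedisPipeline) and are shown as an assumed check, not output from an actual run:
scrapy crawl douban
redis-cli
> KEYS 'douban:*'
> LRANGE douban:items 0 1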