
Scrapy-Redis: RedisSpider and RedisCrawlSpider Explained

In the previous chapter, "Getting Started with Scrapy-Redis in Practice", we used scrapy-redis to deploy a distributed crawler for JD books and scrape its data. One problem remained:

Every spider instance must begin crawling from start_urls when it starts, so every instance requests the addresses in start_urls. These are duplicate requests that waste system resources.

To solve this, Scrapy-Redis provides two spider classes, RedisSpider and RedisCrawlSpider. A spider inheriting from either class fetches its start URLs from a designated Redis list at startup. When any spider instance takes a URL from that list, the URL is popped off, so no other instance can read it again. Instances that fail to obtain a start URL simply block until either a new start address is pushed onto the start_urls list or a pending request appears in the Redis requests queue.
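
The handoff can be pictured with plain redis-py calls (a minimal sketch, assuming a local Redis and the key this article uses later); each lpop atomically removes one URL, which is why no two instances can receive the same start URL:

import redis

r = redis.Redis(host='localhost', port=6379)
r.lpush('dangdang:book', 'http://book.dangdang.com/')  # what the redis-cli lpush later in this article does
print(r.lpop('dangdang:book'))  # one spider instance gets b'http://book.dangdang.com/'
print(r.lpop('dangdang:book'))  # every other instance gets None and keeps waiting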

Here we demonstrate the usage of both spiders with a simple example: scraping book information from dangdang.com.

The settings.py configuration is as follows:

# -*- coding: utf-8 -*-

BOT_NAME = 'dang_dang'

SPIDER_MODULES = ['dang_dang.spiders']
NEWSPIDER_MODULE = 'dang_dang.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


######################################################
############# Scrapy-Redis related settings ##########
######################################################

# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Use the scrapy-redis scheduler, which stores the requests queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spider instances share the same Redis-backed duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Persist the requests queue in Redis, so crawls can be paused and resumed
SCHEDULER_PERSIST = True

# Request scheduling strategy; the default is a priority queue
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Store scraped items in Redis for post-processing
ITEM_PIPELINES = {
  'scrapy_redis.pipelines.RedisPipeline': 300
}
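
For reference, scrapy-redis also ships FIFO and LIFO queue implementations; either line below could be swapped in for the priority queue configured above, depending on whether breadth-first or depth-first crawling is preferred:

# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # breadth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  # depth-first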

A RedisSpider example

# -*- coding: utf-8 -*-
import scrapy
import re
import urllib
from copy import deepcopy
from scrapy_redis.spiders import RedisSpider


class DangdangSpider(RedisSpider):
  name = 'dangdang'
  allowed_domains = ['dangdang.com']
  redis_key = 'dangdang:book'
  pattern = re.compile(r"(http|https)://category.dangdang.com/cp(.*?).html", re.I)

  # def __init__(self, *args, **kwargs):
  #   # Dynamically define the allowed domains
  #   domain = kwargs.pop('domain', '')
  #   self.allowed_domains = filter(None, domain.split(','))
  #   super(DangdangSpider, self).__init__(*args, **kwargs)

  def parse(self, response): # extract book category info from the homepage
    # Extract the first-level category elements
    div_list = response.xpath("//div[@class='con flq_body']/div")
    for div in div_list:
      item = {}
      item["b_cate"] = div.xpath("./dl/dt//text()").extract()
      item["b_cate"] = [i.strip() for i in item["b_cate"] if len(i.strip()) > 0]
      # Extract the second-level category elements
      dl_list = div.xpath("./div//dl[@class='inner_dl']")
      for dl in dl_list:
        item["m_cate"] = dl.xpath(".//dt/a/@title").extract_first()
        # Extract the third-level category elements
        a_list = dl.xpath("./dd/a")
        for a in a_list:
          item["s_cate"] = a.xpath("./text()").extract_first()
          item["s_href"] = a.xpath("./@href").extract_first()
          if item["s_href"] is not None and self.pattern.match(item["s_href"]) is not None:
            yield scrapy.Request(item["s_href"], callback=self.parse_book_list, meta={"item": deepcopy(item)})

  def parse_book_list(self, response): # extract data from a book list page
    item = response.meta['item']
    li_list = response.xpath("//ul[@class='bigimg']/li")
    for li in li_list:
      item["book_img"] = li.xpath("./a[@class='pic']/img/@src").extract_first()
      if item["book_img"] == "images/model/guan/url_none.png":  # lazy-loaded image placeholder
        item["book_img"] = li.xpath("./a[@class='pic']/img/@data-original").extract_first()
      item["book_name"] = li.xpath("./p[@class='name']/a/@title").extract_first()
      item["book_desc"] = li.xpath("./p[@class='detail']/text()").extract_first()
      item["book_price"] = li.xpath(".//span[@class='search_now_price']/text()").extract_first()
      item["book_author"] = li.xpath("./p[@class='search_book_author']/span[1]/a/text()").extract_first()
      item["book_publish_date"] = li.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
      if item["book_publish_date"] is not None:
        item["book_publish_date"] = item["book_publish_date"].replace('/', '')
      item["book_press"] = li.xpath("./p[@class='search_book_author']/span[3]/a/text()").extract_first()
      yield deepcopy(item)

    # Extract the URL of the next page
    next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
    if next_url is not None:
      next_url = urllib.parse.urljoin(response.url, next_url)
      yield scrapy.Request(next_url, callback=self.parse_book_list, meta={"item": item})
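
With the spider written, each crawler instance is started the usual way; running the same command on several machines (or in several terminals) against the same Redis instance gives the distributed setup described earlier:

scrapy crawl dangdang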

When the start_urls list under the Redis key dangdang:book is empty, a freshly started DangdangSpider blocks and waits for data to be pushed onto the list; the console output looks something like this:

2019-05-08 14:02:53 [scrapy.core.engine] INFO: Spider opened
2019-05-08 14:02:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-08 14:02:53 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

At this point we need to push the spider's initial crawl address onto the start_urls list, which can be done with the following Redis command:

lpush dangdang:book http://book.dangdang.com/

Shortly after the command runs, DangdangSpider starts crawling. The structure of the scraped data is shown in the figure below:

[Figure: structure of the items scraped by DangdangSpider]
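
The items written by RedisPipeline can also be inspected directly (a minimal sketch, assuming the pipeline's default key pattern '<spider name>:items', i.e. 'dangdang:items' here):

import json
import redis

r = redis.Redis(host='localhost', port=6379)
print(r.llen('dangdang:items'))  # number of items stored so far
first = r.lindex('dangdang:items', 0)
if first is not None:
  print(json.loads(first))  # RedisPipeline serializes each item as JSON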

A RedisCrawlSpider example

# -*- coding: utf-8 -*-
import scrapy
import re
import urllib
from copy import deepcopy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


class DangdangCrawler(RedisCrawlSpider):
  name = 'dangdang2'
  allowed_domains = ['dangdang.com']
  redis_key = 'dangdang:book'
  pattern = re.compile(r"(http|https)://category.dangdang.com/cp(.*?).html", re.I)

  rules = (
    Rule(LinkExtractor(allow=r'(http|https)://category.dangdang.com/cp(.*?).html'),
         callback='parse_book_list', follow=False),
  )

  def parse_book_list(self, response): # extract data from a book list page
    item = {}
    item['book_list_page'] = response.url  # record which list page the item came from
    li_list = response.xpath("//ul[@class='bigimg']/li")
    for li in li_list:
      item["book_img"] = li.xpath("./a[@class='pic']/img/@src").extract_first()
      if item["book_img"] == "images/model/guan/url_none.png":  # lazy-loaded image placeholder
        item["book_img"] = li.xpath("./a[@class='pic']/img/@data-original").extract_first()
      item["book_name"] = li.xpath("./p[@class='name']/a/@title").extract_first()
      item["book_desc"] = li.xpath("./p[@class='detail']/text()").extract_first()
      item["book_price"] = li.xpath(".//span[@class='search_now_price']/text()").extract_first()
      item["book_author"] = li.xpath("./p[@class='search_book_author']/span[1]/a/text()").extract_first()
      item["book_publish_date"] = li.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
      if item["book_publish_date"] is not None:
        item["book_publish_date"] = item["book_publish_date"].replace('/', '')
      item["book_press"] = li.xpath("./p[@class='search_book_author']/span[3]/a/text()").extract_first()
      yield deepcopy(item)

    # Extract the URL of the next page; the Rule above has follow=False,
    # so paging is handled here explicitly
    next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
    if next_url is not None:
      next_url = urllib.parse.urljoin(response.url, next_url)
      yield scrapy.Request(next_url, callback=self.parse_book_list)

Like DangdangSpider, DangdangCrawler blocks while it cannot obtain an initial crawl address and starts crawling as soon as an address appears in the start_urls list. The structure of the scraped data is shown in the figure below:

[Figure: structure of the items scraped by DangdangCrawler]

This concludes our detailed look at Scrapy-Redis's RedisSpider and RedisCrawlSpider. For more on the topic, search our earlier articles or browse the related articles below, and we hope you will continue to support us!