
Using scrapy-splash on CentOS

Preparation

  • First, have a basic Scrapy project set up
  • Install Docker
    • On Windows, download and run the installer package
    • On macOS, download and run the installer package (installing via brew turned out to be quite involved to set up and start, so the installer package was used instead)
    • On CentOS 7, run:

      yum install docker

  • On RHEL, run:

    yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
    
  • Install scrapy-splash
    pip install scrapy-splash
    
  • Start the Docker service
    • CentOS 7

      service docker start

    • On Windows, just launch the Docker application

    • On macOS, just launch the Docker application
  • Pull the image

    docker pull scrapinghub/splash
    
  • Run the image
    docker run -p 8050:8050 scrapinghub/splash
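Once the container is running, Splash listens on port 8050 and its render.html endpoint returns the JavaScript-rendered HTML of a page. A quick way to compose such a request URL from Python, using only the standard library (the helper name and wait value are illustrative, not part of scrapy-splash):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"

def splash_render_url(target_url, wait=0.5):
    """Build the URL for Splash's render.html endpoint for a target page."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

print(splash_render_url("http://example.com"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Opening that URL in a browser (or with curl) is a handy sanity check that the Splash container is reachable before wiring it into Scrapy.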
    
  • Configure the Splash service (all of the following goes in settings.py):
    • Add the Splash server address:

      SPLASH_URL = 'http://localhost:8050'

    • Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

      DOWNLOADER_MIDDLEWARES = {
          'scrapy_splash.SplashCookiesMiddleware': 723,
          'scrapy_splash.SplashMiddleware': 725,
          'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
      }
      
    • Enable SplashDeduplicateArgsMiddleware:
      SPIDER_MIDDLEWARES = {
          'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
      }
      
    • Set a custom DUPEFILTER_CLASS:
      DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
      
    • Set a custom cache storage backend:
      HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
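Taken together, the settings.py changes above amount to the following fragment (server address as configured when starting the container):

```python
# settings.py -- consolidated scrapy-splash configuration
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```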
      
  • Example

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # ...
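For pages that need more than a fixed wait, Splash also accepts Lua scripts through its execute endpoint, which SplashRequest passes via the lua_source argument. A minimal sketch; the script itself is illustrative:

```python
# Illustrative Lua script for Splash's 'execute' endpoint: load the page,
# wait for rendering, and return the final HTML.
lua_script = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(1.0)
  return {html = splash:html()}
end
"""

# In the spider, the script would be passed like this (per the scrapy-splash
# SplashRequest API):
#   yield SplashRequest(url, self.parse,
#                       endpoint='execute',
#                       args={'lua_source': lua_script})
```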