Using scrapy-splash on CentOS
阿新 • Published: 2018-11-16
Preparation
- First complete a simple Scrapy project
- Install Docker
- On Windows, download the installer package and install it
- On macOS, download the installer package and install it (installing via brew turned out to be very complicated to set up and start, so the installer package was used instead)
- On CentOS 7, run:
yum install docker
- On RHEL, run:
yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
- Install scrapy-splash:
pip install scrapy-splash
- Start the Docker service
- On CentOS 7:
service docker start
- On Windows, just launch the Docker application
- On macOS, just launch the Docker application
- On CentOS 7, pull the image:
docker pull scrapinghub/splash
- Run the image:
docker run -p 8050:8050 scrapinghub/splash
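Once the container is running, Splash also answers plain HTTP on port 8050 (e.g. its render.html endpoint), which is handy for a quick smoke test before wiring up Scrapy. A minimal sketch of building such a request URL with only the standard library (the target URL and wait value are just examples):

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050"  # address of the container started above

def render_url(target, wait=0.5):
    """Build a URL for Splash's render.html HTTP endpoint."""
    return f"{SPLASH}/render.html?" + urlencode({"url": target, "wait": wait})

print(render_url("http://example.com"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Opening the printed URL in a browser (or with curl) should return the rendered HTML of the target page if the container is up.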
- Configure the Splash service (all of the following go in settings.py):
- Add the Splash server address:
SPLASH_URL = 'http://localhost:8050'
- Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
- Enable SplashDeduplicateArgsMiddleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
- Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
- Set a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
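The reason a Splash-aware dupe filter is needed: every SplashRequest is actually sent to the same Splash endpoint, so a fingerprint based only on the outgoing URL would collapse requests for different pages. A rough illustration of the idea (this is not scrapy-splash's real implementation, just a sketch showing that the Splash arguments must be part of the key):

```python
import hashlib
import json

def fingerprint(endpoint, args):
    """Hash the endpoint together with the Splash args, not the endpoint alone."""
    payload = json.dumps({"url": endpoint, "args": args}, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()

# Two requests to the same Splash endpoint, but for different target pages:
a = fingerprint("http://localhost:8050/render.html", {"url": "http://example.com"})
b = fingerprint("http://localhost:8050/render.html", {"url": "http://example.com/foo"})
assert a != b  # distinct pages produce distinct fingerprints
```

Without the args in the key, `a` and `b` would be identical and the second page would be dropped as a duplicate.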
- Example:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        pass  # ...
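By the time parse() runs, response holds the JavaScript-rendered HTML that Splash produced. In a real spider you would use response.css() or response.xpath(); purely as an illustration, a standard-library sketch of pulling a title out of such rendered HTML:

```python
import re

def extract_title(html):
    """Return the contents of the first <title> tag, or None if absent."""
    m = re.search(r"<title>(.*?)</title>", html, re.S)
    return m.group(1).strip() if m else None

print(extract_title("<html><head><title>Example Domain</title></head></html>"))
# Example Domain
```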