Using scrapy-splash and how to set a proxy IP
First, let's walk through how to set up scrapy-splash:
1. Install it: $ pip install scrapy-splash
2. Start the Splash Docker container: $ docker run -p 8050:8050 scrapinghub/splash
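Before wiring Splash into Scrapy, you can sanity-check that it is up by hitting its render.html endpoint directly (a quick check assuming the default port mapping above; example.com is just a placeholder target):

$ curl 'http://localhost:8050/render.html?url=https://example.com&wait=0.5'

If Splash is running, this returns the rendered HTML of the page.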
3. Add the following to settings.py:
3.1. SPLASH_URL = 'http://192.168.59.103:8050'
3.2. DOWNLOADER_MIDDLEWARES = {
         'scrapy_splash.SplashCookiesMiddleware': 723,
         'scrapy_splash.SplashMiddleware': 725,
         'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
     }
3.3. SPIDER_MIDDLEWARES = {
         'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
     }
3.4. DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
3.5. HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
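One thing to watch: SPLASH_URL must point at wherever your Splash container is actually reachable. The 192.168.59.103 address above is typical of an older docker-machine setup; if you started Docker locally with the command from step 2, something like the following would apply instead (an assumption about your setup, adjust as needed):

SPLASH_URL = 'http://localhost:8050'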
That completes the scrapy-splash configuration; next, let's see how to use it.
Here we'll scrape a JD.com product page as an example:
spider.py
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest


class TaoBaoSpider(CrawlSpider):
    name = 'taobao_spider'
    start_urls = ['https://item.jd.com/4736647.html?cpdad=1DLSUE']

    def start_requests(self):
        for url in self.start_urls:
            # render the page through Splash, waiting 0.5s for JS to run
            yield SplashRequest(url=url, callback=self.parse, args={'wait': '0.5'})

    def parse(self, response):
        # the price is injected by JavaScript, so it only appears
        # in the Splash-rendered HTML
        pic = response.xpath('//span[@class="price J-p-4736647"]/text()').extract()[0]
        print(pic)
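To try it out, run the spider by name from the project root (assuming a standard Scrapy project layout):

$ scrapy crawl taobao_spider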
Running the spider, the product price is printed to the console.
Now let's add a proxy middleware to our Scrapy project:
middlewares.py
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # proxyServer is the proxy endpoint (e.g. "http://host:port") and
        # proxyAuth a "Basic <base64(user:pass)>" header value; both must be
        # defined or imported in this module (see the sketch below)
        request.meta['splash']['args']['proxy'] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
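The middleware above assumes proxyServer and proxyAuth already exist. A minimal sketch of such definitions, with placeholder host and credentials you would replace with your proxy provider's actual values:

import base64

# placeholder values for illustration only
proxyHost = "proxy.example.com"
proxyPort = "8000"
proxyUser = "username"
proxyPass = "password"

proxyServer = "http://%s:%s" % (proxyHost, proxyPort)
proxyAuth = "Basic " + base64.b64encode(
    ("%s:%s" % (proxyUser, proxyPass)).encode("utf-8")).decode("ascii")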
- Note that when going through Splash, the proxy is no longer set the usual Scrapy way:
request.meta['proxy'] = proxyServer
Instead it goes into the Splash arguments:
request.meta['splash']['args']['proxy'] = proxyServer
This is because the page is actually fetched by the Splash server, not by Scrapy's downloader, so the proxy must be handed to Splash itself. (A slightly more defensive variant of the middleware is sketched below.)
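Since request.meta['splash'] only exists for requests created via SplashRequest, indexing it unconditionally would raise a KeyError for any plain Request. A defensive sketch, assuming the same proxyServer/proxyAuth values as above:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        splash = request.meta.get('splash')
        # only touch requests that actually go through Splash
        if splash is not None:
            splash.setdefault('args', {})['proxy'] = proxyServer
            request.headers['Proxy-Authorization'] = proxyAuth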
Next, register ProxyMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
'Spider.middlewares.ProxyMiddleware': 843,
}
- Note that the custom middleware's priority number must place it after the scrapy-splash middlewares (here 843 > 725).
With that, you can happily scrape data with scrapy-splash through a proxy!
Author: sunoath
Link: https://www.jianshu.com/p/7ec32ee1e9d4
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.