Splash installation and basic usage
阿新 • Published: 2022-03-22
Selenium is a browser-automation testing tool; it makes it easy to simulate mouse clicks, pagination, and similar actions. Its drawback is that it can only load one page at a time and cannot render pages asynchronously, which limits the crawling throughput of a Selenium-based spider.

Splash renders pages asynchronously and can render several pages at once. Its drawback is that it is less flexible than Selenium for page clicks and simulated logins.
1. Install Docker
Install automatically with the official installation script:
curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun
Alternatively, use the domestic DaoCloud one-line installer:
curl -sSL https://get.daocloud.io/docker | sh
2. Install Splash under Docker
Pull the Splash image:
[ywadmin@wzy_woyun ~]$ docker pull scrapinghub/splash
# run the container in the background
[ywadmin@wzy_woyun ~]$ docker run -d -p 8050:8050 --name=splash scrapinghub/splash
# as root, open port 8050 in the firewall
[root@wzy_woyun ~]# firewall-cmd --permanent --add-port=8050/tcp
success
[root@wzy_woyun ~]# firewall-cmd --reload
success
Starting Splash
To bring up the Splash service in Docker:
1. Start Docker first
2. Pull the Splash image
docker pull scrapinghub/splash
3. Start the Splash service
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
4. Open http://<server-ip>:8050 (e.g. http://localhost:8050 when running locally) in a browser to reach the Splash UI
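Once the container is up, it is worth confirming that Splash answers before writing any scripts. A minimal check, assuming Splash is reachable at localhost:8050 and using https://example.com as a stand-in target:

import requests

# Ask Splash to render a page and return the final HTML;
# 'wait' gives the page time to finish executing JavaScript.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 2},
)
resp.raise_for_status()
print(resp.text[:200])  # beginning of the rendered HTML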
Plain Python with dynamic Lua scripts
- Add a request header and fetch a URL
function main(splash, args)
    local url = args.url
    splash:set_user_agent("Mozilla/5.0 Chrome/69.0.3497.100 Safari/537.36")
    splash:go(url)
    splash:wait(2)
    splash:go(url)
    return {
        html = splash:html(),
        png = splash:png()
    }
end
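To run a script like this from plain Python, POST it to Splash's execute endpoint. A minimal sketch, assuming Splash at localhost:8050 and https://example.com as a hypothetical target:

import base64
import requests

# Trimmed version of the Lua script above.
lua = """
function main(splash, args)
    splash:set_user_agent("Mozilla/5.0 Chrome/69.0.3497.100 Safari/537.36")
    splash:go(args.url)
    splash:wait(2)
    return {html = splash:html(), png = splash:png()}
end
"""

# Extra JSON keys (here: url) are exposed to the script as args.<name>.
resp = requests.post(
    "http://localhost:8050/execute",
    json={"lua_source": lua, "url": "https://example.com"},
)
data = resp.json()
html = data["html"]                    # rendered page source
png = base64.b64decode(data["png"])    # Splash base64-encodes binary values
with open("page.png", "wb") as f:
    f.write(png)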
- Trigger dynamically loaded content by scrolling
function main(splash, args)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    scroll_to(0, 2800)
    splash:set_viewport_full()
    splash:wait(5)
    return {html = splash:html()}
end
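Roughly the same scroll-and-wait behavior is available without Lua through the render.html endpoint; a sketch, again assuming Splash on localhost:8050 (js_source injects JavaScript into the page, and viewport='full' plays the role of set_viewport_full):

import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={
        "url": "https://example.com",           # stand-in target
        "js_source": "window.scrollTo(0, 2800);",
        "viewport": "full",
        "wait": 5,                              # give lazy content time to load
    },
)
resp.raise_for_status()
html = resp.text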
To use Splash together with Scrapy, first add the following to settings.py:
SPLASH_URL = 'http://192.168.2.55:8050/'

DOWNLOADER_MIDDLEWARES = {
    'curreny.middlewares.ProcessAllException': 200,
    'curreny.middlewares.CurrenyDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
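Besides the downloader middlewares above, the scrapy-splash README also recommends enabling its spider middleware, which deduplicates the (potentially large) Splash arguments stored on each request:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}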
Then add the Lua script in the spider:
""" 平潭綜合實驗區人民政府 """ import copy import re import time import scrapy import scrapy_splash from curreny.items import CurrenyItem class PingtancomprehensiveexperimentgovproSpider(scrapy.Spider): name = 'PingTanComprehensiveExperimentGovPro' # allowed_domains = ['xxx.com'] start_urls = ['http://www.pingtan.gov.cn/jhtml/cn/8423'] def start_requests(self): lua=""" function main(splash, args) splash.images_enabled = false assert(splash:go(args.url)) assert(splash:wait(1)) js = string.format("document.querySelector('body > div.container > div.main.clearfix > div > div.page > span:nth-child(4) > a').click();", args.page) splash:runjs(js) assert(splash:wait(5)) return splash:html() end """ url="http://www.pingtan.gov.cn/jhtml/cn/8423" for page in range(1,105): yield scrapy_splash.SplashRequest( url=url, endpoint="execute", args={ "url":url, "lua_source":lua, "page":page, "wait":1 }, callback=self.parse ) def parse(self, response,**kwargs): item = CurrenyItem() for li in response.css("body > div.container > div.main.clearfix > div > div.info_list.list > ul > li"): item["title_url"] = 'http://www.pingtan.gov.cn' + str(li.css("a::attr(href)").get()) item["title_name"] = li.css("a::attr(title)").get() item["title_date"] = li.css("span::text").get() yield scrapy.Request( url=item['title_url'], callback=self.parse_detail, meta={'item': copy.deepcopy(item)} ) # 詳情頁解析 def parse_detail(self, response): item = response.meta['item'] item['content_html'] = response.css('.detail').get() print(item['title_name'], item['title_url'], item['title_date'], ) yield item