
Parsing dynamic JS pages with scrapy_splash


Dynamically loaded pages are generally tricky to crawl; the two common approaches are scrapy_splash and selenium. scrapy_splash seems to be the more capable of the two: at least for crawling Meituan, scrapy_splash worked while my selenium attempt did not. That may just be a configuration problem on the selenium side, since in principle either should work.

First you need to install Splash itself, which runs in Docker. Tutorials are easy to find online, and it boils down to two commands, assuming you are on Linux. Once it is running, visit http://127.0.0.1:8050/ to verify.
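For reference, the two commands mentioned above are usually just pulling and running the official Splash image (assuming Docker is already installed):

```shell
# Pull the official Splash image from Docker Hub
docker pull scrapinghub/splash
# Run Splash in the background, exposing its HTTP API on port 8050
docker run -d -p 8050:8050 scrapinghub/splash
```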

Then install the Python module: pip3 install scrapy-splash

scrapy-splash needs a Lua script to drive Splash, for example to add parameters or request headers. Look up the exact syntax online if needed; I am not too familiar with it myself.

After creating the project, a few options need to be set in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # scrapy_splash middlewares
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Splash server address (required by scrapy-splash)
SPLASH_URL = 'http://127.0.0.1:8050'

# scrapy_splash deduplication class
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# finally, configure a cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Then write the spider, which is also simple: issue a SplashRequest and point it at the script to execute, and the response comes back with the dynamically loaded page fully rendered.

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash, args)
    assert(splash:wait(0.5))
    splash:set_custom_headers({
        ['Accept'] = '*/*',
        ['Accept-Language'] = 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        ['Cache-Control'] = 'max-age=0',
        ['Connection'] = 'keep-alive',
        ['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    })
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(10))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""

class MeituanspiderSpider(scrapy.Spider):
    name = 'meituanSpider'
    # allowed_domains = ['zz.meituan.com']
    start_urls = ['http://zz.meituan.com/meishi/pn/']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0], callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script, 'wait': 7})

    def parse(self, response):
        # shop names
        names = response.xpath('//*[@id="app"]/section/div/div[2]/div[2]/div[1]/ul/li/div[2]/a/h4/text()').extract()
        for name in names:
            print(name)
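Under the hood, the args passed to SplashRequest become the JSON body that gets POSTed to Splash's /execute endpoint, with lua_source carrying the script. A minimal sketch of that mapping, using a hypothetical build_payload helper just to make the mechanics concrete:

```python
import json

def build_payload(url, lua_source, wait=7):
    # Mirrors roughly what scrapy-splash sends to http://127.0.0.1:8050/execute:
    # the target url, the Lua script, and any extra args such as wait.
    return {"url": url, "lua_source": lua_source, "wait": wait}

payload = build_payload("http://zz.meituan.com/meishi/pn/",
                        "function main(splash, args) return splash:html() end")
body = json.dumps(payload)  # this JSON string is the POST body Splash receives
```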

Without the script, the crawl does not succeed; Meituan presumably blocks the default requests. The script itself is very simple, just adding some request headers.
