
Parsing dynamic JS pages with scrapy_splash


Dynamically loaded pages are generally tricky to crawl; the two common approaches are scrapy_splash and selenium. scrapy_splash seems to be the more capable of the two: at least for crawling Meituan, scrapy_splash worked while my selenium attempt did not. That may just be a configuration problem on the selenium side, since in principle either should work.

First you need to install Splash itself, which runs in Docker. Tutorials are easy to find online, and it boils down to two commands, assuming you are on Linux. Once it is running, visit http://127.0.0.1:8050/ to verify.
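For reference, the two commands mentioned above are usually just pulling and running the official Splash image (assuming Docker is already installed):

```shell
# Pull the official Splash image from Docker Hub
docker pull scrapinghub/splash
# Run Splash in the background, exposing its HTTP API on port 8050
docker run -d -p 8050:8050 scrapinghub/splash
```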

Then install the Python module: pip3 install scrapy-splash

scrapy-splash needs a Lua script to drive Splash, for example to add parameters or request headers. Look up the exact syntax online if needed; I am not too familiar with it myself.

After creating the project, a few options need to be set in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # scrapy_splash middlewares
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Splash server address (required by scrapy-splash)
SPLASH_URL = 'http://127.0.0.1:8050'

# scrapy_splash deduplication class
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# finally, configure a cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Then write the spider, which is also simple: issue a SplashRequest and point it at the script to execute, and the response comes back with the dynamically loaded page fully rendered.

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash, args)
    assert(splash:wait(0.5))
    splash:set_custom_headers({
        ['Accept'] = '*/*',
        ['Accept-Language'] = 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        ['Cache-Control'] = 'max-age=0',
        ['Connection'] = 'keep-alive',
        ['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    })
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(10))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""

class MeituanspiderSpider(scrapy.Spider):
    name = 'meituanSpider'
    # allowed_domains = ['zz.meituan.com']
    start_urls = ['http://zz.meituan.com/meishi/pn/']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0], callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script, 'wait': 7})

    def parse(self, response):
        # shop names
        names = response.xpath('//*[@id="app"]/section/div/div[2]/div[2]/div[1]/ul/li/div[2]/a/h4/text()').extract()
        for name in names:
            print(name)
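Under the hood, the args passed to SplashRequest become the JSON body that gets POSTed to Splash's /execute endpoint, with lua_source carrying the script. A minimal sketch of that mapping, using a hypothetical build_payload helper just to make the mechanics concrete:

```python
import json

def build_payload(url, lua_source, wait=7):
    # Mirrors roughly what scrapy-splash sends to http://127.0.0.1:8050/execute:
    # the target url, the Lua script, and any extra args such as wait.
    return {"url": url, "lua_source": lua_source, "wait": wait}

payload = build_payload("http://zz.meituan.com/meishi/pn/",
                        "function main(splash, args) return splash:html() end")
body = json.dumps(payload)  # this JSON string is the POST body Splash receives
```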

Without the script, the crawl does not succeed; Meituan presumably blocks the default requests. The script itself is very simple, just adding some request headers.
