Web Scraping with scrapy-splash

阿新 • Published: 2019-01-12

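What is Splash

Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python and built on Twisted and QT. Twisted (together with QT) gives the service asynchronous processing, so that it can make use of WebKit's concurrency.

To speed up page loading, many parts of today's pages are generated by JavaScript. That is a real problem for a Scrapy crawler: Scrapy has no JS engine, so it only fetches the static page and cannot see content that JavaScript generates dynamically.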

Solutions:

1. Use third-party middleware that provides a JS rendering service, such as scrapy-splash.

2. Use WebKit or a WebKit-based library.

The rest of this post covers how to use scrapy-splash:

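1. Install the scrapy-splash library with pip:

`pip install scrapy-splash`

2. scrapy-splash talks to Splash through the Splash HTTP API, so you need a running Splash instance. Splash is usually run in Docker, so install Docker first. For details see: https://www.jianshu.com/p/c5795d4c7e44

3. Start Docker. After a successful install there is a "Docker Quickstart Terminal" icon; double-click it to launch.

4. Note the part boxed in red in the original screenshot: it is the default IP assigned to you, and it is needed below. Docker is now installed and ready.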

5. Pull the image:

$ docker pull scrapinghub/splash

Once the pull finishes, the image is available locally and ready to run.

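6. Run the scrapinghub/splash service with Docker:

The official documentation gives this command for starting the Splash container: docker run -d -p 8050:8050 scrapinghub/splash. Even so, read the Splash documentation to learn the available startup options.

For example, I have to set the max-timeout option at startup. My JavaScript operations run long and can easily exceed the default timeout, so to be safe I set it to 3600 (one hour). If your JavaScript operations are short, do not raise max-timeout without a reason.

$ docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600

The first start is slow because Splash has to load a few things. Starting it repeatedly can produce a message (shown as a screenshot in the original post).

When that happens, close the current window, end the leftover processes in the process manager, and start again.

Reopen Docker Quickstart Terminal, then enter: docker run -p 8050:8050 scrapinghub/splash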
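Before moving on, it can help to confirm from Python that the container is actually answering. The sketch below is one possible check, not part of the original post: it assumes Splash's `_ping` endpoint returns a JSON body with a `status` field, and the helper names `ping_url` and `splash_is_up` are made up for illustration.

```python
import json
from urllib.request import urlopen


def ping_url(splash_url):
    """Build the address of Splash's _ping endpoint from the base URL."""
    return splash_url.rstrip("/") + "/_ping"


def splash_is_up(splash_url, timeout=5.0):
    """Return True if Splash answers the ping with status "ok".

    Any connection problem or unexpected body is treated as "not up".
    """
    try:
        with urlopen(ping_url(splash_url), timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and URLError;
        # ValueError covers a body that is not valid JSON.
        return False
```

Calling `splash_is_up('http://192.168.99.100:8050')` after `docker run` should then tell you whether the port mapping and IP are right before any spider is involved.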

7. Configure the Splash service (everything below goes in settings.py):

1) Add the Splash server address (SPLASH_URL).

2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES.

3) Enable SplashDeduplicateArgsMiddleware.

4) Set a custom DUPEFILTER_CLASS.

5) Set a custom cache storage backend (HTTPCACHE_STORAGE).

In settings.py, add the following:

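# URL of the rendering service
SPLASH_URL = 'http://192.168.99.100:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Spider middleware that deduplicates Splash arguments (step 3 above)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'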
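The address 192.168.99.100 is whatever Docker assigned on the author's machine, so hard-coding it can break on another setup. One way around that, sketched here as an assumption rather than anything scrapy-splash requires, is to let settings.py read the address from an environment variable. The variable name `SPLASH_URL` and the helper `resolve_splash_url` are hypothetical.

```python
import os

# The Docker address used throughout this post; treated here as a fallback.
DEFAULT_SPLASH_URL = 'http://192.168.99.100:8050'


def resolve_splash_url(environ=os.environ):
    """Return the Splash address, preferring a SPLASH_URL environment variable."""
    return environ.get('SPLASH_URL', DEFAULT_SPLASH_URL).rstrip('/')


# In settings.py this line would replace the hard-coded SPLASH_URL.
SPLASH_URL = resolve_splash_url()
```

With this in place, running `SPLASH_URL=http://localhost:8050 scrapy crawl scrapy_splash` would point the spider at a different Splash instance without editing the file.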

8. Scraping

The fields to scrape are boxed in the screenshot in the original post, which also shows the corresponding HTML. Each field is listed below with the XPath used for it.

1) JD price:

Scraping code: prices = site.xpath('//span[@class="p-price"]/span/text()')

2) Promotion

Scraping code: cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')

3) Value-added services

Scraping code: value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')

4) Weight

Scraping code: quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')

5) Color options

Scraping code: colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')

6) Version options

Scraping code: versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')

7) Purchase method

Scraping code: buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')

8) Bundles

Scraping code: suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')

9) Extended protection

Scraping code: vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')

10) Baitiao installments

Scraping code: stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
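To see what two of these expressions select without running Splash at all, here is a small offline sketch. It deliberately uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of Scrapy's Selector, and the markup in `SAMPLE` is a made-up, heavily simplified stand-in for the real JD page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real product page is far larger.
SAMPLE = """
<div>
  <span class="p-price"><span>¥</span><span>1999.00</span></span>
  <div id="choose-attr-1">
    <div>Color</div>
    <div><div title="Black"/><div title="Gold"/></div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Counterpart of //span[@class="p-price"]/span/text():
# the price is split across two child spans, so the parts are joined.
price_parts = [s.text for s in root.findall(".//span[@class='p-price']/span")]
price = "".join(price_parts)

# Counterpart of //div[@id="choose-attr-1"]/div[2]/div/@title:
# take the title attribute of every option inside the second child div.
colors = [d.get("title")
          for d in root.findall(".//div[@id='choose-attr-1']/div[2]/div")]

print(price, colors)
```

The price example mirrors why the spider concatenates `prices[0]` and `prices[1]`: the currency sign and the number live in separate elements.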

9. Run the Splash service

The Splash service must be running before you scrape. Click the "Docker Quickstart Terminal" icon and run: docker run -p 8050:8050 scrapinghub/splash

10. Run scrapy crawl scrapy_splash

11. Scraped data

12. Complete source code

1) SplashSpider

# -*- coding: utf-8 -*-
import sys

from scrapy.selector import Selector
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest

from splash_test.items import SplashTestItem

# Send everything printed below to output.txt instead of the console.
sys.stdout = open('output.txt', 'w', encoding='utf-8')


class SplashSpider(Spider):
    name = 'scrapy_splash'
    start_urls = [
        'https://item.jd.com/2600240.html'
    ]

    # Each request has to be wrapped in a SplashRequest.
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                self.parse,
                                args={'wait': 0.5},
                                # endpoint='render.json',
                                )

    def parse(self, response):
        # This example scrapes a single JD product page whose price is
        # filled in by AJAX. To keep crawling, use CrawlSpider instead.
        site = Selector(response)
        it_list = []
        it = SplashTestItem()

        # JD price: the value is split across two spans, so join them.
        prices = site.xpath('//span[@class="p-price"]/span/text()')
        it['price'] = prices[0].extract() + prices[1].extract()
        print('JD price: ' + it['price'])

        # Promotion
        cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
        strcx = ''
        for cx in cxs:
            strcx += cx.extract() + ' '
        it['promotion'] = strcx
        print('Promotion: %s ' % strcx)

        # Value-added services
        value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
        strValueAdd = ''
        for va in value_addeds:
            strValueAdd += va.extract() + ' '
        print('Value-added services: %s ' % strValueAdd)
        it['value_add'] = strValueAdd

        # Weight
        quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
        print('Weight: %s ' % quality[0].extract())
        it['quality'] = quality[0].extract()

        # Color options
        colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
        strcolor = ''
        for color in colors:
            strcolor += color.extract() + ' '
        print('Color options: %s ' % strcolor)
        it['color'] = strcolor

        # Version options
        versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
        strversion = ''
        for ver in versions:
            strversion += ver.extract() + ' '
        print('Version options: %s ' % strversion)
        it['version'] = strversion

        # Purchase method
        buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
        print('Purchase method: %s ' % buy_style[0].extract())
        it['buy_style'] = buy_style[0].extract()

        # Bundles
        suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
        strsuit = ''
        for tz in suits:
            strsuit += tz.extract() + ' '
        print('Bundles: %s ' % strsuit)
        it['suit'] = strsuit

        # Extended protection
        vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
        strvaps = ''
        for vap in vaps:
            strvaps += vap.extract() + ' '
        print('Extended protection: %s ' % strvaps)
        it['value_add_protection'] = strvaps

        # Baitiao installments
        stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
        strstaging = ''
        for st in stagings:
            strstaging += st.extract().strip() + ' '
        print('Baitiao installments: %s ' % strstaging)
        it['staging'] = strstaging

        it_list.append(it)
        return it_list
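The spider repeats the same join loop for most fields and indexes `[0]` directly, which raises IndexError when an XPath matches nothing. A possible tidy-up, offered as a sketch rather than as the author's code, is a pair of small helpers that work on plain lists of strings (what `.extract()` returns when called on an XPath result). The names `join_texts` and `first_or_default` are invented for this example.

```python
def join_texts(values, sep=' '):
    """Join extracted strings, trimming whitespace and skipping empty entries."""
    cleaned = (v.strip() for v in values if v)
    return sep.join(v for v in cleaned if v)


def first_or_default(values, default=''):
    """Return the first extracted string (trimmed), or a default if none matched."""
    return values[0].strip() if values else default
```

Inside `parse` these could be used as `it['color'] = join_texts(colors.extract())` and `it['quality'] = first_or_default(quality.extract())`, so a missing block on the page yields an empty field instead of a crash.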

2) SplashTestItem

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SplashTestItem(scrapy.Item):
    # Unit price
    price = scrapy.Field()
    # description = Field()
    # Promotion
    promotion = scrapy.Field()
    # Value-added services
    value_add = scrapy.Field()
    # Weight
    quality = scrapy.Field()
    # Color options
    color = scrapy.Field()
    # Version options
    version = scrapy.Field()
    # Purchase method
    buy_style = scrapy.Field()
    # Bundles
    suit = scrapy.Field()
    # Extended protection
    value_add_protection = scrapy.Field()
    # Baitiao installments
    staging = scrapy.Field()
    # post_view_count = scrapy.Field()
    # post_comment_count = scrapy.Field()
    # url = scrapy.Field()

3) SplashTestPipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json

class SplashTestPipeline(object):
    def __init__(self):
        # self.file = open('data.json', 'wb')
        self.file = codecs.open(
            'spider.txt', 'w', encoding='utf-8')
        # self.file = codecs.open(
        #     'spider.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

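    # Scrapy calls close_spider (not spider_closed) on a pipeline when the
    # spider finishes, so the file is only closed under this name.
    def close_spider(self, spider):
        self.file.close()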
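For comparison, here is a minimal sketch of the same idea written around the two hooks Scrapy calls on item pipelines, `open_spider` and `close_spider`. The class name `JsonLinesPipeline` and its `path` parameter are assumptions for illustration; the pipeline is a plain class, so it can be tried without a running crawl.

```python
import json


class JsonLinesPipeline:
    """Write each item as one JSON line; open and close the file with the spider."""

    def __init__(self, path='spider.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Opening here (rather than in __init__) ties the file to the crawl.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Opening the file in `open_spider` means nothing is created on disk until a crawl actually starts.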

4) settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for splash_test project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
ITEM_PIPELINES = {
        'splash_test.pipelines.SplashTestPipeline':300
        }
BOT_NAME = 'splash_test'

SPIDER_MODULES = ['splash_test.spiders']
NEWSPIDER_MODULE = 'splash_test.spiders'

SPLASH_URL = 'http://192.168.99.100:8050'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'splash_test (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'splash_test.middlewares.SplashTestSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'splash_test.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'splash_test.pipelines.SplashTestPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

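1. Connect to Docker with SecureCRT

Download and install SecureCRT. In the connection dialog, enter the Docker machine's address (192.168.99.100 by default), username docker, password tcuser.

Installing and running Splash in Docker

1. Install Splash in Docker

After connecting to the Docker machine through SecureCRT, enter:

# Download the image from Docker Hub
sudo docker pull scrapinghub/splash

Note that Docker Hub's registry is hosted outside mainland China, so the download can take a long time. If that is a problem, use a proxy or another registry mirror.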

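2. Start the Splash service

Start Splash with docker run:

# Start Splash and expose it over HTTP, HTTPS and telnet.
# HTTP is usually all you need, so publishing port 8050 alone is enough.
# Splash listens on 0.0.0.0 at ports 8050 (http), 8051 (https) and 5023 (telnet).
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

Once the service is up, open 192.168.99.100:8050 in a browser to check that it started.

Enter www.baidu.com and click the Render me button, and the Baidu page rendered on the server side appears right away.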
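The same render can be requested without the web UI through Splash's `render.html` endpoint, which takes the target page and options as query parameters. The sketch below only builds that URL; the helper name `render_html_url` and the default values for `wait` and `timeout` are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def render_html_url(splash_url, page_url, wait=0.5, timeout=30):
    """Build a render.html request URL for the given page."""
    query = urlencode({'url': page_url, 'wait': wait, 'timeout': timeout})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


print(render_html_url('http://192.168.99.100:8050', 'http://www.baidu.com'))
# prints http://192.168.99.100:8050/render.html?url=http%3A%2F%2Fwww.baidu.com&wait=0.5&timeout=30
```

Opening the printed address in a browser, or fetching it with any HTTP client, should return the HTML of the page after JavaScript has run.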

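3. Using Splash

Splash can filter page requests itself, using rules in the same format as Adblock Plus. You can download Adblock Plus filter lists and use them directly, or write your own rules to block content you do not want downloaded (images, video, and so on) and so speed up loading and rendering. A common first step is to download the Adblock Plus rules and block ads.

# Mount a local directory as the filters directory inside the Splash container, for Adblock Plus-style ad filtering.
# <my-filters-dir> is a local folder on the Docker host, not on Windows.
# Also set the filters path to /etc/splash/filters.
$ docker run -p 8050:8050 -v <my-filters-dir>:/etc/splash/filters scrapinghub/splash  --filters-path=/etc/splash/filters

Figure: the Sina home page without filters loaded.

Figure: the Sina home page with filters applied.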
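Mounting the directory only makes the filter files available; a request still has to name the filters it wants through the `filters` argument. The sketch below assumes, as I understand the Splash documentation, that a filter's name is its file name without the `.txt` extension. The helper names and the example file names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlencode


def filter_names(filters_dir):
    """List filter names: the *.txt files in the directory, minus the extension."""
    return sorted(p.stem for p in Path(filters_dir).glob('*.txt'))


def render_url_with_filters(splash_url, page_url, names):
    """Build a render.html URL that asks Splash to apply the named filters."""
    query = urlencode({'url': page_url, 'filters': ','.join(names)})
    return f"{splash_url.rstrip('/')}/render.html?{query}"
```

For a directory holding `easylist.txt` and `noimages.txt`, `filter_names` would give `['easylist', 'noimages']`, and the same comma-separated value can be passed as `args={'filters': 'easylist,noimages'}` in a SplashRequest.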

Passing extra arguments with a Splash request

import random

import scrapy
from pymongo import MongoClient
from scrapy_splash import SplashRequest

# `agents` (a list of User-Agent strings) and `cookies` (a dict holding
# PHPSESSID and FEIZHIYI_LOGGED_USER) are not shown in this post and must be
# defined elsewhere, as must the parse_result callback.


class FlySpider(scrapy.Spider):
    name = "FlySpider"
    house_pc_index_url = 'xxxxx'

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding ours.
        super().__init__(*args, **kwargs)
        # The credentials and host in this URI are obscured in the source.
        client = MongoClient("mongodb://name:[email protected]:27017/myspace")
        db = client.myspace
        self.fly = db["fly"]

    def start_requests(self):
        for x in range(0, 1):
            try:
                script = """
                function process_one(splash)
                    splash:runjs("$('#next_title').click()")
                    splash:wait(1)
                    local content=splash:evaljs("$('.scrollbar_content').html()")
                    return content
                end
                function process_mul(splash,totalPageNum)
                    local res={}
                    for i=1,totalPageNum,1 do
                        res[i]=process_one(splash)
                    end
                    return res
                end
                function main(splash)
                    splash.resource_timeout = 1800
                    local tmp=splash:get_cookies()
                    splash:add_cookie('PHPSESSID', splash.args.cookies['PHPSESSID'],"/", "www.feizhiyi.com")
                    splash:add_cookie('FEIZHIYI_LOGGED_USER', splash.args.cookies['FEIZHIYI_LOGGED_USER'],"/", "www.feizhiyi.com" )
                    splash:autoload("http://cdn.bootcss.com/jquery/2.2.3/jquery.min.js")
                    assert(splash:go{
                        splash.args.url,
                        http_method=splash.args.http_method,
                        headers=splash.args.headers,
                    })
                    assert(splash:wait(splash.args.wait) )
                    return {res=process_mul(splash,100)}

                end
                """
                agent = random.choice(agents)
                print("------cookie---------")
                headers = {
                    "User-Agent": agent,
                    "Referer": "xxxxxxx",
                }
                # Everything in this dict is available to the Lua script as splash.args.
                splash_args = {
                    'wait': 3,
                    "http_method": "GET",
                    # "images": 0,
                    "timeout": 1800,
                    "render_all": 1,
                    "headers": headers,
                    'lua_source': script,
                    "cookies": cookies,
                    # "proxy": "http://101.200.153.236:8123",
                }
                yield SplashRequest(self.house_pc_index_url, self.parse_result,
                                    endpoint='execute', args=splash_args, dont_filter=True)
                # +"&page="+str(x+1)
            except Exception as e:
                print(e.__doc__)
                print(e)

Scroll-to-load with scrapy-splash

Splash scripts (Lua) that scroll the page down to trigger lazy loading

Method 1
function main(splash, args)  
  splash:set_viewport_size(1028, 10000)  
  splash:go(args.url)  
  local scroll_to = splash:jsfunc("window.scrollTo")  
  scroll_to(0, 2000)  
  splash:wait(5)  
  return {png=splash:png()}  
end 

Method 2
function main(splash, args)  
  splash:set_viewport_size(1028, 10000)  
  splash:go(args.url)  
  splash.scroll_position={0,2000}  
  splash:wait(5)  
  return {png=splash:png()}  
end

Scroll-to-load in the spider

def start_requests(self):  
    script = """ 
            function main(splash) 
                splash:set_viewport_size(1028, 10000) 
                splash:go(splash.args.url) 
                local scroll_to = splash:jsfunc("window.scrollTo") 
                scroll_to(0, 2000) 
                splash:wait(15) 
                return { 
                    html = splash:html() 
                } 
            end 
            """  

    for url in self.start_urls:  
        yield Request(url,callback=self.parse_info_index,meta = {  
            'dont_redirect': True,  
            'splash':{  
                'args':{'lua_source':script,'images':0},  
                'endpoint':'execute',  

            }  
        })
