Copy this code, and Mom never has to worry about me failing to scrape data again

阿新 • Published: 2019-12-31

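To be upfront, the title is a bit of clickbait. Not all data can be scraped this way, only data that can be obtained by normal means, that is, by visiting a page in a browser.

There is also nothing magic about this code. The core technologies it uses are all open source: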

GoogleChrome/puppeteer

miyakogi/pyppeteer

It uses puppeteer's page.on('event', callback) API:

page.on('response',intercept_response)

Whenever the browser receives a response, this callback is invoked. That alone gives it the potential to capture all the data the page loads.
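
To make the callback mechanism concrete, here is a minimal sketch of the register-then-dispatch pattern that page.on relies on. `MiniEmitter` is a hypothetical stand-in written for illustration, not pyppeteer's implementation, and the example.com URLs are placeholders. It only shows how an async handler registered for 'response' ends up being called once per event.

```python
import asyncio
from collections import defaultdict


class MiniEmitter:
    """Hypothetical stand-in for an event emitter such as a browser page."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, callback):
        # Register a callback for the named event.
        self._handlers[event].append(callback)

    async def emit(self, event, payload):
        # Call every registered callback; await the ones that are coroutines.
        for callback in self._handlers[event]:
            result = callback(payload)
            if asyncio.iscoroutine(result):
                await result


async def demo():
    seen = []

    async def intercept_response(res):
        seen.append(res["url"])

    page = MiniEmitter()
    page.on("response", intercept_response)
    # Simulate the browser receiving two responses.
    await page.emit("response", {"url": "https://example.com/api/a"})
    await page.emit("response", {"url": "https://example.com/style.css"})
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The handler sees every response, which is why a separate filtering step is needed before doing anything expensive with one.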

0x01 Code first

As for why I am not using node: purely because I am not fluent in node.

import asyncio
from pyppeteer import launch
from pyppeteer.network_manager import Response
from pyppeteer.page import Page

async def crawl_response(
        start_url: str,
        actions: list,
        response_match_callback: callable = None,
        response_handle_callback: callable = None,
):
    """

    A highly abstracted function that can perform almost any puppeteer-based spider task,
    regardless of any js encryption.

    :param start_url: initial page url
    :param actions: a list of actions to perform on a Page object; each action should accept exactly one argument.

        for example:

        async def click_by_xpath(page: Page):
            await asyncio.sleep(3)
            elemHandlers = await page.xpath(xpath)
            elemHandler = elemHandlers[0]
            await elemHandler.click()

    :param response_match_callback: a callback that decides whether a response should be handled.
        this function should be an `async` function and accept exactly one argument.

        for example:

        1. match all responses
            async def response_match_callback(res: Response):
                return True
        2. match responses with 'api' in their url
            async def response_match_callback(res: Response):
                return "api" in res.url
        3. match all xhr and fetch responses
            async def response_match_callback(res: Response):
                resourceType = res.request.resourceType
                return resourceType in ['xhr', 'fetch']

    :param response_handle_callback: handles each response accepted by response_match_callback.
        this function should be an `async` function and accept exactly one argument.

        for example:

        1. simply print response text
        async def response_handle_callback(res: Response):
            text = await res.text()
            print(text)

        2. save response to filesystem
        async def response_handle_callback(res: Response):
            text = await res.text()
            with open('example.json','w',encoding = 'utf-8') as f:
                f.write(text)
    :return:
    """ 


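    async def intercept_response(res: Response):
        # Nothing to do unless both callbacks were supplied.
        if not (response_match_callback and response_handle_callback):
            return
        match = response_match_callback(res)
        # Tolerate plain functions as well as coroutine functions.
        if asyncio.iscoroutine(match):
            match = await match
        if match:
            await response_handle_callback(res)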

    browser = await launch({
        'headless': False,
        'devtools': False,
        'args': [
            '--start-maximized',
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-gpu',
            '--disable-infobars',
            '--enable-automation',
        ],
        'dumpio': True,
    })

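    try:
        page = await browser.newPage()
        # Register the listener before navigating so that responses
        # received during the initial page load are intercepted too.
        page.on('response',intercept_response)
        await page.goto(start_url)
        for task_action in actions:
            await task_action(page)
    finally:
        await browser.close()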


If the code is hard to read on your device, you can look at the image below instead:

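0x02 Comparing the two crawler approaches

Mindset

Traditional web crawler: Developer Tools → Network → analyze the requests → if there are encrypted parameters, set breakpoints in the JS → replay the requests in a scripting language such as Python.

webdriver-based crawler: define callbacks that filter and handle responses → visit the start URL → perform a series of actions on the page (click, navigate, scroll, and so on).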
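
To show what "replaying the requests" involves on the traditional route, here is a minimal sketch of signing request parameters. The scheme (MD5 over the sorted query string plus a secret), the `sign` field name, the secret, and the example.com URL are all hypothetical and chosen only for illustration. They do not describe any real site, where the scheme first has to be reverse-engineered from the JS.

```python
import hashlib
from urllib.parse import urlencode


def sign_params(params: dict, secret: str) -> dict:
    """Return a copy of params with a hypothetical 'sign' field added."""
    # Canonical form: keys sorted, then URL-encoded.
    canonical = urlencode(sorted(params.items()))
    digest = hashlib.md5((canonical + secret).encode("utf-8")).hexdigest()
    return {**params, "sign": digest}


def build_url(base: str, params: dict, secret: str) -> str:
    # The signed parameters become the query string of the replayed request.
    return base + "?" + urlencode(sign_params(params, secret))


if __name__ == "__main__":
    print(build_url("https://example.com/api/feed", {"page": 1}, "not-a-real-secret"))
```

If the site changes the canonical form, the hash, or the secret, this code silently produces rejected requests. That is the maintenance cost the browser-driven approach avoids.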

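Pros and cons

Traditional web crawler

Pros:

Once the encrypted parameters are cracked, crawling is more efficient.
Task scheduling and code organization are simpler and more flexible.

Cons:

The up-front analysis is very time-consuming, and there is no guarantee the encrypted parameters can be cracked.
If the encryption scheme changes one day, it has to be cracked all over again.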

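webdriver-based crawler

Pros:

It sidesteps JS encryption entirely: any data an ordinary user can reach in the browser can be captured without further analysis. Even if the site changes its JS encryption, nothing breaks.
It is simple to write: you only need to define a series of page actions.
It can save all the data received during the requests in one go, which is convenient later on.

Cons:

Lower efficiency is a given.
When a headful browser is required, deployment becomes a hassle.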

0x03 Analyzing the code above

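async def crawl_response(
        start_url: str,
        actions: list,
        response_match_callback: callable = None,
        response_handle_callback: callable = None,
):
    pass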

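It takes four parameters:

start_url: the first URL to visit
actions: a series of actions to perform on the page
response_match_callback: a function that takes a response and decides whether that response should be handled
response_handle_callback: a function that does the actual handling of a response, for example saving it to a database.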
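
The way the two callbacks cooperate can be tried without launching a browser. The sketch below assumes a hypothetical `StubResponse` that exposes only the two members the callbacks use (`url` and an async `text()`), and a `dispatch` helper that mirrors the match-then-handle decision. Neither name comes from pyppeteer, and the URLs are placeholders.

```python
import asyncio


class StubResponse:
    """Hypothetical stand-in exposing the two members the callbacks use."""

    def __init__(self, url: str, body: str):
        self.url = url
        self._body = body

    async def text(self) -> str:
        return self._body


async def dispatch(res, match_callback, handle_callback):
    # The same decision the crawler makes for every response it receives.
    if await match_callback(res):
        await handle_callback(res)


async def demo():
    collected = []

    async def is_api(res):
        return "/api/" in res.url

    async def collect_text(res):
        collected.append(await res.text())

    responses = [
        StubResponse("https://example.com/api/feed", '{"items": []}'),
        StubResponse("https://example.com/logo.png", "binary"),
    ]
    for res in responses:
        await dispatch(res, is_api, collect_text)
    return collected


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the match step cheap (a URL check) and the handle step separate means the body is only read for responses you actually want.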

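You will find that everything we do in a browser in daily life can be summed up as: visit a start page, then perform a series of actions.

And puppeteer is able to save all the data received along the way.

Put these two points together and this function becomes a general-purpose method.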
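
The "series of actions" half of the abstraction can be sketched in isolation. Each action is a one-argument async callable produced by a factory that binds its own parameters. `RecordingPage`, `scroll_by`, and `run_actions` are hypothetical names used only for this illustration; the stub records scripts instead of driving a real browser.

```python
import asyncio


class RecordingPage:
    """Hypothetical stub that records scripts instead of driving a browser."""

    def __init__(self):
        self.scripts = []

    async def evaluate(self, script: str):
        self.scripts.append(script)


def scroll_by(pixels: int):
    # Factory: binds the argument now, returns a one-argument action.
    async def _scroll(page):
        await page.evaluate("window.scrollBy(0, {});".format(pixels))

    return _scroll


async def run_actions(page, actions):
    # Actions run strictly in order, each receiving the same page.
    for action in actions:
        await action(page)


async def demo():
    page = RecordingPage()
    await run_actions(page, [scroll_by(100), scroll_by(250)])
    return page.scripts


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because every action has the same shape, clicks, scrolls, and waits can be mixed freely in one list.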

OK, enough talk. There is nothing deep left to explain, so let's move on to practice.

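0x04 Practice 1: scraping the Toutiao (今日頭條) home page feed

Toutiao (今日頭條)

If you open Developer Tools and inspect the network requests, it is not hard to see that the feed API is this: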

https://www.toutiao.com/api/pc/feed/

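You will then clearly see that the request parameters contain encrypted fields:

Let's analyze it with the function described above:

First, the start URL is https://www.toutiao.com/
Second, clicking the tabs in the sidebar is enough to trigger the network requests that return the feed data we want.

That's right, it really is that simple: just two steps.

Here is the complete code: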

async def main():
    # Scrape the feed data of https://www.toutiao.com/
    start_url = "https://www.toutiao.com/"

    def click_by_xpath(xpath: str):
        async def _click(page: Page):
            await asyncio.sleep(3)
            elemHandlers = await page.xpath(xpath)
            elemHandler = elemHandlers[0]
            await elemHandler.click()

        return _click

    actions = []
    tabs = ["推薦","熱點","科技","娛樂","遊戲","體育","財經","搞笑"]
    for tabname in tabs:
        xpath = "//span[contains(text(),'{}')]".format(tabname)
        action = click_by_xpath(xpath)
        actions.append(action)

    async def is_feed_api(res: Response):
        return "api/pc/feed" in res.url

    async def print_response_text(res: Response):
        text = await res.text()
        print(text)

    await crawl_response(start_url,actions,is_feed_api,print_response_text)


if __name__ == '__main__':
    asyncio.run(main())
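
The example above only prints each response. As a sketch of persisting them instead, the handler factory below writes every response body to its own file, so later responses do not overwrite earlier ones. `save_to_dir`, the file naming scheme, and `StubResponse` are assumptions made for this illustration, and the blocking file write is kept for brevity.

```python
import asyncio
import hashlib
import os
import tempfile


def save_to_dir(directory: str):
    """Return an async handler that writes each response body to its own file."""
    counter = 0

    async def _save(res):
        nonlocal counter
        counter += 1
        text = await res.text()
        # File name: running index plus a short hash of the URL.
        url_hash = hashlib.sha1(res.url.encode("utf-8")).hexdigest()[:8]
        path = os.path.join(directory, "{:04d}_{}.json".format(counter, url_hash))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return path

    return _save


class StubResponse:
    """Hypothetical stand-in with the members the handler uses."""

    def __init__(self, url: str, body: str):
        self.url = url
        self._body = body

    async def text(self) -> str:
        return self._body


async def demo():
    with tempfile.TemporaryDirectory() as tmp:
        handler = save_to_dir(tmp)
        res = StubResponse("https://example.com/api/pc/feed", '{"ok": true}')
        await handler(res)
        await handler(res)  # Same URL again: still gets a separate file.
        return sorted(os.listdir(tmp))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A handler built this way could be passed as response_handle_callback in place of print_response_text.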


0x05 Practice 2: scraping T-shirt product data from Pinduoduo (拼多多)

Pinduoduo (拼多多) mall

Analysis:

Start URL: https://mobile.yangkeduo.com/catgoods.html?refer_page_name=index&opt_id=1274&opt_name=T恤&opt_type=2
Keep scrolling down

Complete code:

import time


async def crawl_pdd():
    start_url = "https://mobile.yangkeduo.com/catgoods.html?refer_page_name=index&opt_id=1274&opt_name=T%E6%81%A4&opt_type=2"

    actions = []

    def scroll_down(amount: int, secs: int):
        async def _scroll_down(page: Page):
            start = time.time()
            while True:
                # scrollBy(x, y): the vertical offset is the second argument
                await page.evaluate("window.scrollBy(0, {});".format(amount))
                await asyncio.sleep(2)
                if time.time() - start >= secs:
                    break

        return _scroll_down

    # Keep scrolling down for 60 s
    actions.append(scroll_down(500, 60))

    # An async matcher, as the crawl_response docstring requires
    async def is_goods_api(res: Response):
        return 'subfenlei_gyl_label' in res.url

    async def print_response_text(res: Response):
        text = await res.text()
        print(text)

    await crawl_response(start_url, actions, is_goods_api, print_response_text)


if __name__ == '__main__':
    asyncio.run(crawl_pdd())

⚠️ This is only a small demo. It is quite normal if it does not run as-is; some details you will have to work out yourself.
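
One detail that commonly breaks demos like these is an XPath that matches nothing, in which case indexing the result with [0] raises IndexError. Below is a minimal sketch of a more forgiving click action. `click_if_present`, `StubPage`, and `StubElement` are hypothetical names for this illustration; the stubs only imitate the `xpath()` and `click()` calls used in the examples above.

```python
import asyncio


def click_if_present(xpath: str, delay: float = 0.0):
    """Return an action that clicks the first XPath match, if there is one."""

    async def _click(page) -> bool:
        await asyncio.sleep(delay)
        handles = await page.xpath(xpath)
        if not handles:
            # Nothing matched: skip instead of raising IndexError.
            return False
        await handles[0].click()
        return True

    return _click


class StubElement:
    """Hypothetical element that just remembers whether it was clicked."""

    def __init__(self):
        self.clicked = False

    async def click(self):
        self.clicked = True


class StubPage:
    """Hypothetical stub: maps XPath strings to lists of elements."""

    def __init__(self, elements):
        self._elements = elements

    async def xpath(self, expression):
        return self._elements.get(expression, [])


async def demo():
    button = StubElement()
    page = StubPage({"//span[@id='tab']": [button]})
    hit = await click_if_present("//span[@id='tab']")(page)
    miss = await click_if_present("//span[@id='gone']")(page)
    return hit, miss, button.clicked


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning a boolean rather than raising lets the remaining actions in the list keep running when one tab is missing.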

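0x06 Summary

With this article I hope to leave you with two takeaways:

A new crawling technique: when cracking the JS gets you nowhere, give this approach a try.
Learn to abstract, turning complex problems into simple ones. This function essentially abstracts the act of visiting pages with a browser into two steps: visit the start page, then perform a series of actions on it. Almost all browsing behavior can be described by these two steps (if you know of any that cannot, please let me know), which shows how much ground this interface's abstraction covers.

You are also welcome to follow the examples and apply the function to other sites.

PS: If you have good ideas, or just want to make a friend, feel free to add me on WeChat and chat.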