Browsermob-Proxy (Selenium): Capturing HAR Data from the Browser While Scraping (with Example)

阿新 • Published: 2021-02-05

Tags: web scraping

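When you scrape pages with Selenium, problems such as asynchronous loading, JavaScript encryption, and dynamic cookies are easy to deal with, because a real browser loads the page and submits the forms for you.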

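Many sites exchange data as JSON. Parsing that JSON directly gives you more complete data and is much simpler than parsing the HTML page. In addition, key information returned by an API often never appears in the rendered HTML, so the page source you get from Selenium does not contain it.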

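Selenium with WebDriver can locate DOM elements, drive the page, and fetch the page source, but it only sees the "result": it has no access to the API requests the browser makes. If we could capture the request and response of every call, the way the Network panel in the browser's developer tools does, we could get at that key information.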

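This article uses BrowserMob-Proxy to solve the problem. WebDriver is pointed at a proxy for all network access, and the requests and responses are then collected on the proxy side. The proxy plays a role similar to the packet-capture tool Fiddler.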

1. Install Browsermob-Proxy

(1) pip3 install BrowserMob-Proxy

(2) Download the Java BrowserMob-Proxy package: http://bmp.lightbody.net/

(3) Install a Java 8 runtime
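
Before starting the proxy, it can help to confirm that the last two steps worked. The following is a minimal sketch, not part of the original setup: the helper name missing_prerequisites and the example path are hypothetical, and it only checks that a java executable is on PATH and that the launcher script from the downloaded package exists.

```python
import os
import shutil


def missing_prerequisites(bmp_path):
    """Return a list of readable problems; an empty list means nothing was found missing."""
    problems = []
    # The BrowserMob-Proxy launcher script starts a JVM, so `java` must be on PATH.
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    # The launcher lives in the bin/ directory of the unpacked package.
    if not os.path.isfile(bmp_path):
        problems.append("launcher not found: {}".format(bmp_path))
    return problems


# Hypothetical location; point this at your own unpacked package.
for problem in missing_prerequisites("browsermob-proxy-2.1.4/bin/browsermob-proxy"):
    print(problem)
```

An empty result only means the two files were found; it does not check the Java version.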

2. Hands-on example

This example uses Baidu. Selenium, WebDriver, and Browsermob-Proxy work together to capture the data returned by the site's API calls.

from browsermobproxy import Server
from selenium import webdriver
import time
import pprint

class ProxyManger:

    __BMP = "D:/AzRjN/browsermob_proxy/browsermob-proxy-2.1.4/bin/browsermob-proxy.bat"

    def __init__(self):

        self.__server = Server(ProxyManger.__BMP)
        self.__client = None

    def start_server(self):
        self.__server.start()
        return self.__server

    def start_client(self):

        self.__client = self.__server.create_proxy(params={"trustAllServers": "true"})
        return self.__client

    @property
    def client(self):
        return self.__client

    @property
    def server(self):
        return self.__server

if __name__=="__main__":
    # Start the proxy server and create a proxy client
    proxy = ProxyManger()
    server = proxy.start_server()
    client = proxy.start_client()

    # Configure WebDriver to send its traffic through the proxy
    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server={}".format(client.proxy))
    options.add_argument('--ignore-certificate-errors')
    chromePath = r"D:\AzRjN\anaconda3_7\envs\demo36\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe"
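    # Note: Selenium 4.10+ no longer accepts executable_path; there, pass
    # service=Service(chromePath) from selenium.webdriver.chrome.service instead.
    driver = webdriver.Chrome(executable_path=chromePath, options=options)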

    # Start recording a new HAR, then load the page
    client.new_har("baidu.com")
    driver.get("https://www.baidu.com/")
    # Give asynchronous requests time to finish
    time.sleep(3)

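    # Read the recorded HAR (a dict) and print it
    newHar = client.har
    pprint.pprint(newHar)

    # Close the browser first, then shut down the proxy server
    driver.quit()
    server.stop()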
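
Printing the whole HAR produces a lot of output. As a rough sketch of a more readable view, the helper below (summarize_har, a name made up for this illustration) reduces a HAR dict to one line per request. It assumes the standard HAR layout of log.entries[].request and log.entries[].response, and it runs here on a hand-made sample rather than on real captured traffic.

```python
def summarize_har(har):
    """Return a (method, status, url) tuple for every entry in a HAR dict."""
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        rows.append((request.get("method"), response.get("status"), request.get("url")))
    return rows


# Hand-made stand-in for client.har; the URLs are placeholders.
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://example.com/"},
             "response": {"status": 200}},
            {"request": {"method": "POST", "url": "https://example.com/api/login"},
             "response": {"status": 302}},
        ]
    }
}

for method, status, url in summarize_har(sample_har):
    print(method, status, url)
```

In the script above, the same function could be called as summarize_har(client.har) in place of the pprint call.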

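From the HAR you can see every request the browser made; filter out the data API calls and you are done. The structure you get matches what the Network panel in the browser's developer tools shows.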
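
The filtering step might look like the sketch below. The function name json_responses, the url_keyword parameter, and the sample URLs are all made up for this illustration. It assumes the standard HAR fields response.content.mimeType, text, and encoding. Note that BrowserMob-Proxy only records response bodies when asked to, for example with client.new_har("baidu.com", options={"captureContent": True}); without that option the text field is usually missing and this filter returns nothing.

```python
import base64
import json


def json_responses(har, url_keyword=""):
    """Yield (url, parsed_body) for entries whose response body is JSON."""
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        content = entry.get("response", {}).get("content", {})
        if url_keyword not in url:
            continue
        if "json" not in content.get("mimeType", ""):
            continue
        text = content.get("text")
        if not text:
            continue  # body was not captured
        # HAR marks encoded bodies with encoding == "base64"
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8")
        try:
            yield url, json.loads(text)
        except ValueError:
            continue  # labelled as JSON but not parseable


# Hand-made stand-in for client.har; the URLs are placeholders.
sample_har = {"log": {"entries": [
    {"request": {"url": "https://example.com/index.html"},
     "response": {"content": {"mimeType": "text/html", "text": "<html></html>"}}},
    {"request": {"url": "https://example.com/api/items"},
     "response": {"content": {"mimeType": "application/json",
                              "text": '{"items": [1, 2]}'}}},
    {"request": {"url": "https://example.com/api/status"},
     "response": {"content": {"mimeType": "application/json",
                              "encoding": "base64",
                              "text": base64.b64encode(b'{"ok": true}').decode("ascii")}}},
]}}

for url, body in json_responses(sample_har, url_keyword="/api/"):
    print(url, body)
```

Matching on a URL keyword is only one possible filter; matching on the request method or on a response header would work the same way.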
