Powered by HtmlUnit: The Super Evolution of a Little Web Spider

阿新 • Published: 2018-12-18

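Preface

　　A while ago I wrote an online novel scraper and reader (see https://www.cnblogs.com/huanzi-qch/p/9817831.html). When we tried to scrape a novel's catalog from Qidian (起點網), we found that the catalog data is not in the HTML. It is fetched by an AJAX request when the page loads, and the corresponding div stays hidden until you click "目錄" (Catalog). After some digging we did find the API URL and fetched the data by building a POST request with HttpClient, but that approach is tedious and costly. Is there another way?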

A brief introduction to HtmlUnit

　　The following introduction is quoted from the official site:

　　HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

　　It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating Chrome, Firefox or Internet Explorer depending on the configuration used.

　　It is typically used for testing purposes or to retrieve information from web sites.

　　HtmlUnit is not a generic unit testing framework. It is specifically a way to simulate a browser for testing purposes and is intended to be used within another testing framework such as JUnit or TestNG. Refer to the document "Getting Started with HtmlUnit" for an introduction.

　　HtmlUnit is used as the underlying "browser" by different Open Source tools like Canoo WebTest, JWebUnit, WebDriver, JSFUnit, WETATOR, Celerity, Spring MVC Test HtmlUnit, ...

　　HtmlUnit was originally written by Mike Bowler of Gargoyle Software and is released under the Apache 2 license. Since then, it has received many contributions from other developers, and would not be where it is today without their assistance.

　　HtmlUnit provides excellent JavaScript support, simulating the behavior of the configured browser (Firefox or Internet Explorer). It uses the Rhino JavaScript engine for the core language (plus workarounds for some Rhino bugs) and provides the implementation for the objects specific to execution in a browser.

　　In short: HtmlUnit is a headless browser for Java programs. It models HTML documents, runs their JavaScript on the Rhino engine, and exposes an API for loading pages, filling in forms, and clicking links, which makes it useful both for testing and for retrieving information from web sites.

Writing the code

　　Maven dependency:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.32</version>
</dependency>

　　For the catalog we fetched earlier, we can now do this:

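import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// The class name is illustrative; the original snippet showed only the try block.
public class CatalogSpider {
    public static void main(String[] args) {
        // Create a WebClient that simulates a specific browser;
        // try-with-resources closes it when we are done
        try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {

            // Key settings
            webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
            // Re-synchronize AJAX calls so that user-triggered requests finish before we continue
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(true); // throw on failing HTTP status codes
            webClient.getOptions().setThrowExceptionOnScriptError(true); // throw on JavaScript errors
            webClient.getOptions().setCssEnabled(false); // disable CSS: nothing is rendered, so it is not needed
            webClient.getOptions().setTimeout(10000); // connection timeout, in milliseconds

            // Load the book detail / catalog page on Qidian
            HtmlPage page = webClient.getPage("https://book.qidian.com/info/1209977");

            // Wait up to 5 seconds for background JavaScript to finish
            webClient.waitForBackgroundJavaScript(5000);

            // Simulate a click on "目錄" (Catalog)
            page = page.getHtmlElementById("j_catalogPage").click();

            // Print the page source
            System.out.println(page.asXml());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}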
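
　　The fixed `waitForBackgroundJavaScript(5000)` call waits the same amount of time whether the data arrives early or late. One possible alternative is to poll for a condition until it holds or a deadline passes. The helper below is only a sketch of that idea using the JDK alone; `PollUntil`, `pollUntil`, and the counter used as a stand-in condition are hypothetical names, not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

public class PollUntil {

    /**
     * Evaluates the condition repeatedly until it returns true or the
     * timeout elapses, sleeping intervalMillis between attempts.
     * Returns true if the condition was met, false on timeout or interrupt.
     */
    public static boolean pollUntil(BooleanSupplier condition,
                                    long timeoutMillis,
                                    long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the catalog has been rendered": true on the third check
        int[] checks = {0};
        boolean ready = pollUntil(() -> ++checks[0] >= 3, 2_000, 50);
        System.out.println("ready=" + ready + " after " + checks[0] + " checks");
    }
}
```

　　With HtmlUnit, the condition could be a check that `page.getByXPath(...)` returns a non-empty list for the catalog container. The right XPath depends on the site's markup, which this post does not show, so it would have to be worked out from the actual page.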

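Results

　　Before the JavaScript runs, the page source does not contain the catalog data.

　　After the JavaScript has run and the request has rendered the data, we fetch the page source again, and this time we get HTML that includes the catalog data.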
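
　　Once the catalog is in the page source, the chapter links still have to be pulled out of it. The sketch below shows one way to do that with the JDK's built-in XML parser and XPath, run against a small hard-coded sample. The `CatalogExtractor` class, the `catalog` id, and the sample markup are assumptions for illustration and do not reflect Qidian's real page structure.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CatalogExtractor {

    /** Returns "title -> href" for each link under the (hypothetical) catalog container. */
    public static List<String> extractChapters(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumed structure: <div id="catalog"> ... <a href="...">title</a> ...
            NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@id='catalog']//a", doc, XPathConstants.NODESET);
            List<String> chapters = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element a = (Element) links.item(i);
                chapters.add(a.getTextContent().trim() + " -> " + a.getAttribute("href"));
            }
            return chapters;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse catalog markup", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample standing in for the output of page.asXml()
        String sample = "<html><body><div id=\"catalog\"><ul>"
                + "<li><a href=\"/chapter/1\">Chapter 1</a></li>"
                + "<li><a href=\"/chapter/2\">Chapter 2</a></li>"
                + "</ul></div></body></html>";
        extractChapters(sample).forEach(System.out::println);
        // prints:
        // Chapter 1 -> /chapter/1
        // Chapter 2 -> /chapter/2
    }
}
```

　　Real `asXml()` output may not always parse as strict XML. In that case, querying the `HtmlPage` directly with its own `getByXPath` method is likely the more robust route.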
