Powered by HtmlUnit: The Super Evolution of a Little Web Spider

Preface

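  A while ago I wrote an online novel scraper and reader (click here: https://www.cnblogs.com/huanzi-qch/p/9817831.html). When we tried to scrape a novel's catalog from Qidian, we found that the catalog data is not in the HTML. It is fetched by an Ajax request when the page loads, and the corresponding div stays hidden until you click "Catalog". After some digging we did find the interface URL and got the data by constructing a POST request with HttpClient, but that approach is tedious and costly. Is there another way?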
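  To make the cost of the manual route concrete, here is a minimal sketch of building such a POST request by hand. It is not the code from the earlier post: it uses the JDK's own java.net.http API (Java 11+) instead of Apache HttpClient, and the endpoint URL and the bookId parameter name are placeholders, since the real ones have to be dug out of the browser's network panel. The request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CatalogRequestSketch {

    // Hypothetical endpoint: stands in for whatever URL the network panel reveals.
    static final String CATALOG_URL = "https://example.com/ajax/book/category";

    /** Encodes the parameters as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) a POST request asking for a book's catalog. */
    static HttpRequest buildCatalogRequest(String bookId) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("bookId", bookId); // hypothetical parameter name
        return HttpRequest.newBuilder(URI.create(CATALOG_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }

    public static void main(String[] args) {
        // 1209977 is the book id that appears in the Qidian URL used later in this post.
        HttpRequest request = buildCatalogRequest("1209977");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

  Every site needs its own endpoint, parameters, and often extra headers or tokens, which is why this route does not scale well.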

A brief introduction to HtmlUnit

  The following introduction is quoted from the official website:  HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

  It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating Chrome, Firefox or Internet Explorer depending on the configuration used.

  It is typically used for testing purposes or to retrieve information from web sites.

  HtmlUnit is not a generic unit testing framework. It is specifically a way to simulate a browser for testing purposes and is intended to be used within another testing framework such as JUnit or TestNG. Refer to the document "Getting Started with HtmlUnit" for an introduction.

  HtmlUnit is used as the underlying "browser" by different Open Source tools like Canoo WebTest, JWebUnit, WebDriver, JSFUnit, WETATOR, Celerity, Spring MVC Test HtmlUnit, ...

  HtmlUnit was originally written by Mike Bowler of Gargoyle Software and is released under the Apache 2 license. Since then, it has received many contributions from other developers, and would not be where it is today without their assistance.

  HtmlUnit provides excellent JavaScript support, simulating the behavior of the configured browser (Firefox or Internet Explorer). It uses the Rhino JavaScript engine for the core language (plus workarounds for some Rhino bugs) and provides the implementation for the objects specific to execution in a browser.

Writing the code

  Maven dependency:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.32</version>
</dependency>

  For the catalog we fetched by hand earlier, we can now do this instead:

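import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

public class QidianCatalogSpider {
    public static void main(String[] args) {
        // Create a WebClient that simulates a specific browser;
        // try-with-resources closes it (and its windows) when we are done
        try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {

            // A few important options
            webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
            // Re-synchronize Ajax calls triggered by user actions, so they complete before click() returns
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(true); // throw on failing HTTP status codes
            webClient.getOptions().setThrowExceptionOnScriptError(true); // throw on JavaScript errors
            webClient.getOptions().setCssEnabled(false); // disable CSS: there is no UI, so nothing to render
            webClient.getOptions().setTimeout(10000); // connection timeout in milliseconds

            // Load the book detail and catalog page on Qidian
            HtmlPage page = webClient.getPage("https://book.qidian.com/info/1209977");

            // Give background JavaScript up to 5 seconds to finish
            webClient.waitForBackgroundJavaScript(5000);

            // Simulate a click on "Catalog"
            page = page.getHtmlElementById("j_catalogPage").click();

            // Print the page source after the scripts have run
            System.out.println(page.asXml());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}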
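  The code above waits on a fixed time budget. A pattern often used alongside it is to poll for the specific condition you care about, for example "the catalog element now has content", and to give up after a timeout. The helper below is a generic sketch of that idea using only the JDK. The name waitUntil and the stand-in condition are illustrative and not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

class WaitUntilSketch {

    /**
     * Polls the condition every intervalMillis until it holds or timeoutMillis elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // time budget used up
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition that becomes true about 300 ms after start. With HtmlUnit
        // it could instead inspect the page, e.g. check that an element has children.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 5000, 100);
        System.out.println("ready = " + ready);
    }
}
```

  Polling for a concrete condition returns as soon as the data is there and makes a timeout an explicit outcome that the caller can handle.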

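Results

  Before the JavaScript runs

   After the JavaScript has run, the Ajax request has completed and the data has been rendered. Fetching the page source at this point gives us HTML that contains the catalog data.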
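  Once the rendered markup contains the catalog, the chapter links still have to be pulled out of it. The sketch below does that with the JDK's built-in XPath support on a small sample string. The sample markup, the volume class name, and the example.com URLs are made up for illustration and are not Qidian's real structure. Real-world output may also contain constructs that a strict XML parser rejects. With HtmlUnit on the classpath, calling page.getByXPath(...) directly on the page is likely the more direct route.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CatalogExtractSketch {

    /** Returns "title -> href" for every link inside elements with the given class. */
    static List<String> extractChapters(String xml, String containerClass) {
        // local-name() keeps the match working even if the markup declares a default namespace
        String expr = "//*[@class='" + containerClass + "']//*[local-name()='a'][@href]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> chapters = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                chapters.add(link.getTextContent().trim() + " -> " + link.getAttribute("href"));
            }
            return chapters;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the given markup", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the output of page.asXml()
        String sample =
                "<html><body><div class=\"volume\"><ul>"
                + "<li><a href=\"//read.example.com/chapter/1\">Chapter 1</a></li>"
                + "<li><a href=\"//read.example.com/chapter/2\">Chapter 2</a></li>"
                + "</ul></div></body></html>";
        for (String chapter : extractChapters(sample, "volume")) {
            System.out.println(chapter);
        }
    }
}
```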

Closing remarks