WebMagic Crawler (2): Upgraded Edition

阿新 • Published: 2018-12-19

While crawling pages I found that a lot of the data is rendered into the page by JavaScript. You can check this with:

String htm = page.getHtml().xpath("*/html/html()").toString();

page.putField("html",htm);

This shows the page exactly as it was downloaded, so you can tell at a glance whether the data you want is in it. If it is not, further steps are needed.
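To make this check less manual, a small helper along the following lines may be useful. It is a minimal sketch using only the JDK; the class name `RawHtmlCheck`, the method `missingMarkers`, and the sample HTML are hypothetical and not part of WebMagic. The idea is to pass in the HTML string obtained above together with a few strings you expect to see in the fully rendered page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RawHtmlCheck {

    /** Returns the expected markers that do NOT occur in the downloaded HTML. */
    public static List<String> missingMarkers(String html, List<String> expectedMarkers) {
        List<String> missing = new ArrayList<>();
        for (String marker : expectedMarkers) {
            // A null page counts as "nothing found"
            if (html == null || !html.contains(marker)) {
                missing.add(marker);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical raw HTML: the product list is an empty container that a script fills in later
        String rawHtml = "<html><body><h1>Shop</h1><div id=\"products\"></div>"
                + "<script src=\"render.js\"></script></body></html>";
        List<String> missing = missingMarkers(rawHtml, Arrays.asList("Shop", "product-price"));
        System.out.println("Missing from raw HTML: " + missing);
    }
}
```

A non-empty result suggests that the missing data is added after the initial download, either by an AJAX request or by JavaScript rendering.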

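If the data arrives through an AJAX request, you can follow the AJAX crawling approach described by the WebMagic developer. The page I wanted to crawl is rendered directly by JavaScript, so I rendered it with PhantomJS and crawled the result.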

Simply put, PhantomJS is a web browser without a UI.

Of course, we can also use Chrome directly instead of PhantomJS, but crawling will be much slower.

1: Rendering and crawling with Chrome

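To use Chrome you must first download ChromeDriver, Chrome's browser driver. Download: http://chromedriver.storage.googleapis.com/index.html

Make sure the driver version matches your Chrome version: https://blog.csdn.net/llbacyal/article/details/78563992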

demo:

package spider;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.io.File;
import java.io.IOException;


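/**
 * ChromeDriver is Chrome's browser driver, used as the adapter for Selenium. Because it has a graphical UI,
 * it is relatively convenient when debugging what the crawler downloads and executes.
 * @author zhuangj
 * @date 2017/11/14
 */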
public class TestChromeDriver {

    private static ChromeDriverService service;

    public static WebDriver getChromeDriver() throws IOException {
        // Start a ChromeDriver service to connect to Chrome. chromedriver.exe can live anywhere,
        // as long as the path passed to new File() points to it. Passing the executable here makes
        // the "webdriver.chrome.driver" system property unnecessary (that property must point to
        // chromedriver, never to chrome.exe).
        service = new ChromeDriverService.Builder()
                .usingDriverExecutable(new File("F:\\chromedriver.exe"))
                .usingAnyFreePort()
                .build();
        service.start();
        // Create a Chrome browser instance (Selenium 3.x API)
        return new RemoteWebDriver(service.getUrl(), DesiredCapabilities.chrome());
    }

    public static void main(String[] args) throws IOException {

        WebDriver driver = TestChromeDriver.getChromeDriver();
        try {
            // Open Taobao in the browser
            driver.get("https://www.taobao.com/");
            // The following line does the same thing
            //driver.navigate().to("https://www.taobao.com/");
            // Print the page title
            System.out.println(" Page title is: " + driver.getTitle());
            // Find the search input in the DOM by its id
            WebElement element = driver.findElement(By.id("q"));
            // Type the search keyword
            element.sendKeys("東鵬瓷磚");
            // Submit the form that contains the input
            element.submit();
            // Wait up to 10 seconds for the results page to load, judged by its title
            new WebDriverWait(driver, 10).until(new ExpectedCondition<Boolean>() {
                @Override
                public Boolean apply(WebDriver input) {
                    return input.getTitle().toLowerCase().startsWith("東鵬瓷磚");
                }
            });
            // Print the title of the search results page
            System.out.println(" Page title is: " + driver.getTitle());
        } finally {
            // Close the browser
            driver.quit();
            // Stop the ChromeDriver service
            service.stop();
        }
    }

}
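The WebDriverWait call in the demo boils down to polling a condition until it holds or a timeout expires. The sketch below illustrates that idea with plain JDK code. It is not Selenium's implementation; `PollingWait`, `waitUntil`, and the timing values are hypothetical names and numbers chosen for illustration.

```java
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition held in time, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the page title has changed": becomes true after about 300 ms
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2_000, 50);
        System.out.println("Condition met: " + ready);
    }
}
```

One difference worth noting: this sketch returns false when time runs out, whereas WebDriverWait signals a timeout by throwing an exception.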

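2: Crawling with PhantomJS

First download PhantomJS: http://phantomjs.org/download.html

Linux installation guide: https://www.cnblogs.com/yestreenstars/p/5511212.html
When installing, it is best to pick the latest version; the latest version number can be found through the download link above.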

demo:

package spider;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

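/**
 * PhantomJS is a headless browser built on the WebKit engine. It has no UI, but it is still a browser;
 * user actions such as clicking and paging simply have to be driven programmatically.
 * Launching Chrome for every crawl costs performance, and keeping a dozen or more browsers open
 * is a memory nightmare, so PhantomJS is used here in place of ChromeDriver.
 * PhantomJS is fine for local development, but deploying to a server requires downloading the Linux build,
 * which is more cumbersome to set up than on Windows.
 * @author zhuangj
 * @date 2017/11/14
 */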
public class TestPhantomJsDriver {


    public static PhantomJSDriver getPhantomJSDriver() throws Exception{
        // Set the required capabilities
        DesiredCapabilities dcaps = new DesiredCapabilities();
        // Accept SSL certificates
        dcaps.setCapability("acceptSslCerts", true);
        // Screenshot support (turned off here)
        dcaps.setCapability("takesScreenshot", false);
        // CSS selector support
        dcaps.setCapability("cssSelectorsEnabled", true);
        // Enable JavaScript
        dcaps.setJavascriptEnabled(true);
        // Path to the PhantomJS executable
        dcaps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,"F:\\phantomjs\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe");

        PhantomJSDriver driver = new PhantomJSDriver(dcaps);
        return  driver;
    }

    public static void main(String[] args) throws Exception {
        WebDriver driver=getPhantomJSDriver();
        driver.get("https://www.886er.com/vod/17/11675-1.html");

        WebElement webElement = driver.findElement(By.xpath("/html"));
        System.out.println("outerHTML======="+webElement.getAttribute("outerHTML"));
        System.out.println("url========="+driver.getCurrentUrl());
        driver.quit();
    }
}

Credit for the two demos above (TestChromeDriver and TestPhantomJsDriver) goes to: https://www.cnblogs.com/null-qige/p/7844381.html
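The class comment of TestPhantomJsDriver notes that deploying to a server requires the Linux build of PhantomJS. One way to keep a single code base is to choose the executable path from the operating system name. The following is a minimal sketch under stated assumptions: `PhantomJsPathResolver` and `resolve` are hypothetical names, and the Linux path is an assumed install location that should be adjusted to wherever PhantomJS actually lives.

```java
import java.util.Locale;

public class PhantomJsPathResolver {

    /** Picks the Windows path when osName starts with "Windows", otherwise the Linux path. */
    public static String resolve(String osName, String windowsPath, String linuxPath) {
        boolean isWindows = osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return isWindows ? windowsPath : linuxPath;
    }

    public static void main(String[] args) {
        String path = resolve(
                System.getProperty("os.name"),
                // Windows path taken from the PhantomJS demo
                "F:\\phantomjs\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe",
                // Assumed Linux install location (hypothetical)
                "/usr/local/bin/phantomjs");
        System.out.println("PhantomJS executable: " + path);
    }
}
```

The resolved string could then be passed as the value of PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY in getPhantomJSDriver() instead of the hard-coded Windows path.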

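With this, data that needs only simple rendering can basically all be crawled, and the approach can be integrated directly into WebMagic. The WebMagic author has also integrated this as a module. I first put together my own integration and found it less convenient than the author's, so next I will walk through that module in WebMagic.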

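First we need to download the source code: https://gitee.com/flashsword20/webmagic.git

Because this functionality is too heavyweight, the author did not publish this module to Maven, so we need to download the source and modify it slightly ourselves.

After downloading the source, open this module: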