Scraping Dynamically Loaded Page Data with Java

阿新 • Published: 2019-01-25

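There are two ways to scrape data from a dynamically loaded page:

1. Simulate the requests the page makes and read the data returned by the API directly.
2. Render the page in an embedded browser, then read the data from the rendered result.

Analysis

Simulating the network requests means reconstructing their parameters by hand. It works, but it is tedious. This article takes the blunt but simple route instead: render the page in an embedded browser and scrape the result.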
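For comparison, the first approach (assembling request parameters by hand) can be sketched roughly as follows. This is only an illustration: the endpoint `http://example.com/api/pins` and the parameter names `limit` and `q` are hypothetical, and a real site would need its own parameters worked out from the browser's network panel.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds an API URL from hand-assembled query parameters. */
final class ApiUrlBuilder {

    static String buildUrl(String endpoint, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(endpoint);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            // Form-style encoding: a space becomes '+'.
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("limit", "20");          // hypothetical parameter
        params.put("q", "poster art");      // hypothetical parameter
        // prints http://example.com/api/pins?limit=20&q=poster+art
        System.out.println(buildUrl("http://example.com/api/pins", params));
    }
}
```

The resulting URL would then be fetched with any HTTP client. The tedious part is discovering which parameters the page actually sends, which is why the rest of the article renders the page instead.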

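So how do we embed a browser?

Anyone familiar with automated testing will know Selenium, a tool that drives a browser for automated tests. Selenium provides a set of APIs for interacting with a real browser engine. It is cross-language, with bindings for Java, C#, Python and others, and it supports multiple browsers, including Chrome, Firefox and IE.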

Implementation

We will write the demo in Java.

Add dependencies

Add the Selenium dependencies, using Maven as an example:

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.0.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.0.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-server</artifactId>
    <version>2.18.0</version>
</dependency>

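Download the driver

For Chrome, for example: https://sites.google.com/a/chromium.org/chromedriver/

After downloading, it is best to add the driver to your environment variables. Alternatively, set the system property in code before the driver is used: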

System.getProperties().setProperty("webdriver.chrome.driver",
    "/Users/zhenguo/Documents/chrome/chromedriver");

Note: on macOS, make sure chromedriver is executable.
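One way to catch a non-executable driver early is to check the file before setting the property. The sketch below uses only the standard library; the class name `DriverCheck` and the default path `chromedriver` are made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical pre-flight check for the chromedriver binary. */
final class DriverCheck {

    /** Returns true if the path is a regular file that the JVM is allowed to execute. */
    static boolean isUsableDriver(Path driver) {
        return Files.isRegularFile(driver) && Files.isExecutable(driver);
    }

    public static void main(String[] args) {
        // Pass the driver location as the first argument; the default is a placeholder.
        Path driver = Paths.get(args.length > 0 ? args[0] : "chromedriver");
        if (isUsableDriver(driver)) {
            System.setProperty("webdriver.chrome.driver",
                    driver.toAbsolutePath().toString());
            System.out.println("Using driver at " + driver.toAbsolutePath());
        } else {
            System.err.println("Driver missing or not executable, try: chmod +x " + driver);
        }
    }
}
```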

Install the Chrome browser

Test Selenium with the following code:

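@Ignore("need chrome driver")
@Test
public void testSelenium() {
    // Point Selenium at the local chromedriver binary.
    System.getProperties().setProperty("webdriver.chrome.driver",
        "/Users/zhenguo/Documents/chrome/chromedriver");
    WebDriver webDriver = new ChromeDriver();
    try {
        webDriver.get("http://huaban.com/");
        // Read the rendered DOM rather than the raw HTTP response.
        WebElement webElement = webDriver.findElement(By.xpath("/html"));
        System.out.println(webElement.getAttribute("outerHTML"));
    } finally {
        // quit() ends the session and the driver process;
        // close() would only close the current window.
        webDriver.quit();
    }
}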

If you see output similar to the following, WebDriver is configured correctly:

Starting ChromeDriver 2.25.426935 (820a95b0b81d33e42712f9198c215f703412e1a1) on port 2052
Only local connections are allowed.
Nov 07, 2016 12:35:11 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Attempting bi-dialect session, assuming Postel's Law holds true on the remote end
Nov 07, 2016 12:35:13 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: OSS

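PS: every call to new ChromeDriver() makes Selenium start a Chrome process, and Java talks to that process over a random port. When you are done, call webDriver.quit() to shut the process down (webDriver.close() only closes the current window). For a web crawler that fetches many pages, it is best to handle this with a pool.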
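One way to read that advice is to keep a small fixed set of browser instances and reuse them instead of starting one per page. The following is a minimal standard-library sketch of the idea, not the implementation used later in this article: `BrowserPool` is a hypothetical name, and the type parameter `T` stands in for `WebDriver` (a `StringBuilder` is used in `main` purely so the example runs without Selenium).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; T stands in for WebDriver. */
final class BrowserPool<T> {

    private final BlockingQueue<T> idle;

    BrowserPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // with Selenium, each call would start one browser
        }
    }

    /** Borrows an instance, runs the task, and always hands the instance back. */
    <R> R withBrowser(Function<T, R> task) {
        T browser;
        try {
            browser = idle.take(); // blocks while every instance is busy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a browser", e);
        }
        try {
            return task.apply(browser);
        } finally {
            idle.add(browser); // returned even if the task throws
        }
    }

    int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        BrowserPool<StringBuilder> pool = new BrowserPool<>(2, StringBuilder::new);
        String html = pool.withBrowser(b -> b.append("<html/>").toString());
        System.out.println(html + " idle=" + pool.idleCount());
    }
}
```

Because the pool size caps the number of live instances, the number of browser processes stays bounded no matter how many crawler threads call `withBrowser`.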

Implementing the crawler

With the steps above in place, a crawler built on webmagic is fairly simple. The code is as follows:

public class HuabanProcessor implements PageProcessor {

    private Site site;

    @Override
    public void process(Page page) {
        // Queue every link on the same site for crawling.
        page.addTargetRequests(
                page.getHtml().links().regex("http://huaban\\.com/.*").all());
        if (page.getUrl().toString().contains("pins")) {
            // Pin detail pages: extract the rendered image URL.
            page.putField("img", page.getHtml()
                    .xpath("//div[@id='baidu_image_holder']/img/@src").toString());
        } else {
            page.getResultItems().setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        if (null == site) {
            site = Site.me().setDomain("huaban.com").setSleepTime(0);
        }
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new HuabanProcessor()).thread(5)
                .addPipeline(new FilePipeline("/Users/zhenguo/Documents/chrome/webmagic/test/"))
                .setDownloader(new SeleniumDownloader("/Users/zhenguo/Documents/chrome/chromedriver"))
                .addUrl("http://huaban.com/explore/gufenghaibao/")
                .runAsync();
    }
}
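
The two decisions made in process() (which links to follow, and which pages to extract from) can be tried out in isolation. The sketch below approximates them with java.util.regex only; it assumes the link filter keeps any link in which the pattern is found, which may differ in detail from WebMagic's own selector, and the sample URLs other than the start URL are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Stand-alone approximation of the link filtering done in HuabanProcessor. */
final class LinkFilter {

    private static final Pattern SAME_SITE = Pattern.compile("http://huaban\\.com/.*");

    /** Keeps only links that point at huaban.com. */
    static List<String> followable(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (SAME_SITE.matcher(link).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Mirrors the contains("pins") check that marks a pin detail page. */
    static boolean isDetailPage(String url) {
        return url.contains("pins");
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://huaban.com/pins/12345/",            // hypothetical pin URL
                "http://example.com/other",                 // off-site, dropped
                "http://huaban.com/explore/gufenghaibao/");
        for (String link : followable(links)) {
            System.out.println(link + " detail=" + isDetailPage(link));
        }
    }
}
```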

The HuabanProcessor above uses SeleniumDownloader, whose code is as follows:

package us.codecraft.webmagic.downloader.selenium;

import org.apache.log4j.Logger;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;

import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

/**
 * Renders pages by driving a browser through Selenium. Currently only Chrome is supported.<br>
 * Requires a downloaded Selenium driver.<br>
 *
 * @author [email protected] <br>
 *         Date: 13-7-26 <br>
 *         Time: 1:37 PM <br>
 */
 public class SeleniumDownloader implements Downloader, Closeable {

     private volatile WebDriverPool webDriverPool;

     private Logger logger = Logger.getLogger(getClass());

     private int sleepTime = 0;

     private int poolSize = 1;

     private static final String DRIVER_PHANTOMJS = "phantomjs";

     /**
      * Creates a new downloader.
      *
      * @param chromeDriverPath chromeDriverPath
      */
     public SeleniumDownloader(String chromeDriverPath) {
         System.getProperties().setProperty("webdriver.chrome.driver",
                 chromeDriverPath);
     }

     /**
      * Constructor without any field. Constructs a PhantomJS browser.
      * 
      * @author [email protected]
      */
     public SeleniumDownloader() {
         // System.setProperty("phantomjs.binary.path",
         // "/Users/Bingo/Downloads/phantomjs-1.9.7-macosx/bin/phantomjs");
     }

     /**
      * set sleep time to wait until load success
      *
      * @param sleepTime sleepTime
      * @return this
      */
     public SeleniumDownloader setSleepTime(int sleepTime) {
         this.sleepTime = sleepTime;
         return this;
     }

     @Override
     public Page download(Request request, Task task) {
         checkInit();
         WebDriver webDriver;
         try {
             webDriver = webDriverPool.get();
         } catch (InterruptedException e) {
             logger.warn("interrupted", e);
             Thread.currentThread().interrupt(); // preserve the interrupt flag
             return null;
         }
         try {
             logger.info("downloading page " + request.getUrl());
             webDriver.get(request.getUrl());
             try {
                 // Fixed pause so that scripts on the page can finish rendering.
                 Thread.sleep(sleepTime);
             } catch (InterruptedException e) {
                 Thread.currentThread().interrupt();
             }
             WebDriver.Options manage = webDriver.manage();
             Site site = task.getSite();
             if (site.getCookies() != null) {
                 for (Map.Entry<String, String> cookieEntry : site.getCookies()
                         .entrySet()) {
                     Cookie cookie = new Cookie(cookieEntry.getKey(),
                             cookieEntry.getValue());
                     manage.addCookie(cookie);
                 }
             }

             /*
              * TODO You can add mouse event or other processes
              * 
              * @author: [email protected]
              */

             // Read the rendered DOM rather than the raw HTTP response.
             WebElement webElement = webDriver.findElement(By.xpath("/html"));
             String content = webElement.getAttribute("outerHTML");
             Page page = new Page();
             page.setRawText(content);
             page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content,
                     request.getUrl())));
             page.setUrl(new PlainText(request.getUrl()));
             page.setRequest(request);
             return page;
         } finally {
             // Return the driver to the pool even if rendering throws.
             webDriverPool.returnToPool(webDriver);
         }
     }

     private void checkInit() {
         if (webDriverPool == null) {
             synchronized (this) {
                 // Re-check under the lock so that only one pool is ever created.
                 if (webDriverPool == null) {
                     webDriverPool = new WebDriverPool(poolSize);
                 }
             }
         }
     }

     @Override
     public void setThread(int thread) {
         this.poolSize = thread;
     }

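     @Override
     public void close() throws IOException {
         // The pool is created lazily, so it may never have been initialised.
         if (webDriverPool != null) {
             webDriverPool.closeAll();
         }
     }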
 }
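
The downloader creates its pool lazily in checkInit(). A safe form of that pattern is double-checked locking, where the field is volatile and is checked again inside the synchronized block. The sketch below shows the idiom with standard-library types only; `LazyHolder` and its counter are hypothetical names used to make the behaviour observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder demonstrating double-checked lazy initialisation. */
final class LazyHolder {

    final AtomicInteger created = new AtomicInteger();

    // volatile is needed so other threads see a fully constructed object.
    private volatile Object resource;

    Object get() {
        Object local = resource;
        if (local == null) {                 // first check, without locking
            synchronized (this) {
                local = resource;
                if (local == null) {         // second check, under the lock
                    created.incrementAndGet();
                    resource = local = new Object();
                }
            }
        }
        return local;
    }

    public static void main(String[] args) throws InterruptedException {
        LazyHolder holder = new LazyHolder();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(holder::get);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("instances created: " + holder.created.get());
    }
}
```

Without the second check, two threads that both pass the first check could each create a resource, and with browser pools that would mean leaked browser processes.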
The WebDriverPool code is as follows:

package us.codecraft.webmagic.downloader.selenium;

import org.apache.log4j.Logger;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

import java.io.FileReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * @author [email protected] <br>
 *         Date: 13-7-26 <br>
 *         Time: 1:41 PM <br>
 */
class WebDriverPool {
    private Logger logger = Logger.getLogger(getClass());

    private final static int DEFAULT_CAPACITY = 5;

    private final int capacity;

    private final static int STAT_RUNNING = 1;

    private final static int STAT_CLODED = 2;

    private AtomicInteger stat = new AtomicInteger(STAT_RUNNING);

    /*
     * new fields for configuring phantomJS
     */
    private WebDriver mDriver = null;
    private boolean mAutoQuitDriver = true;

    private static final String CONFIG_FILE = "/Users/zhenguo/Documents/develop/github/webmagic/webmagic-selenium/config.ini";
    private static final String DRIVER_FIREFOX = "firefox";
    private static final String DRIVER_CHROME = "chrome";
    private static final String DRIVER_PHANTOMJS = "phantomjs";

    protected static Properties sConfig;
    protected static DesiredCapabilities sCaps;

    /**
     * Configure the GhostDriver, and initialize a WebDriver instance. This part
     * of code comes from GhostDriver.
     * https://github.com/detro/ghostdriver/tree/master/test/java/src/test/java/ghostdriver
     * 
     * @author [email protected]
     * @throws IOException
     */
    public void configure() throws IOException {
        // Read config file, closing the reader once it has been loaded
        sConfig = new Properties();
        try (FileReader reader = new FileReader(CONFIG_FILE)) {
            sConfig.load(reader);
        }

        // Prepare capabilities
        sCaps = new DesiredCapabilities();
        sCaps.setJavascriptEnabled(true);
        sCaps.setCapability("takesScreenshot", false);

        String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

        // Fetch PhantomJS-specific configuration parameters
        if (driver.equals(DRIVER_PHANTOMJS)) {
            // "phantomjs_exec_path"
            if (sConfig.getProperty("phantomjs_exec_path") != null) {
                sCaps.setCapability(
                        PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                        sConfig.getProperty("phantomjs_exec_path"));
            } else {
                throw new IOException(
                        String.format(
                                "Property '%s' not set!",
                                PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY));
            }
            // "phantomjs_driver_path"
            if (sConfig.getProperty("phantomjs_driver_path") != null) {
                System.out.println("Test will use an external GhostDriver");
                sCaps.setCapability(
                        PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_PATH_PROPERTY,
                        sConfig.getProperty("phantomjs_driver_path"));
            } else {
                System.out
                        .println("Test will use PhantomJS internal GhostDriver");
            }
        }

        // Disable "web-security", enable all possible "ssl-protocols" and
        // "ignore-ssl-errors" for PhantomJSDriver
        // sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new
        // String[] {
        // "--web-security=false",
        // "--ssl-protocol=any",
        // "--ignore-ssl-errors=true"
        // });

        ArrayList<String> cliArgsCap = new ArrayList<String>();
        cliArgsCap.add("--web-security=false");
        cliArgsCap.add("--ssl-protocol=any");
        cliArgsCap.add("--ignore-ssl-errors=true");
        sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
                cliArgsCap);

        // Control LogLevel for GhostDriver, via CLI arguments
        sCaps.setCapability(
                PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
                new String[] { "--logLevel="
                        + (sConfig.getProperty("phantomjs_driver_loglevel") != null ? sConfig
                                .getProperty("phantomjs_driver_loglevel")
                                : "INFO") });

        // String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

        // Start appropriate Driver
        if (isUrl(driver)) {
            sCaps.setBrowserName("phantomjs");
            mDriver = new RemoteWebDriver(new URL(driver), sCaps);
        } else if (driver.equals(DRIVER_FIREFOX)) {
            mDriver = new FirefoxDriver(sCaps);
        } else if (driver.equals(DRIVER_CHROME)) {
            mDriver = new ChromeDriver(sCaps);
        } else if (driver.equals(DRIVER_PHANTOMJS)) {
            mDriver = new PhantomJSDriver(sCaps);
        }
    }

    /**
     * check whether input is a valid URL
     * 
     * @author [email protected]
     * @param urlString urlString
     * @return true means yes, otherwise no.
     */
    private boolean isUrl(String urlString) {
        try {
            new URL(urlString);
            return true;
        } catch (MalformedURLException mue) {
            return false;
        }
    }

    /**
     * store webDrivers created
     */
    private List<WebDriver> webDriverList = Collections
            .synchronizedList(new ArrayList<WebDriver>());

    /**
     * store webDrivers available
     */
    private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>();

    public WebDriverPool(int capacity) {
        this.capacity = capacity;
    }

    public WebDriverPool() {
        this(DEFAULT_CAPACITY);
    }

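    /**
     * Borrows a WebDriver from the pool, creating a new one lazily while the
     * pool is still below capacity.
     * 
     * @return an idle WebDriver; blocks until one is available
     * @throws InterruptedException if interrupted while waiting
     */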
    public WebDriver get() throws InterruptedException {
        checkRunning();
        WebDriver poll = innerQueue.poll();
        if (poll != null) {
            return poll;
        }
        if (webDriverList.size() < capacity) {
            synchronized (webDriverList) {
                if (webDriverList.size() < capacity) {

                    // add new WebDriver instance into pool
                    try {
                        configure();
                        innerQueue.add(mDriver);
                        webDriverList.add(mDriver);
                    } catch (IOException e) {
                        logger.warn("failed to configure WebDriver", e);
                    }

                    // ChromeDriver e = new ChromeDriver();
                    // WebDriver e = getWebDriver();
                    // innerQueue.add(e);
                    // webDriverList.add(e);
                }
            }

        }
        return innerQueue.take();
    }

    public void returnToPool(WebDriver webDriver) {
        checkRunning();
        innerQueue.add(webDriver);
    }

    protected void checkRunning() {
        // The pool is usable only while it is still in the running state.
        if (stat.get() != STAT_RUNNING) {
            throw new IllegalStateException("Already closed!");
        }
    }

    public void closeAll() {
        boolean b = stat.compareAndSet(STAT_RUNNING, STAT_CLODED);
        if (!b) {
            throw new IllegalStateException("Already closed!");
        }
        for (WebDriver webDriver : webDriverList) {
            logger.info("Quit webDriver " + webDriver);
            webDriver.quit();
        }
    }

}
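
WebDriverPool.configure() reads its settings from a properties-style config.ini. The key names below (driver, phantomjs_exec_path, phantomjs_driver_loglevel) are the ones the code above looks up; the values and the class name `DriverConfig` are made up for illustration. A small sketch of how the driver choice falls back to PhantomJS when the key is absent:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical reader mirroring how configure() picks the driver name. */
final class DriverConfig {

    /** Returns the configured driver, defaulting to "phantomjs" as configure() does. */
    static String driverName(String iniText) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(iniText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config.getProperty("driver", "phantomjs");
    }

    public static void main(String[] args) {
        // Example config.ini content; the path and log level are placeholders.
        String ini = String.join("\n",
                "# one of: chrome, firefox, phantomjs, or a RemoteWebDriver URL",
                "driver=chrome",
                "phantomjs_exec_path=/usr/local/bin/phantomjs",
                "phantomjs_driver_loglevel=INFO");
        System.out.println(driverName(ini));   // prints chrome
        System.out.println(driverName(""));    // prints phantomjs
    }
}
```

Note that configure() only requires phantomjs_exec_path when the driver is phantomjs, so a Chrome setup should be able to omit it.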

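That completes scraping data from dynamically loaded pages. This article uses Selenium for rendering, but there are many other options, such as PhantomJS and HtmlUnit. Try them when you have time. I hope this article helps.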
