Dynamic Web Page Crawling Examples (WebCollector + selenium + phantomjs)
Goal: crawl dynamic web pages
Notes: "dynamic" here covers two common cases: 1) pages that require user interaction, such as the usual login flow; 2) pages whose content is generated by JS / AJAX. For example, the HTML may ship with <div id="test"></div>, which JS then turns into <div id="test"><span>aaa</span></div>.
WebCollector 2 is used for the crawling itself, and it is convenient enough; but the key to handling dynamic pages is a second API -- selenium 2 (which integrates htmlunit and phantomjs).
1) Crawling that requires login, e.g. Sina Weibo
import java.util.Set;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * Crawling after login
 * Refer: http://nutcher.org/topics/33
 *        https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
 * Lib required: webcollector-2.07-bin, selenium-java-2.44.0 & its lib
 */
public class WebCollector1 extends DeepCrawler {

    public WebCollector1(String crawlPath) {
        super(crawlPath);
        /* Fetch the Sina Weibo cookie. The account and password are sent in
           plain text, so use a throwaway account. */
        try {
            String cookie = WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
            HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
            myRequester.setCookie(cookie);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the Weibo posts */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments as well, extract the comment-page URLs here and return them */
        return null;
    }

    public static void main(String[] args) {
        WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of one user's Weibo */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class WeiboCN {

        /**
         * Fetch the Sina Weibo cookie. This works for weibo.cn but not for weibo.com.
         * weibo.cn transmits data in plain text, so use a throwaway account.
         * @param username Sina Weibo username
         * @param password Sina Weibo password
         * @return the cookie string
         * @throws Exception if login fails
         */
        public static String getSinaCookie(String username, String password) throws Exception {
            StringBuilder sb = new StringBuilder();
            HtmlUnitDriver driver = new HtmlUnitDriver();
            driver.setJavascriptEnabled(true);
            driver.get("http://login.weibo.cn/login/");

            WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
            mobile.sendKeys(username);
            WebElement pass = driver.findElementByCssSelector("input[name^=password]");
            pass.sendKeys(password);
            WebElement rem = driver.findElementByCssSelector("input[name=remember]");
            rem.click();
            WebElement submit = driver.findElementByCssSelector("input[name=submit]");
            submit.click();

            Set<Cookie> cookieSet = driver.manage().getCookies();
            driver.close();
            for (Cookie cookie : cookieSet) {
                sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
            }
            String result = sb.toString();
            if (result.contains("gsid_CTandWM")) {
                return result;
            } else {
                throw new Exception("weibo login failed");
            }
        }
    }
}
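The cookie string assembled in getSinaCookie is nothing more than the "name=value;" pairs from the driver's cookie set, concatenated. A minimal stdlib-only sketch of that step (the CookieHeader class and its map argument are illustrative stand-ins for Selenium's Cookie set, not part of WebCollector or selenium):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the cookie-header assembly done in WeiboCN.getSinaCookie:
// each cookie contributes one "name=value;" pair to the header string.
public class CookieHeader {
    public static String build(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            sb.append(cookie.getKey()).append('=').append(cookie.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, mimicking the iteration
        // over driver.manage().getCookies()
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("gsid_CTandWM", "abc123");
        cookies.put("SUB", "xyz");
        System.out.println(build(cookies));  // gsid_CTandWM=abc123;SUB=xyz;
    }
}
```

The login is considered successful only if the resulting string contains the gsid_CTandWM cookie, which is why getSinaCookie checks for that substring.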
* The custom path /home/hu/data/weibo (WebCollector1 crawler = new WebCollector1("/home/hu/data/weibo");) is where crawl state is persisted in the embedded Berkeley DB.
* The code is essentially the WebCollector author's own sample.
2) Crawling HTML elements dynamically generated by JS
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/*
 * Crawling JS-generated content
 * Refer: http://blog.csdn.net/smilings/article/details/7395509
 */
public class WebCollector3 extends DeepCrawler {

    public WebCollector3(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* HtmlUnitDriver can extract JS-generated data */
        // HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        // String content = PageUtils.getPhantomJSDriver(page);
        WebDriver driver = PageUtils.getWebDriver(page);
        // List<WebElement> divInfos = driver.findElementsByCssSelector("#feed_content");
        List<WebElement> divInfos = driver.findElements(By.cssSelector("#feed_content span"));
        for (WebElement divInfo : divInfos) {
            System.out.println("Text: " + divInfo.getText());
        }
        return null;
    }

    public static void main(String[] args) {
        WebCollector3 crawler = new WebCollector3("/home/hu/data/wb");
        for (int page = 1; page <= 5; page++) {
            // crawler.addSeed("http://www.sogou.com/web?query=" + URLEncoder.encode("編程") + "&page=" + page);
            crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
        }
        try {
            crawler.start(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
PageUtils.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;

import cn.edu.hfut.dmic.webcollector.model.Page;

public class PageUtils {

    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\Installs\\Develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path",
                "D:\\Installs\\Develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        // JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function(){}");
        return driver;
    }

    public static String getPhantomJSDriver(Page page) {
        Runtime rt = Runtime.getRuntime();
        Process process = null;
        try {
            process = rt.exec("D:\\Installs\\Develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe "
                    + "D:\\workspace\\crawlTest1\\src\\crawlTest1\\parser.js " + page.getUrl().trim());
            InputStream in = process.getInputStream();
            InputStreamReader reader = new InputStreamReader(in, "UTF-8");
            BufferedReader br = new BufferedReader(reader);
            StringBuffer sbf = new StringBuffer();
            String tmp = "";
            while ((tmp = br.readLine()) != null) {
                sbf.append(tmp);
            }
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
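getPhantomJSDriver above shells out to the phantomjs executable and collects everything it writes to stdout. The same pattern can be sketched with stdlib-only ProcessBuilder; here a plain `echo` command (assumed to be on the PATH, as on Linux) stands in for phantomjs so the sketch runs without a PhantomJS install:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch of the "native PhantomJS" pattern used in getPhantomJSDriver:
// launch an external process and collect its stdout into a String.
public class ProcessOutput {
    public static String run(String... command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
        }
        process.waitFor(); // reap the child process
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // With PhantomJS installed this would instead be something like:
        // run("phantomjs", "parser.js", "http://cq.qq.com/baoliao/detail.htm?294064")
        System.out.println(run("echo", "hello"));
    }
}
```

Passing the command as separate arguments avoids the whitespace-splitting pitfalls of the single-string Runtime.exec call used in PageUtils.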
2.2) Several driver types are used here: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS; see http://blog.csdn.net/five3/article/details/19085303. Their pros and cons:
driver type | pros | cons | typical use |
real browser driver | faithfully simulates user behavior | slow and less stable | compatibility testing |
HtmlUnit | fast | its JS engine is not the one used by mainstream browsers | pages with little JS |
PhantomJS | moderate speed; behavior close to a real browser | cannot mimic a specific browser's quirks | headless (non-GUI) functional testing |
2.3) When using PhantomJSDriver I hit the error ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies, which surprisingly turned out to be a bug in selenium 2.44. It was resolved by pulling in phantomjsdriver-1.2.1.jar via Maven.
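For reference, the Maven dependency for that standalone GhostDriver binding looked roughly like the snippet below; the group/artifact coordinates are from memory, so verify them against Maven Central before use:

```xml
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.1</version>
</dependency>
```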
2.4) I also tried invoking PhantomJS natively (i.e., calling the phantomjs executable directly instead of going through selenium; see getPhantomJSDriver above, invoked as `phantomjs parser.js <url>`). The native call runs a JS script; the parser.js used here is:
var system = require('system');
var address = system.args[1]; // the second command-line argument: the URL to load
//console.log('Loading a web page');
var page = require('webpage').create();
var url = address;
//console.log(url);
page.open(url, function (status) {
    // Page is loaded!
    if (status !== 'success') {
        console.log('Unable to post!');
    } else {
        // This print streams the rendered page back to Java,
        // which reads it through the process's InputStream
        console.log(page.content);
    }
    phantom.exit();
});
3) Closing notes
3.1) HtmlUnitDriver + PhantomJSDriver is currently the most reliable combination for dynamic crawling.
3.2) Along the way I needed quite a few jars and executables and ran into plenty of (network) walls; anyone who needs them is welcome to ask me.
Reference
http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html