Implementing a Web Crawler on Spark
Crawling is an important way to acquire large-scale web data, and crawlers are by now a very mature technology; still, it is interesting to see how one behaves in a Spark environment.
It turns out to be quite simple: build a JavaSparkContext, and the usual Java page-fetching code can be reused inside Spark transformations.
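A minimal driver setup might look like the following. This is a sketch for a local test run; the app name and the `local[4]` master are assumptions (on a real cluster the master would normally be supplied via `spark-submit`):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical local configuration for testing the crawler.
SparkConf conf = new SparkConf()
        .setAppName("SparkCrawler")   // name is illustrative
        .setMaster("local[4]");       // run locally with 4 worker threads
JavaSparkContext sc = new JavaSparkContext(conf);
```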
First, supply a few seed URLs and turn them into a JavaRDD, then map each URL to the HTML it fetches:
JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
        "http://example.com/"));  // seed URLs (parallelize takes a List, not a String)

JavaRDD<String> content = rdd.map(new Function<String, String>() {
    public String call(String url) throws Exception {
        System.out.println(url);
        CloseableHttpClient client = null;
        HttpGet get = null;
        CloseableHttpResponse response = null;
        try {
            // Create a default HTTP client and issue the GET request
            client = HttpClients.createDefault();
            get = new HttpGet(url);
            response = client.execute(get);
            HttpEntity entity = response.getEntity();
            // Read the response body into a byte stream
            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            entity.writeTo(byteArrayOutputStream);
            // Decode the bytes to a string (assuming UTF-8 here)
            String html = new String(byteArrayOutputStream.toByteArray(), Charsets.UTF_8);
            return html;
        } catch (Exception ex) {
            ex.printStackTrace();
            return "";
        } finally {
            if (response != null) {
                response.close();
            }
            if (client != null) {
                client.close();
            }
        }
    }
});
Note that map is lazy: nothing is actually fetched until an action such as content.collect() or content.saveAsTextFile(...) runs.
Of course, you can then extract child-page links from the fetched HTML (for example by parsing it with Jsoup) and keep crawling, expanding the frontier depth-first or breadth-first.
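The link-extraction step can be sketched as a pure helper plus a flatMap over the fetched pages. The version below is a self-contained sketch using a simple regex from the standard library; in practice Jsoup's `document.select("a[href]")` is far more robust against malformed HTML. The class and method names are illustrative, and the Spark wiring is shown only in a comment:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Crude href matcher; it stops at '#' so fragments are dropped.
    // Jsoup's HTML parser handles malformed markup much better.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Extracts hrefs from the HTML and resolves them against the page URL. */
    public static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                String abs = base.resolve(m.group(1)).toString();
                if (abs.startsWith("http")) {
                    links.add(abs);
                }
            } catch (IllegalArgumentException ignored) {
                // skip malformed URLs
            }
        }
        return links;
    }

    // Spark wiring (requires spark-core on the classpath). To resolve relative
    // links correctly, keep (url, html) pairs, e.g. via mapToPair, so each page
    // carries its own base URL:
    //
    // JavaRDD<String> nextFrontier = pages
    //         .flatMap(p -> LinkExtractor.extractLinks(p._2, p._1).iterator())
    //         .distinct();
}
```

Feeding `nextFrontier` back through the fetch map above, round by round, gives a breadth-first crawl; tracking visited URLs (e.g. with `subtract`) keeps it from looping.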