Learning Java Web Crawling: Building a Simple Java Crawler with the HttpClient and Jsoup Libraries
阿新 • Published: 2019-01-05
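Introduction to HttpClient
HttpClient began as a subproject of Apache Jakarta Commons and is now maintained as part of Apache HttpComponents (the org.apache.httpcomponents artifact used in this article). It provides an efficient, feature-rich client-side programming toolkit for the HTTP protocol; the 4.x line used here speaks HTTP/1.0 and HTTP/1.1. Its main features are:
- (1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
- (2) Supports automatic redirects
- (3) Supports HTTPS
- (4) Supports proxy servers, among other things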
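Introduction to Jsoup
jsoup is an HTML parser for Java that can parse a URL, a file, or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:
- (1) Parse HTML from a URL, a file, or a string;
- (2) Find and extract data using DOM traversal or CSS selectors;
- (3) Manipulate HTML elements, attributes, and text;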
Usage steps
Add the dependencies to the Maven project
The dependencies in pom.xml are as follows:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
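Write the JUnit test code
Code (the org.junit.Test annotation requires JUnit 4 on the test classpath, which the pom.xml above does not list):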
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;
import java.util.List;
/**
 * HttpClient & Jsoup library test class
 *
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {
    @Test
    public void test() {
        // Fetch the page with HttpClient and read the response body as plain text
        HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        String responseStr = "";
        try {
            httpClient = HttpClientBuilder.create().build();
            HttpClientContext context = HttpClientContext.create();
            response = httpClient.execute(httpGet, context);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Only keep the body of a successful (200) response
            if (state == 200 && entity != null)
                responseStr = EntityUtils.toString(entity, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null)
                    response.close();
                if (httpClient != null)
                    httpClient.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
        // Nothing to parse if the request failed or returned a non-200 status
        if (responseStr.isEmpty())
            return;
        // Parse the text into a Jsoup Document and work on it
        Document document = Jsoup.parse(responseStr);
        Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
        // first() returns null when nothing matches, e.g. if the page layout has changed
        if (container == null)
            return;
        List<Element> elements = container.getElementsByAttributeValue("class", "phdnews_hdline");
        elements.forEach(element -> {
            for (Element e : element.getElementsByTag("a")) {
                System.out.println(e.attr("href"));
                System.out.println(e.text());
            }
        });
    }
}
Detailed explanation
- Create an HttpGet object that will send a GET request to the URL http://sports.sina.com.cn/, and set both the socket timeout and the connection timeout to 30000 ms.
HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
- Create a CloseableHttpClient through HttpClientBuilder and execute the request defined by the HttpGet above. execute() returns the response as a CloseableHttpResponse, while the newly created HttpClientContext carries the execution state (such as cookies). The response body is then read as text from the response entity with EntityUtils.toString, and the response and the client are closed in the finally block.
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;
String responseStr = "";
try {
    httpClient = HttpClientBuilder.create().build();
    HttpClientContext context = HttpClientContext.create();
    response = httpClient.execute(httpGet, context);
    int state = response.getStatusLine().getStatusCode();
    HttpEntity entity = response.getEntity();
    // Only keep the body of a successful (200) response
    if (state == 200 && entity != null)
        responseStr = EntityUtils.toString(entity, "utf-8");
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (response != null)
            response.close();
        if (httpClient != null)
            httpClient.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
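- Parse the response text with the Jsoup library and pick out the elements of interest
Document document = Jsoup.parse(responseStr);
Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
// first() returns null when nothing matches, e.g. if the page layout has changed
if (container != null) {
    List<Element> elements = container.getElementsByAttributeValue("class", "phdnews_hdline");
    elements.forEach(element -> {
        for (Element e : element.getElementsByTag("a")) {
            System.out.println(e.attr("href"));
            System.out.println(e.text());
        }
    });
}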
- Jsoup's Document class extends org.jsoup.nodes.Element, so Document and Element share many methods, including:
public Element getElementById(String id); // find an element by id
public Elements getElementsByClass(String className); // find elements by class name
public Elements getElementsByAttributeValue(String key, String value); // find elements by attribute value
public Elements getElementsByTag(String tagName); // find elements by tag name
public String attr(String attributeKey); // get an attribute value of this element
public String text(); // get the text content of this element
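- The parts of an HTML element are:
<div class="code"> <!-- div is the element's tag --> <!-- class="code" is the element's attribute and attribute value -->
    <div>
        <br>
        This is the first paragraph. <!-- the element's content -->
        <br>
    </div>
</div>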
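The finally block in the second step closes the response and then the client by hand. Because CloseableHttpResponse and CloseableHttpClient both implement java.io.Closeable, a try-with-resources statement could take over that cleanup. The sketch below only illustrates the closing order: CloseOrderSketch and FakeResource are hypothetical stand-ins, used so that the example runs without HttpClient on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    // Records the order in which resources are closed.
    static final List<String> closed = new ArrayList<>();

    // Hypothetical stand-in for CloseableHttpClient / CloseableHttpResponse.
    static class FakeResource implements AutoCloseable {
        private final String name;

        FakeResource(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed.add(name);
        }
    }

    static List<String> run() {
        closed.clear();
        // Resources are closed automatically, in reverse order of declaration.
        try (FakeResource client = new FakeResource("client");
             FakeResource response = new FakeResource("response")) {
            // The response body would be read here.
        }
        return new ArrayList<>(closed);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [response, client]
    }
}
```

With the real classes, the two declarations would be the client built by HttpClientBuilder.create().build() and the response returned by httpClient.execute(httpGet, context), so the response is released before the client even when an exception is thrown.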
Results
- The output is as follows
http://sports.sina.com.cn/sportsevents/3v3/2017-11-05/doc-ifynmzrs7218551.shtml
3X3黃金聯賽冠軍賽山西隊奪冠!獨享48萬
http://video.sina.com.cn/sports/k/cba/1105final3x3/
視訊
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/181467390769.html
黃金mvp集錦
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/170167390621.html
直搗黃龍1v2
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/183267390917.html
5佳球:庫裡式虛晃
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/150067390331.html
大嫂徐鼕鼕亮相
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/145367390313.html
現場眾多美女雲集
http://video.sina.com.cn/p/sports/c/zj/v/doc/2017-11-05/150867390337.html
啦啦隊熱舞表演
http://sports.sina.com.cn/nba/
哈登56分周琦暴扣火箭勝
http://sports.sina.com.cn/basketball/nba/2017-11-06/doc-ifynmzrs7300047.shtml
詹皇26分騎士負
- The scraped region of the web page is shown in the figure below:
Writing a utility class
Wrap HttpClient and Jsoup into a utility class, as follows:
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import javax.net.ssl.*;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* <pre>
* HTTP utility, containing:
* a general HTTP request tool (sends HTTP and HTTPS requests through HttpClient)
* </pre>
* Created by xuyh at 2017/7/17 19:08.
*/
public class HttpUtils {
/**
* Request timeout in milliseconds, 20000 by default
*/
private int timeout = 20000;
/**
* Cookie map (cookie name to value)
*/
private Map<String, String> cookieMap = new HashMap<>();
/**
* Request charset (used when decoding the response), UTF-8 by default
*/
private String charset = "UTF-8";
private static HttpUtils httpUtils;
private HttpUtils() {
}
/**
* Get the singleton instance
*
* @return
*/
public static HttpUtils getInstance() {
if (httpUtils == null)
httpUtils = new HttpUtils();
return httpUtils;
}
/**
* Clear cookieMap
*/
public void invalidCookieMap() {
cookieMap.clear();
}
public int getTimeout() {
return timeout;
}
/**
* Set the request timeout
*
* @param timeout
*/
public void setTimeout(int timeout) {
this.timeout = timeout;
}
public String getCharset() {
return charset;
}
/**
* Set the request charset
*
* @param charset
*/
public void setCharset(String charset) {
this.charset = charset;
}
/**
* Parse the page HTML into a Document
*
* @param html
* @return
* @throws Exception
*/
public static Document parseHtmlToDoc(String html) throws Exception {
return removeHtmlSpace(html);
}
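private static Document removeHtmlSpace(String str) {
Document doc = Jsoup.parse(str);
// Strip non-breaking-space entities only; removing every plain space would corrupt the markup
String result = doc.html().replace("&nbsp;", "");
return Jsoup.parse(result);
}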
/**
* Execute a GET request and return the result as a Document
*
* @param url
* @return
* @throws Exception
*/
public Document executeGetAsDocument(String url) throws Exception {
return parseHtmlToDoc(executeGet(url));
}
/**
* Execute a GET request
*
* @param url
* @return
* @throws Exception
*/
public String executeGet(String url) throws Exception {
HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
CloseableHttpClient httpClient = null;
String str = "";
try {
httpClient = HttpClientBuilder.create().build();
HttpClientContext context = HttpClientContext.create();
CloseableHttpResponse response = httpClient.execute(httpGet, context);
getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
int state = response.getStatusLine().getStatusCode();
if (state == 404) {
str = "";
}
try {
HttpEntity entity = response.getEntity();
if (entity != null) {
str = EntityUtils.toString(entity, charset);
}
} finally {
response.close();
}
} catch (IOException e) {
throw e;
} finally {
try {
if (httpClient != null)
httpClient.close();
} catch (IOException e) {
throw e;
}
}
return str;
}
/**
* Execute a GET request over HTTPS and return the result as a Document
*
* @param url
* @return
* @throws Exception
*/
public Document executeGetWithSSLAsDocument(String url) throws Exception {
return parseHtmlToDoc(executeGetWithSSL(url));
}
/**
* Execute a GET request over HTTPS
*
* @param url
* @return
* @throws Exception
*/
public String executeGetWithSSL(String url) throws Exception {
HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
CloseableHttpClient httpClient = null;
String str = "";
try {
httpClient = createSSLInsecureClient();
HttpClientContext context = HttpClientContext.create();
CloseableHttpResponse response = httpClient.execute(httpGet, context);
getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
int state = response.getStatusLine().getStatusCode();
if (state == 404) {
str = "";
}
try {
HttpEntity entity = response.getEntity();
if (entity != null) {
str = EntityUtils.toString(entity, charset);
}
} finally {
response.close();
}
} catch (IOException e) {
throw e;
} catch (GeneralSecurityException ex) {
throw ex;
} finally {
try {
if (httpClient != null)
httpClient.close();
} catch (IOException e) {
throw e;
}
}
return str;
}
/**
* Execute a POST request and return the result as a Document
*
* @param url
* @param params
* @return
* @throws Exception
*/
public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
return parseHtmlToDoc(executePost(url, params));
}
/**
* Execute a POST request
*
* @param url
* @param params
* @return
* @throws Exception
*/
public String executePost(String url, Map<String, String> params) throws Exception {
String reStr = "";
HttpPost httpPost = new HttpPost(url);
httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
List<NameValuePair> paramsRe = new ArrayList<>();
for (String key : params.keySet()) {
paramsRe.add(new BasicNameValuePair(key, params.get(key)));
}
// try-with-resources closes the response first, then the client
try (CloseableHttpClient httpclient = HttpClientBuilder.create().build()) {
// Encode the form with the configured charset; the single-argument constructor falls back to ISO-8859-1
httpPost.setEntity(new UrlEncodedFormEntity(paramsRe, charset));
HttpClientContext context = HttpClientContext.create();
try (CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
HttpEntity entity = response.getEntity();
if (entity != null) {
reStr = EntityUtils.toString(entity, charset);
}
}
}
return reStr;
}
/**
* Execute a POST request over HTTPS and return the result as a Document
*
* @param url
* @param params
* @return
* @throws Exception
*/
public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
return parseHtmlToDoc(executePostWithSSL(url, params));
}
/**
* Execute a POST request over HTTPS
*
* @param url
* @param params
* @return
* @throws Exception
*/
public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
String re = "";
HttpPost post = new HttpPost(url);
List<NameValuePair> paramsRe = new ArrayList<>();
for (String key : params.keySet()) {
paramsRe.add(new BasicNameValuePair(key, params.get(key)));
}
post.setHeader("Cookie", convertCookieMapToString(cookieMap));
post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
// try-with-resources closes the response first, then the client
try (CloseableHttpClient httpClientRe = createSSLInsecureClient()) {
HttpClientContext contextRe = HttpClientContext.create();
// Encode the form with the configured charset; the single-argument constructor falls back to ISO-8859-1
post.setEntity(new UrlEncodedFormEntity(paramsRe, charset));
try (CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
HttpEntity entity = response.getEntity();
if (entity != null) {
re = EntityUtils.toString(entity, charset);
}
}
getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
}
return re;
}
/**
* Send a POST request with a JSON body
*
* @param url the target URL
* @param jsonBody the JSON body
* @return
* @throws Exception
*/
public String executePostWithJson(String url, String jsonBody) throws Exception {
String reStr = "";
HttpPost httpPost = new HttpPost(url);
httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
// try-with-resources closes the response first, then the client
try (CloseableHttpClient httpclient = HttpClientBuilder.create().build()) {
httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
HttpClientContext context = HttpClientContext.create();
try (CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
HttpEntity entity = response.getEntity();
if (entity != null) {
reStr = EntityUtils.toString(entity, charset);
}
}
}
return reStr;
}
/**
* Send a POST request with a JSON body over HTTPS
*
* @param url the target URL
* @param jsonBody the JSON body
* @return
* @throws Exception
*/
public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
String re = "";
HttpPost post = new HttpPost(url);
post.setHeader("Cookie", convertCookieMapToString(cookieMap));
post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
// try-with-resources closes the response first, then the client
try (CloseableHttpClient httpClientRe = createSSLInsecureClient()) {
HttpClientContext contextRe = HttpClientContext.create();
post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
try (CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
HttpEntity entity = response.getEntity();
if (entity != null) {
re = EntityUtils.toString(entity, charset);
}
}
getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
}
return re;
}
private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
List<Cookie> cookies = cookieStore.getCookies();
for (Cookie cookie : cookies) {
cookieMap.put(cookie.getName(), cookie.getValue());
}
}
private String convertCookieMapToString(Map<String, String> map) {
String cookie = "";
for (String key : map.keySet()) {
cookie += (key + "=" + map.get(key) + "; ");
}
if (map.size() > 0) {
cookie = cookie.substring(0, cookie.length() - 2);
}
return cookie;
}
/**
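* Create an HTTPS client that trusts every certificate and every host name (insecure; suitable for testing only)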
*
* @return
* @throws GeneralSecurityException
*/
private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
try {
SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
(s, sslContextL) -> true);
return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
} catch (GeneralSecurityException e) {
throw e;
}
}
}
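convertCookieMapToString builds the Cookie header by concatenating in a loop and then trimming the trailing separator. One possible alternative, shown here as a sketch rather than as the author's method, uses Collectors.joining, which handles both the separator and the empty map without the substring step. The class name CookieHeaderSketch and the cookie values are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CookieHeaderSketch {
    // Joins name=value pairs with "; ", the separator used in a Cookie request header.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123"); // hypothetical cookie values
        cookies.put("lang", "en");
        System.out.println(toCookieHeader(cookies)); // prints JSESSIONID=abc123; lang=en
        System.out.println(toCookieHeader(new LinkedHashMap<>()).isEmpty()); // prints true
    }
}
```

An empty map yields an empty string, so a caller could use that to skip setting the Cookie header altogether.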
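The HttpUtils class above can not only fetch web page content but also send general HTTP requests (form and JSON POSTs, over HTTP or HTTPS).
Source code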
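For form posts, executePost hands the parameter map to UrlEncodedFormEntity, which sends it as an application/x-www-form-urlencoded body. The JDK-only sketch below approximates that encoding with java.net.URLEncoder to show what such a body looks like. It is not HttpClient's own implementation, and FormBodySketch, encodeForm, and the sample parameters are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBodySketch {
    // Percent-encodes one key or value; an unknown charset name becomes an unchecked error.
    static String enc(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unknown charset: " + charset, ex);
        }
    }

    // Builds key=value pairs joined by '&', as in a form-urlencoded request body.
    static String encodeForm(Map<String, String> params, String charset) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            body.add(enc(e.getKey(), charset) + "=" + enc(e.getValue(), charset));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "java crawler");    // hypothetical parameters
        params.put("lang", "\u4e2d\u6587"); // two Chinese characters, written as escapes
        System.out.println(encodeForm(params, "UTF-8"));
        // prints q=java+crawler&lang=%E4%B8%AD%E6%96%87
    }
}
```

Non-ASCII values encode differently depending on the charset, so it can be worth passing a charset explicitly to UrlEncodedFormEntity; in HttpClient 4.x its single-argument constructor falls back to ISO-8859-1.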
https://github.com/johnsonmoon/HttpUtils.git