1. 程式人生 > >Java爬蟲學習:利用HttpClient和Jsoup庫實現簡單的Java爬蟲程式

Java爬蟲學習:利用HttpClient和Jsoup庫實現簡單的Java爬蟲程式

利用HttpClient和Jsoup庫實現簡單的Java爬蟲程式

HttpClient簡介

HttpClient是Apache Jakarta Common下的子專案,可以用來提供高效的、最新的、功能豐富的支援HTTP協議的客戶端程式設計工具包,並且它支援 HTTP 協議最新的版本。它的主要功能有:

  • (1) 實現了所有 HTTP 的方法(GET,POST,PUT,HEAD 等)
  • (2) 支援自動轉向
  • (3) 支援 HTTPS 協議
  • (4) 支援代理伺服器等

Jsoup簡介

jsoup是一款Java的HTML解析器,可直接解析某個URL地址、HTML文字內容。它提供了一套非常省力的API,可通過DOM,CSS以及類似於jQuery的操作方法來取出和操作資料。它的主要功能有:
- (1) 從一個URL,檔案或字串中解析HTML;
- (2) 使用DOM或CSS選擇器來查詢、取出資料;
- (3) 可操作HTML元素、屬性、文字;

使用步驟

maven專案新增依賴

pom.xml檔案依賴如下:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId
>
jsoup</artifactId> <version>1.8.3</version> </dependency>

編寫Junit測試程式碼

程式碼


import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import
org.apache.http.client.protocol.HttpClientContext; import org.apache.http.impl.client.CloseableHttpClient; import org.apache.http.impl.client.HttpClientBuilder; import org.apache.http.util.EntityUtils; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.junit.Test; import java.util.List; /** * HttpClient & Jsoup libruary test class * * Created by xuyh at 2017/11/6 15:28. */ public class HttpClientJsoupTest { @Test public void test() { //通過httpClient獲取網頁響應,將返回的響應解析為純文字 HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/"); httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build()); CloseableHttpClient httpClient = null; CloseableHttpResponse response = null; String responseStr = ""; try { httpClient = HttpClientBuilder.create().build(); HttpClientContext context = HttpClientContext.create(); response = httpClient.execute(httpGet, context); int state = response.getStatusLine().getStatusCode(); if (state != 200) responseStr = ""; HttpEntity entity = response.getEntity(); if (entity != null) responseStr = EntityUtils.toString(entity, "utf-8"); } catch (Exception e) { e.printStackTrace(); } finally { try { if (response != null) response.close(); if (httpClient != null) httpClient.close(); } catch (Exception ex) { ex.printStackTrace(); } } if (responseStr == null) return; //將解析到的純文字用Jsoup工具轉換成Document文件並進行操作 Document document = Jsoup.parse(responseStr); List<Element> elements = document.getElementsByAttributeValue("class", "phdnews_txt fr").first() .getElementsByAttributeValue("class", "phdnews_hdline"); elements.forEach(element -> { for (Element e : element.getElementsByTag("a")) { System.out.println(e.attr("href")); System.out.println(e.text()); } }); } }

詳解

  • 新建HttpGet物件,物件將從 http://sports.sina.com.cn/ 這個URL地址獲取GET響應。並設定socket超時時間和連線超時時間分別為30000ms。
HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
  • 通過HttpClientBuilder新建一個CloseableHttpClient物件,並執行上面的HttpGet規定的請求,將響應放在新建的HttpClientContext物件中。最後從HttpClientContext物件中獲取響應的文字格式。
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;

String responseStr = "";
try {
    httpClient = HttpClientBuilder.create().build();
    HttpClientContext context = HttpClientContext.create();

    response = httpClient.execute(httpGet, context);

    int state = response.getStatusLine().getStatusCode();
    if (state != 200)
        responseStr = "";


    HttpEntity entity = response.getEntity();
    if (entity != null)
        responseStr = EntityUtils.toString(entity, "utf-8");


} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (response != null)
            response.close();
        if (httpClient != null)
            httpClient.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
  • 將響應的文字用Jsoup庫解析,得到其中的各個元素
Document document = Jsoup.parse(responseStr);

List<Element> elements = document.getElementsByAttributeValue("class", "phdnews_txt fr").first()
        .getElementsByAttributeValue("class", "phdnews_hdline");

elements.forEach(element -> {
    for (Element e : element.getElementsByTag("a")) {
        System.out.println(e.attr("href"));
        System.out.println(e.text());
    }
});
  • Jsoup的Document物件繼承自org.jsoup.nodes.Element類和Element均有的部分方法:
public Element getElementById(String id);//通過id獲取元素
public Elements getElementsByClass(String className);//通過className獲取元素
public Elements getElementsByAttributeValue(String key, String value);//通過屬性值獲取元素
public Elements getElementsByTag(String tagName);//通過標籤名獲取元素
public String attr(String attributeKey);//獲取本元素的屬性值
public String text();//獲取本元素的內容
  • 其中HTML規定的元素格式為:
<div class="code">  <!--div 是元素的標籤--> <!--class="code" 是元素的屬性和屬性值-->
    <div>
        <br>
            這是第一個段落。    <!--元素的內容-->
        <br>
    </div>
</div>

執行結果

  • 執行結果如下所示
http://sports.sina.com.cn/sportsevents/3v3/2017-11-05/doc-ifynmzrs7218551.shtml
3X3黃金聯賽冠軍賽山西隊奪冠!獨享48http://video.sina.com.cn/sports/k/cba/1105final3x3/
視訊
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/181467390769.html
黃金mvp集錦
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/170167390621.html
直搗黃龍1v2
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/183267390917.html
5佳球:庫裡式虛晃
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/150067390331.html
大嫂徐鼕鼕亮相
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/145367390313.html
現場眾多美女雲集
http://video.sina.com.cn/p/sports/c/zj/v/doc/2017-11-05/150867390337.html
啦啦隊熱舞表演
http://sports.sina.com.cn/nba/
哈登56分周琦暴扣火箭勝
http://sports.sina.com.cn/basketball/nba/2017-11-06/doc-ifynmzrs7300047.shtml
詹皇26分騎士負
  • 爬取的網頁內容區域為下圖所示:

這裡寫圖片描述

編寫工具類

將HttpClient和Jsoup進行封裝,形成一個工具類,內容如下:


import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import javax.net.ssl.*;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * <pre>
 * Http工具,包含:
 * 普通http請求工具(使用httpClient進行http,https請求的傳送)
 * </pre>
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {
    /**
     * 請求超時時間,預設20000ms
     */
    private int timeout = 20000;
    /**
     * cookie表
     */
    private Map<String, String> cookieMap = new HashMap<>();

    /**
     * 請求編碼(處理返回結果),預設UTF-8
     */
    private String charset = "UTF-8";

    private static HttpUtils httpUtils;

    private HttpUtils() {
    }

    /**
     * 獲取例項
     *
     * @return
     */
    public static HttpUtils getInstance() {
        if (httpUtils == null)
            httpUtils = new HttpUtils();
        return httpUtils;
    }

    /**
     * 清空cookieMap
     */
    public void invalidCookieMap() {
        cookieMap.clear();
    }

    public int getTimeout() {
        return timeout;
    }

    /**
     * 設定請求超時時間
     *
     * @param timeout
     */
    public void setTimeout(int timeout) {
        this.timeout = timeout;
    }

    public String getCharset() {
        return charset;
    }

    /**
     * 設定請求字元編碼集
     *
     * @param charset
     */
    public void setCharset(String charset) {
        this.charset = charset;
    }

    /**
     * 將網頁返回為解析後的文件格式
     * 
     * @param html
     * @return
     * @throws Exception
     */
    public static Document parseHtmlToDoc(String html) throws Exception {
        return removeHtmlSpace(html);
    }

    private static Document removeHtmlSpace(String str) {
        Document doc = Jsoup.parse(str);
        String result = doc.html().replace("&nbsp;", "");
        return Jsoup.parse(result);
    }

    /**
     * 執行get請求,返回doc
     *
     * @param url
     * @return
     * @throws Exception
     */
    public Document executeGetAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGet(url));
    }

    /**
     * 執行get請求
     *
     * @param url
     * @return
     * @throws Exception
     */
    public String executeGet(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        CloseableHttpClient httpClient = null;
        String str = "";
        try {
            httpClient = HttpClientBuilder.create().build();
            HttpClientContext context = HttpClientContext.create();
            CloseableHttpResponse response = httpClient.execute(httpGet, context);
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            if (state == 404) {
                str = "";
            }
            try {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    str = EntityUtils.toString(entity, charset);
                }
            } finally {
                response.close();
            }
        } catch (IOException e) {
            throw e;
        } finally {
            try {
                if (httpClient != null)
                    httpClient.close();
            } catch (IOException e) {
                throw e;
            }
        }
        return str;
    }

    /**
     * 用https執行get請求,返回doc
     *
     * @param url
     * @return
     * @throws Exception
     */
    public Document executeGetWithSSLAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGetWithSSL(url));
    }

    /**
     * 用https執行get請求
     *
     * @param url
     * @return
     * @throws Exception
     */
    public String executeGetWithSSL(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        CloseableHttpClient httpClient = null;
        String str = "";
        try {
            httpClient = createSSLInsecureClient();
            HttpClientContext context = HttpClientContext.create();
            CloseableHttpResponse response = httpClient.execute(httpGet, context);
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            if (state == 404) {
                str = "";
            }
            try {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    str = EntityUtils.toString(entity, charset);
                }
            } finally {
                response.close();
            }
        } catch (IOException e) {
            throw e;
        } catch (GeneralSecurityException ex) {
            throw ex;
        } finally {
            try {
                if (httpClient != null)
                    httpClient.close();
            } catch (IOException e) {
                throw e;
            }
        }
        return str;
    }

    /**
     * 執行post請求,返回doc
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePost(url, params));
    }

    /**
     * 執行post請求
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public String executePost(String url, Map<String, String> params) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (String key : params.keySet()) {
            paramsRe.add(new BasicNameValuePair(key, params.get(key)));
        }
        CloseableHttpClient httpclient = HttpClientBuilder.create().build();
        CloseableHttpResponse response;
        try {
            httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
            HttpClientContext context = HttpClientContext.create();
            response = httpclient.execute(httpPost, context);
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            reStr = EntityUtils.toString(entity, charset);
        } catch (IOException e) {
            throw e;
        } finally {
            httpPost.releaseConnection();
        }
        return reStr;
    }

    /**
     * 用https執行post請求,返回doc
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePostWithSSL(url, params));
    }

    /**
     * 用https執行post請求
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (String key : params.keySet()) {
            paramsRe.add(new BasicNameValuePair(key, params.get(key)));
        }
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        CloseableHttpResponse response;
        try {
            CloseableHttpClient httpClientRe = createSSLInsecureClient();
            HttpClientContext contextRe = HttpClientContext.create();
            post.setEntity(new UrlEncodedFormEntity(paramsRe));
            response = httpClientRe.execute(post, contextRe);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        } catch (Exception e) {
            throw e;
        }
        return re;
    }

    /**
     * 傳送JSON格式body的POST請求
     *
     * @param url 地址
     * @param jsonBody json body
     * @return
     * @throws Exception
     */
    public String executePostWithJson(String url, String jsonBody) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        CloseableHttpClient httpclient = HttpClientBuilder.create().build();
        CloseableHttpResponse response;
        try {
            httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
            HttpClientContext context = HttpClientContext.create();
            response = httpclient.execute(httpPost, context);
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            reStr = EntityUtils.toString(entity, charset);
        } catch (IOException e) {
            throw e;
        } finally {
            httpPost.releaseConnection();
        }
        return reStr;
    }

    /**
     * 傳送JSON格式body的SSL POST請求
     *
     * @param url 地址
     * @param jsonBody json body
     * @return
     * @throws Exception
     */
    public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        CloseableHttpResponse response;
        try {
            CloseableHttpClient httpClientRe = createSSLInsecureClient();
            HttpClientContext contextRe = HttpClientContext.create();
            post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
            response = httpClientRe.execute(post, contextRe);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        } catch (Exception e) {
            throw e;
        }
        return re;
    }

    private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie cookie : cookies) {
            cookieMap.put(cookie.getName(), cookie.getValue());
        }
    }

    private String convertCookieMapToString(Map<String, String> map) {
        String cookie = "";
        for (String key : map.keySet()) {
            cookie += (key + "=" + map.get(key) + "; ");
        }
        if (map.size() > 0) {
            cookie = cookie.substring(0, cookie.length() - 2);
        }
        return cookie;
    }

    /**
     * 建立 SSL連線
     *
     * @return
     * @throws GeneralSecurityException
     */
    private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
            SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
                    (s, sslContextL) -> true);
            return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
        } catch (GeneralSecurityException e) {
            throw e;
        }
    }
}

上面的工具類不僅可以進行網頁內容的獲取,還能夠進行http請求的傳送。

原始碼地址

https://github.com/johnsonmoon/HttpUtils.git