HttpClient Knowledge Notes

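I: Introduction to HttpClient

HttpClient began as a subproject of Apache Jakarta Commons and is now maintained under the Apache HttpComponents project. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it tracks the latest HTTP versions and recommendations.

The Hypertext Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today. Web services, network-enabled appliances, and the growth of network computing continue to expand the role of HTTP beyond user-driven web browsers, while increasing the number of applications that need HTTP support. Although the java.net package provides basic functionality for accessing resources over HTTP, it does not offer the full flexibility or functionality that many applications need. HttpClient aims to fill this gap with an efficient, up-to-date, feature-rich package that implements the client side of the most recent HTTP standards and recommendations. Designed for extension while providing robust support for the base HTTP protocol, HttpClient may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that leverage or extend HTTP for distributed communication.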

HttpClient home page: http://hc.apache.org/

HttpClient downloads: http://hc.apache.org/downloads.cgi

Version 4.5 (the release line used in these notes): http://hc.apache.org/httpcomponents-client-4.5.x/

Official tutorial: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html

Maven dependency:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

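II: HttpClient usage workflow

Sending a request and receiving a response with HttpClient generally takes the following steps.

  • Create an HttpClient object.
  • Create an instance of the request method and specify the request URL: an HttpGet object for a GET request, an HttpPost object for a POST request.
  • Add request parameters if needed. For HttpGet, put them in the URL query string; for HttpPost, call setEntity(HttpEntity entity) to send them in the request body. The older setParams(HttpParams params) method shared by HttpGet and HttpPost is deprecated since 4.3 and carries protocol settings rather than request data.
  • Call execute(HttpUriRequest request) on the HttpClient object to send the request. The method returns an HttpResponse.
  • Call getAllHeaders(), getHeaders(String name), and similar methods on the HttpResponse to read the response headers; call getEntity() to obtain the HttpEntity that wraps the response body, and read the body through it.
  • Release the connection. The connection must be released whether or not the request succeeded.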
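
For a GET request, the parameters travel in the URL's query string and have to be URL-encoded. The following is a minimal JDK-only sketch of building such a URL; QueryStringSketch and withQuery are hypothetical names chosen for illustration and are not part of HttpClient. The resulting string can be passed to the HttpGet constructor.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryStringSketch {
    // Hypothetical helper: appends URL-encoded parameters to a base URL
    // that does not yet contain a query string.
    static String withQuery(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    // URLEncoder applies form encoding, so a space becomes '+'.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("draw", "1");
        params.put("start", "0");
        params.put("length", "10");
        // prints http://localhost:8080/content/page?draw=1&start=0&length=10
        System.out.println(withQuery("http://localhost:8080/content/page", params));
    }
}
```

HttpClient 4.x also ships org.apache.http.client.utils.URIBuilder for the same purpose, which is probably the better choice once the library is on the classpath.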

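III: HelloWorld program

1. Create the HelloWorld program

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloWorld2 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.java1234.com");
        // 3. Have the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Get the returned entity and read the page content from it
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println("Page content: " + content);
        // 5. Release resources: close the response, then the client
        response.close();
        httpClient.close();
    }
}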

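2. Create an HttpGet request

Add the dependencies (both pull in httpclient transitively; the code below uses only the core httpclient classes):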

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>fluent-hc</artifactId>
    <version>4.5.5</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpmime</artifactId>
    <version>4.5.5</version>
</dependency>

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        get();
    }

    private static void get() {
        // Create the HttpGet request (like typing the URL into a browser)
        HttpGet httpGet = new HttpGet("http://localhost:8080/content/page?draw=1&start=0&length=10");
        // Ask for a persistent connection
        httpGet.setHeader("Connection", "keep-alive");
        // Set the User-Agent to mimic a browser version
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie
        httpGet.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");

        // Create the client (like opening a browser) and send the request (like pressing Enter).
        // try-with-resources closes the response and the client whatever happens.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
            HttpEntity httpEntity = httpResponse.getEntity();
            // Print the response body; UTF-8 is used only if the response declares no charset
            System.out.println(EntityUtils.toString(httpEntity, "UTF-8"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

3. Create an HttpPost request

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        post();
    }

    private static void post() {
        // Create the HttpPost request
        HttpPost httpPost = new HttpPost("http://localhost:8080/content/page");
        // Ask for a persistent connection
        httpPost.setHeader("Connection", "keep-alive");
        // Set the User-Agent to mimic a browser version
        httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie
        httpPost.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");

        // Build the form parameters (key-value pairs sent in the request body)
        List<BasicNameValuePair> params = new ArrayList<BasicNameValuePair>();
        params.add(new BasicNameValuePair("draw", "1"));
        params.add(new BasicNameValuePair("start", "0"));
        params.add(new BasicNameValuePair("length", "10"));

        // try-with-resources closes the response and the client whatever happens
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Attach the parameters as a URL-encoded form body
            httpPost.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
            try (CloseableHttpResponse httpResponse = httpClient.execute(httpPost)) {
                HttpEntity httpEntity = httpResponse.getEntity();
                // Print the response body; UTF-8 is used only if the response declares no charset
                System.out.println(EntityUtils.toString(httpEntity, "UTF-8"));
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException and ClientProtocolException
            e.printStackTrace();
        }
    }
}

IV: Mimicking a browser when fetching a page

1. Set the User-Agent request header to mimic a browser (Chrome here)

httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

2. Get the response Content-Type

// Get the response Content-Type; getName() returns the header name, getValue() returns its value
entity.getContentType().getValue();
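
The value returned by getValue() usually combines a media type with an optional charset parameter, for example text/html; charset=utf-8. The following is a minimal JDK-only sketch of pulling the two apart. ContentTypeSketch and its methods are hypothetical names, not part of HttpClient, and the parsing is deliberately simplified (it does not handle spaces around the equals sign).

```java
import java.util.Locale;

final class ContentTypeSketch {
    // Returns the media type portion, e.g. "text/html" from "text/html; charset=utf-8".
    static String mediaType(String headerValue) {
        int semicolon = headerValue.indexOf(';');
        String type = semicolon < 0 ? headerValue : headerValue.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the charset parameter, or the given fallback when none is declared.
    static String charsetOrDefault(String headerValue, String fallback) {
        String[] parts = headerValue.split(";");
        // parts[0] is the media type; parameters start at index 1
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";
        System.out.println(mediaType(header));                      // prints text/html
        System.out.println(charsetOrDefault(header, "ISO-8859-1")); // prints utf-8
    }
}
```

The extracted charset can then be used when decoding the body. In real code HttpClient's own org.apache.http.entity.ContentType class is likely the better tool; the sketch only shows what the header value contains.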

3. Get the response status

response.getStatusLine().getStatusCode();

200 -- OK
403 -- Forbidden
500 -- Internal server error
404 -- Page not found
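
A crawler typically branches on the status code before reading the body. The sketch below is one way to do that with plain Java; StatusSketch, describe, and isSuccess are hypothetical names, and the reason phrases follow the standard HTTP status definitions (404 is Not Found, 400 is Bad Request).

```java
final class StatusSketch {
    // Maps a few common status codes to their standard reason phrases,
    // falling back to the status class for everything else.
    static String describe(int statusCode) {
        switch (statusCode) {
            case 200: return "OK";
            case 400: return "Bad Request";
            case 403: return "Forbidden";
            case 404: return "Not Found";
            case 500: return "Internal Server Error";
            default:  return category(statusCode);
        }
    }

    // Names the status class from the hundreds digit.
    static String category(int statusCode) {
        switch (statusCode / 100) {
            case 1:  return "Informational";
            case 2:  return "Success";
            case 3:  return "Redirection";
            case 4:  return "Client Error";
            case 5:  return "Server Error";
            default: return "Unknown";
        }
    }

    // True for 2xx responses, the usual condition for parsing the body.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    public static void main(String[] args) {
        System.out.println(describe(404));  // prints Not Found
        System.out.println(describe(302));  // prints Redirection
        System.out.println(isSuccess(200)); // prints true
    }
}
```

The integer passed in would be the value of response.getStatusLine().getStatusCode().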

4. Example

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo2 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Set the User-Agent request header to mimic a browser (Chrome here)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

        // 3. Have the client execute the GET request and print the response status
        CloseableHttpResponse response = httpClient.execute(httpGet);
        System.out.println("Status:" + response.getStatusLine().getStatusCode());

        // 4. Get the returned entity
        HttpEntity entity = response.getEntity();
        // Print the Content-Type; getName() returns the header name, getValue() returns its value
        System.out.println("Content-Type:" + entity.getContentType().getValue());
        // Read the page content
//      String content = EntityUtils.toString(entity, "utf-8");
//      System.out.println("Page content: " + content);

        // 5. Release resources: close the response, then the client
        response.close();
        httpClient.close();
    }
}

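V: Downloading an image with HttpClient

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.java1234.com/uploads/allimg/161105/1-161105150121954.jpg");
        // Set the User-Agent request header to mimic a browser (Chrome here)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

        // 3. Have the client execute the GET request
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 4. Get the returned entity
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // Print the entity's content type
            System.out.println("Content-Type:" + entity.getContentType().getValue());
            // Get the entity's input stream
            InputStream inputStream = entity.getContent();
            // Copy the stream into a new file; FileUtils comes from Apache Commons IO (commons-io)
            FileUtils.copyToFile(inputStream, new File("E://mysource/picture/aaa.jpg"));
        }

        // 5. Release resources: close the response, then the client
        response.close();
        httpClient.close();
    }
}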
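
If adding Commons IO is not desirable, the copy step can probably be done with the JDK alone. The sketch below assumes Java 8 or later; SaveStreamSketch and save are hypothetical names, and the stream passed in would be the one returned by entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class SaveStreamSketch {
    // Copies every byte from the stream into the target file, creating the
    // parent directory if needed, and returns the number of bytes written.
    static long save(InputStream in, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null && !Files.isDirectory(parent)) {
                Files.createDirectories(parent);
            }
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for an image body; a real caller would pass entity.getContent()
        byte[] fakeImage = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xD9};
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "save-stream-sketch.jpg");
        long written = save(new ByteArrayInputStream(fakeImage), target);
        System.out.println("Wrote " + written + " bytes to " + target);
    }
}
```

Files.copy does not close the input stream, so the caller still closes the response afterwards, just as in the example above.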

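VI: Using a proxy IP with HttpClient

When crawling web pages, some target sites have anti-crawler mechanisms and will block the IP address of clients that visit too frequently or in a regular pattern.

Proxy IPs come in several kinds: transparent, anonymous, distorting, and high-anonymity (elite) proxies.

1. Transparent proxy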

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP

Although a transparent proxy "hides" your IP address from the direct connection, the server can still find out who you are from HTTP_X_FORWARDED_FOR.

2. Anonymous proxy

REMOTE_ADDR = proxy IP
HTTP_VIA = proxy IP
HTTP_X_FORWARDED_FOR = proxy IP

An anonymous proxy is one step better than a transparent proxy: the server can tell that you are using a proxy but cannot tell who you are.

3. Distorting proxy

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address

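As with an anonymous proxy, the server can still tell that you are using a proxy, but it receives a fake IP address, which makes the disguise more convincing.

4. High-anonymity proxy (elite proxy)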

REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined

With a high-anonymity proxy the server cannot tell that you are using a proxy at all, which makes it the best choice.
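
The four header patterns above can be turned into a small classifier, for instance to check a proxy against a test page of your own that echoes these variables back. This is a sketch under stated assumptions: ProxyLevelSketch and classify are hypothetical names, the IP addresses are documentation-range placeholders, and the case where HTTP_VIA is present but HTTP_X_FORWARDED_FOR is absent is not covered by the lists above, so treating it as anonymous is an assumption.

```java
final class ProxyLevelSketch {
    enum Level { TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

    // clientIp:     your real IP address (known to you, not to the server)
    // remoteAddr:   REMOTE_ADDR as seen by the server (the proxy's IP)
    // via:          HTTP_VIA as seen by the server, or null if absent
    // forwardedFor: HTTP_X_FORWARDED_FOR as seen by the server, or null if absent
    static Level classify(String clientIp, String remoteAddr, String via, String forwardedFor) {
        if (via == null && forwardedFor == null) {
            return Level.ELITE;        // nothing reveals that a proxy is in use
        }
        if (forwardedFor == null) {
            return Level.ANONYMOUS;    // assumption: proxy revealed, client IP not sent
        }
        if (forwardedFor.equals(clientIp)) {
            return Level.TRANSPARENT;  // the real IP leaks through the header
        }
        if (forwardedFor.equals(remoteAddr)) {
            return Level.ANONYMOUS;    // the header repeats the proxy's own IP
        }
        return Level.DISTORTING;       // the header carries some other, fake IP
    }

    public static void main(String[] args) {
        String client = "203.0.113.7";   // placeholder client IP
        String proxy = "198.51.100.20";  // placeholder proxy IP
        System.out.println(classify(client, proxy, proxy, client));       // prints TRANSPARENT
        System.out.println(classify(client, proxy, proxy, proxy));        // prints ANONYMOUS
        System.out.println(classify(client, proxy, proxy, "192.0.2.99")); // prints DISTORTING
        System.out.println(classify(client, proxy, null, null));          // prints ELITE
    }
}
```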

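Where do proxy IPs come from? Search on Baidu and you will find plenty of proxy IP sites. Most offer some free proxies, though paying for an API is more convenient; for example http://www.66ip.cn/

5. Example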

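import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        // Set the proxy IP and port (a sample address; public proxies go stale quickly)
        HttpHost proxy = new HttpHost("42.121.15.99", 3128);
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        httpGet.setConfig(config);
        // Set the User-Agent request header to mimic a browser (Chrome here)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

        // 3. Have the client execute the GET request through the proxy
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 4. Get the returned entity and read the page content from it
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println("Page content: " + content);

        // 5. Release resources: close the response, then the client
        response.close();
        httpClient.close();
    }
}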

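VII: HttpClient connect timeout and read timeout

When HttpClient executes an HTTP request, two separate durations are involved: the connection time and the read time.

Connection time is the time from when HttpClient starts sending the request until it has connected to the target host.

Read time is the time spent receiving response data after HttpClient has connected to the target server. The read (socket) timeout bounds how long HttpClient waits for data to arrive, not the total download time.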
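
The difference between the two timeouts can be seen at the plain socket level, without any external site. The sketch below is JDK-only and assumes loopback networking is available; TimeoutSketch and readTimesOut are hypothetical names. It connects to a local server socket that never sends anything, so the connection succeeds but the read runs into the read timeout (SO_TIMEOUT).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class TimeoutSketch {
    // Connects to a local server that never sends data and reports
    // whether waiting for the first byte hit the read timeout.
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Connect timeout: upper bound on establishing the TCP connection
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()),
                    connectTimeoutMs);
            // Read timeout (SO_TIMEOUT): upper bound on waiting for data on an open connection
            client.setSoTimeout(readTimeoutMs);
            client.getInputStream().read();
            return false; // data or end-of-stream arrived before the timeout
        } catch (SocketTimeoutException e) {
            return true;  // no data arrived within the read timeout
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Read timed out: " + readTimesOut(1000, 200));
    }
}
```

In HttpClient 4.x, setConnectTimeout and setSocketTimeout on RequestConfig play the roles of the two values passed here.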

The overseas Maven repository used as the test target: http://central.maven.org/maven2/

Example:

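import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2. Create the HttpGet instance (the request)
        HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");
        // Set the connect timeout and the read timeout
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)    // connect timeout in milliseconds
                .setSocketTimeout(1000)     // read (socket) timeout in milliseconds
                .build();
        httpGet.setConfig(config);
        // Set the User-Agent request header to mimic a browser (Chrome here)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

        // 3. Have the client execute the GET request; a timeout surfaces as an IOException
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 4. Get the returned entity and read the page content from it
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println("Page content: " + content);

        // 5. Release resources: close the response, then the client
        response.close();
        httpClient.close();
    }
}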
