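HttpClient Knowledge Notes
I: Introduction to HttpClient
HttpClient started as a subproject of Apache Jakarta Commons and is now part of the Apache HttpComponents project. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it tracks the latest HTTP versions and recommendations.
The Hypertext Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today. Web services, network-enabled appliances, and the growth of network computing continue to expand the role of HTTP beyond user-driven web browsers, while increasing the number of applications that need HTTP support. Although the java.net package provides basic functionality for accessing resources over HTTP, it does not offer the full flexibility or functionality many applications need. HttpClient aims to fill this gap with an efficient, up-to-date, feature-rich package that implements the client side of the most recent HTTP standards and recommendations. HttpClient is designed for extension while providing robust support for the base HTTP protocol, so it may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that use or extend HTTP for distributed communication.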
HttpClient home page: http://hc.apache.org/
HttpClient downloads: http://hc.apache.org/downloads.cgi
Version 4.5 (the latest at the time of writing): http://hc.apache.org/httpcomponents-client-4.5.x/
Official tutorial: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html
Maven dependency:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
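II: HttpClient usage workflow
Sending a request and receiving a response with HttpClient is straightforward and generally takes the following steps.
- Create an HttpClient object.
- Create an instance of the request method and specify the request URL: an HttpGet object for a GET request, an HttpPost object for a POST request.
- Add request parameters if needed. For a GET request, put them in the query string of the URL (for example with URIBuilder); for an HttpPost object, call setEntity(HttpEntity entity). The setParams(HttpParams params) method shared by HttpGet and HttpPost, which older tutorials mention here, carries client configuration rather than query or form parameters and is deprecated since HttpClient 4.3.
- Call execute(HttpUriRequest request) on the HttpClient object to send the request; it returns an HttpResponse (a CloseableHttpResponse when the client is a CloseableHttpClient).
- Call getAllHeaders(), getHeaders(String name), and similar methods on the HttpResponse to read the response headers; call getEntity() to obtain the HttpEntity object that wraps the response body, and read the body through that object.
- Release the connection. The connection must be released whether or not the request succeeded.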
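The request-parameter step above can be illustrated without any HttpClient classes. The following is a minimal sketch, assuming a hypothetical helper named QueryStringSketch (not part of HttpClient), of how GET parameters end up percent-encoded in the URL; the resulting string is what would be passed to the HttpGet constructor.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not an HttpClient API.
final class QueryStringSketch {

    // Appends the parameters to baseUrl as a query string, encoding names and values as UTF-8.
    static String withQuery(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&'; // every pair after the first is joined with '&'
        }
        return sb.toString();
    }

    // URLEncoder applies form encoding: a space becomes '+', '&' becomes %26.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("draw", "1");
        params.put("start", "0");
        params.put("length", "10");
        System.out.println(withQuery("http://localhost:8080/content/page", params));
    }
}
```

Running main prints http://localhost:8080/content/page?draw=1&start=0&length=10. The sketch assumes baseUrl has no query string of its own.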
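III: HelloWorld program
1. Create the HelloWorld program
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HelloWorld2 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet instance (the request)
            HttpGet httpGet = new HttpGet("http://www.java1234.com");
            // 3. Execute the GET request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the returned entity and read the page content as UTF-8
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println("Page content: " + content);
            }
        }
    }
}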
2. Create an HttpGet request
Add the dependencies (both artifacts pull in httpclient transitively, which is all the example below uses):
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>fluent-hc</artifactId>
    <version>4.5.5</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpmime</artifactId>
    <version>4.5.5</version>
</dependency>
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        get();
    }

    private static void get() {
        // Create the HttpGet request (like typing a URL into the browser)
        HttpGet httpGet = new HttpGet("http://localhost:8080/content/page?draw=1&start=0&length=10");
        // Ask for a persistent connection
        httpGet.setHeader("Connection", "keep-alive");
        // Set the User-Agent to mimic a browser version
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie header
        httpGet.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");
        // try-with-resources closes the response and the client whether or not the request succeeds
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // the client (like opening a browser)
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) { // send the request (like pressing Enter)
            HttpEntity httpEntity = httpResponse.getEntity();
            // Print the response body
            System.out.println(EntityUtils.toString(httpEntity));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
3. Create an HttpPost request
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class MyTest {
    public static void main(String[] args) {
        post();
    }

    private static void post() {
        // Create the HttpPost request
        HttpPost httpPost = new HttpPost("http://localhost:8080/content/page");
        // Ask for a persistent connection
        httpPost.setHeader("Connection", "keep-alive");
        // Set the User-Agent to mimic a browser version
        httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");
        // Set the Cookie header
        httpPost.setHeader("Cookie", "UM_distinctid=16442706a09352-0376059833914f-3c604504-1fa400-16442706a0b345; CNZZDATA1262458286=1603637673-1530123020-%7C1530123020; JSESSIONID=805587506F1594AE02DC45845A7216A4");
        // Build the POST parameters as key-value pairs
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        params.add(new BasicNameValuePair("draw", "1"));
        params.add(new BasicNameValuePair("start", "0"));
        params.add(new BasicNameValuePair("length", "10"));
        // try-with-resources closes the client whether or not the request succeeds
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Attach the parameters as a form-encoded request body
            httpPost.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
            try (CloseableHttpResponse httpResponse = httpClient.execute(httpPost)) {
                HttpEntity httpEntity = httpResponse.getEntity();
                // Print the response body
                System.out.println(EntityUtils.toString(httpEntity));
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException and ClientProtocolException, both IOException subclasses
            e.printStackTrace();
        }
    }
}
IV: Fetching web pages while mimicking a browser
1. Set the User-Agent request header to mimic a browser (Chrome in this case)
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
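2. Get the response content type (Content-Type)
// Content-Type header of the response: getName() returns the key, getValue() returns the value.
// getContentType() returns null when the server sent no Content-Type header.
entity.getContentType().getValue();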
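The value returned by getValue() usually combines a media type with an optional charset parameter, for example "text/html; charset=UTF-8". As a rough sketch of how that string could be taken apart (ContentTypeSketch is a hypothetical helper written for these notes, and it assumes a non-null header value):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; it is not an HttpClient API.
final class ContentTypeSketch {

    // Returns the media type without parameters, lower-cased, e.g. "text/html".
    static String mediaType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String type = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the charset named by the "charset" parameter,
    // or the fallback when the parameter is missing or names an unknown charset.
    static Charset charset(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) { // case-insensitive prefix match
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) { // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=UTF-8";
        System.out.println(mediaType(header));
        System.out.println(charset(header, StandardCharsets.ISO_8859_1));
    }
}
```

The detected charset could then be passed to EntityUtils.toString(entity, charset) instead of a hard-coded "utf-8".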
3. Get the response status
response.getStatusLine().getStatusCode();
200 -- OK
403 -- forbidden
404 -- page not found
500 -- internal server error
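A crawler typically branches on the status code before reading the body. The sketch below shows one way to group the codes listed above; StatusSketch and its category names are assumptions made for illustration, not part of HttpClient.

```java
// Hypothetical helper for illustration only; it is not an HttpClient API.
final class StatusSketch {

    // Maps an HTTP status code to a coarse category, singling out 403 and 404.
    static String describe(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) return "success";
        if (statusCode >= 300 && statusCode < 400) return "redirect";
        if (statusCode == 403) return "forbidden";
        if (statusCode == 404) return "not found";
        if (statusCode >= 400 && statusCode < 500) return "client error";
        if (statusCode >= 500 && statusCode < 600) return "server error";
        return "other";
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 403, 404, 500}) {
            System.out.println(code + " -- " + describe(code));
        }
    }
}
```

In a real program the argument would come from response.getStatusLine().getStatusCode().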
4. Example
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo2 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet instance (the request)
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Set the User-Agent request header to mimic a browser (Chrome in this case)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
            // 3. Execute the GET request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Print the response status
                System.out.println("Status:" + response.getStatusLine().getStatusCode());
                // 4. Get the returned entity
                HttpEntity entity = response.getEntity();
                // Print the Content-Type; getName() returns the key, getValue() returns the value
                System.out.println("Content-Type:" + entity.getContentType().getValue());
                // Uncomment to print the page content
                // String content = EntityUtils.toString(entity, "utf-8");
                // System.out.println("Page content: " + content);
            }
        }
    }
}
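V: Fetching an image with HttpClient
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.FileUtils; // from Apache Commons IO, a separate dependency
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet instance (the request)
            HttpGet httpGet = new HttpGet("http://www.java1234.com/uploads/allimg/161105/1-161105150121954.jpg");
            // Set the User-Agent request header to mimic a browser (Chrome in this case)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
            // 3. Execute the GET request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the returned entity
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // Print the content type of the entity
                    System.out.println("Content-Type:" + entity.getContentType().getValue());
                    // Get the input stream of the entity
                    InputStream inputStream = entity.getContent();
                    // Copy the input stream into a newly created file
                    FileUtils.copyToFile(inputStream, new File("E:/mysource/picture/aaa.jpg"));
                }
            }
        }
    }
}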
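FileUtils in the example above comes from Apache Commons IO. If that library is not on the classpath, the same copy step can be done with the JDK alone. The following is a sketch under that assumption; SaveStreamSketch is a hypothetical helper, and in the image example the stream passed in would be entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; it is not an HttpClient or Commons IO API.
final class SaveStreamSketch {

    // Copies the stream into target, creating parent directories and
    // replacing an existing file. Returns the number of bytes written.
    static long save(InputStream in, Path target) {
        try {
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in bytes instead of a real HTTP response body
        InputStream fakeBody = new ByteArrayInputStream(new byte[] {1, 2, 3});
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "save-stream-sketch", "aaa.jpg");
        System.out.println(save(fakeBody, target) + " bytes written to " + target);
    }
}
```

Like FileUtils.copyToFile, this does not close the input stream; closing the response takes care of that.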
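VI: Using a proxy IP with HttpClient
When crawling web pages, some target sites have anti-crawler mechanisms and take measures to block the IP of clients that access the site too frequently or in a regular pattern.
Proxy IPs come in several kinds: transparent, anonymous, distorting, and high-anonymity (elite) proxies. In the listings below, REMOTE_ADDR is the peer address the target server sees, and HTTP_VIA and HTTP_X_FORWARDED_FOR are the Via and X-Forwarded-For request headers.
1. Transparent proxy (Transparent Proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
A transparent proxy does "hide" your IP address at the connection level, but the server can still find out who you are from HTTP_X_FORWARDED_FOR.
2. Anonymous proxy (Anonymous Proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy is a step up from a transparent one: others can tell that you are using a proxy, but cannot tell who you are.
3. Distorting proxy (Distorting Proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
As with an anonymous proxy, others can still tell that you are using a proxy, but they are given a fake IP address, which makes the disguise more convincing.
4. High-anonymity proxy (Elite Proxy or High Anonymity Proxy)
REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined
As the values show, a high-anonymity proxy gives the server no header-based indication that a proxy is in use, so it is the best choice.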
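The four header patterns above can be turned into a small check. The sketch below assumes you control an echo endpoint that reports back the REMOTE_ADDR, Via, and X-Forwarded-For values it received, and that you know your own real IP. ProxyTypeSketch and its rules are an illustration of the table above, not a standard algorithm; in particular, treating "Via present but X-Forwarded-For absent" as anonymous is an assumption.

```java
// Hypothetical helper for illustration only.
final class ProxyTypeSketch {

    enum ProxyType { TRANSPARENT, ANONYMOUS, DISTORTING, ELITE }

    // realIp:       the client's real address, known to whoever runs the test
    // remoteAddr:   the peer address the server saw (the proxy IP)
    // via:          value of the Via header, or null when absent
    // forwardedFor: value of the X-Forwarded-For header, or null when absent
    static ProxyType classify(String realIp, String remoteAddr, String via, String forwardedFor) {
        if (via == null && forwardedFor == null) {
            return ProxyType.ELITE;        // nothing in the headers reveals a proxy
        }
        if (realIp.equals(forwardedFor)) {
            return ProxyType.TRANSPARENT;  // the real IP is leaked
        }
        if (forwardedFor == null || forwardedFor.equals(remoteAddr)) {
            return ProxyType.ANONYMOUS;    // proxy use is visible, the real IP is not
        }
        return ProxyType.DISTORTING;       // a different, fake IP is reported
    }

    public static void main(String[] args) {
        // Addresses from the documentation ranges, used as placeholders
        System.out.println(classify("198.51.100.7", "203.0.113.5", "203.0.113.5", "198.51.100.7"));
        System.out.println(classify("198.51.100.7", "203.0.113.5", null, null));
    }
}
```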
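Where do proxy IPs come from? A quick search on Baidu turns up plenty of proxy-list sites. Most of them offer some free proxies, but paying for an API is more convenient; for example http://www.66ip.cn/
5. Example
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet instance (the request)
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Set the proxy IP and port (sample address; replace it with a proxy that is currently reachable)
            HttpHost proxy = new HttpHost("42.121.15.99", 3128);
            RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
            httpGet.setConfig(config);
            // Set the User-Agent request header to mimic a browser (Chrome in this case)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
            // 3. Execute the GET request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the returned entity and read the page content as UTF-8
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println("Page content: " + content);
            }
        }
    }
}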
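VII: HttpClient connect timeout and read timeout
When HttpClient executes an HTTP request, two phases take time: establishing the connection and reading the content.
Connect time is the time from when HttpClient starts sending the request until a connection to the host of the target URL is established.
Read time is the time spent fetching the content once HttpClient is connected to the target server. The read (socket) timeout bounds the wait for each piece of data, that is, the maximum period of inactivity between two consecutive data packets, rather than the duration of the whole download.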
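The difference between the two timeouts can be observed with plain JDK sockets, independent of HttpClient. In this sketch (TimeoutSketch is a hypothetical class; the silent local server is an assumption made so the example needs no external host), the connection succeeds quickly but no data ever arrives, so only the read timeout fires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demonstration class; it uses only java.net, not HttpClient.
final class TimeoutSketch {

    // Connects to a local server that never sends anything and
    // returns true when the read gives up with a SocketTimeoutException.
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback); // port 0: any free port
             Socket client = new Socket()) {
            // Connect timeout: upper bound on establishing the TCP connection
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()), connectTimeoutMs);
            // Read timeout: upper bound on waiting for data once connected
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read(); // blocks, because the server stays silent
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(1000, 200));
    }
}
```

With HttpClient, the two values correspond to setConnectTimeout and setSocketTimeout on RequestConfig.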
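An overseas Maven repository address, used as the test target in the example: http://central.maven.org/maven2/
Example:
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo1 {
    // A timeout surfaces as an IOException, e.g. java.net.SocketTimeoutException for the read timeout
    public static void main(String[] args) throws IOException {
        // 1. Create the HttpClient instance; try-with-resources closes it even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // 2. Create the HttpGet instance (the request)
            HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/");
            // Set the connect timeout and the read timeout
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(1000) // connect timeout in milliseconds
                    .setSocketTimeout(1000)  // read (socket) timeout in milliseconds
                    .build();
            httpGet.setConfig(config);
            // Set the User-Agent request header to mimic a browser (Chrome in this case)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
            // 3. Execute the GET request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // 4. Get the returned entity and read the page content as UTF-8
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println("Page content: " + content);
            }
        }
    }
}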