Setting HTTP Request Headers with HttpClient
Monitoring the HTTP request headers of a POST with Firebug gives the following:
- Accept application/json, text/javascript, */*
- Accept-Encoding gzip, deflate
- Accept-Language en-us,en;q=0.5
- Cache-Control no-cache
- Content-Length 432
- Content-Type application/x-www-form-urlencoded; charset=UTF-8
- Host www.huaxixiang.com
- Pragma no-cache
- Proxy-Connection keep-alive
- Referer http://www.huaxixiang.com/CrossPriceDetail.action
- User-Agent Mozilla/5.0
- X-Requested-With XMLHttpRequest
When HttpClient loads the HTML at a URL, it helps to imitate a browser so that the site serves its content normally and does not restrict the request.
To show what the server receives, I added a small test project, LoginTest, with one new page, index.jsp, whose main code prints the HTTP header information of the incoming request.
1. index.jsp
Java code
- out.println("Protocol: " + request.getProtocol());
- out.println("Scheme: " + request.getScheme());
- out.println("Server Name: " + request.getServerName() );
- out.println("Server Port: " + request.getServerPort());
- out.println("Protocol: " + request.getProtocol());
- out.println("Server Info: " + getServletConfig().getServletContext().getServerInfo());
- out.println("Remote Addr: " + request.getRemoteAddr());
- out.println("Remote Host: " + request.getRemoteHost());
- out.println("Character Encoding: " + request.getCharacterEncoding());
- out.println("Content Length: " + request.getContentLength());
- out.println("Content Type: "+ request.getContentType());
- out.println("Auth Type: " + request.getAuthType());
- out.println("HTTP Method: " + request.getMethod());
- out.println("Path Info: " + request.getPathInfo());
- out.println("Path Trans: " + request.getPathTranslated());
- out.println("Query String: " + request.getQueryString());
- out.println("Remote User: " + request.getRemoteUser());
- out.println("Session Id: " + request.getRequestedSessionId());
- out.println("Request URI: " + request.getRequestURI());
- out.println("Servlet Path: " + request.getServletPath());
- out.println("Accept: " + request.getHeader("Accept"));
- out.println("Host: " + request.getHeader("Host"));
- out.println("Referer : " + request.getHeader("Referer"));
- out.println("Accept-Language : " + request.getHeader("Accept-Language"));
- out.println("Accept-Encoding : " + request.getHeader("Accept-Encoding"));
- out.println("User-Agent : " + request.getHeader("User-Agent"));
- out.println("Connection : " + request.getHeader("Connection"));
- out.println("Cookie : " + request.getHeader("Cookie"));
- out.println("Created : " + session.getCreationTime());
- out.println("LastAccessed : " + session.getLastAccessedTime());
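For readers without a Tomcat instance at hand, the echo page above can be approximated with the HTTP server bundled in the JDK. This is only a sketch under assumptions: the class name HeaderEchoServer and the helpers start and fetch are made up for illustration, it uses com.sun.net.httpserver and HttpURLConnection rather than a servlet container and Apache HttpClient, and it prints only the request method, URI, and the header names from the listing.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for index.jsp: echoes selected request headers as text.
class HeaderEchoServer {
    // Header names taken from the index.jsp listing.
    static final String[] NAMES = {"Accept", "Host", "Referer", "Accept-Language",
            "Accept-Encoding", "User-Agent", "Connection", "Cookie"};

    /** Starts the echo server on a free local port (port 0 lets the OS choose). */
    static HttpServer start() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", new HttpHandler() {
                @Override
                public void handle(HttpExchange exchange) throws IOException {
                    StringBuilder body = new StringBuilder();
                    body.append("HTTP Method: ").append(exchange.getRequestMethod()).append('\n');
                    body.append("Request URI: ").append(exchange.getRequestURI()).append('\n');
                    for (String name : NAMES) {
                        // getFirst returns null for an absent header, like request.getHeader
                        body.append(name).append(" : ")
                            .append(exchange.getRequestHeaders().getFirst(name)).append('\n');
                    }
                    byte[] bytes = body.toString().getBytes(StandardCharsets.UTF_8);
                    exchange.sendResponseHeaders(200, bytes.length);
                    try (OutputStream out = exchange.getResponseBody()) {
                        out.write(bytes);
                    }
                }
            });
            server.start();
            return server;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Issues a GET and returns the response body; userAgent may be null to keep the default. */
    static String fetch(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            if (userAgent != null) {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        HttpServer server = start();
        try {
            int port = server.getAddress().getPort();
            System.out.print(fetch("http://127.0.0.1:" + port + "/LoginTest/index.jsp", "Mozilla/5.0"));
        } finally {
            server.stop(0);
        }
    }
}
```

Running main prints what the server saw for one request, which makes it easy to compare clients the same way the steps below compare IE and HttpClient.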
2. Loading http://127.0.0.1:8080/LoginTest/index.jsp in the IE browser returns the following:
Output
- Protocol: HTTP/1.1
- Scheme: http
- Server Name: 127.0.0.1
- Server Port: 8080
- Protocol: HTTP/1.1
- Server Info: Apache Tomcat/6.0.18
- Remote Addr: 127.0.0.1
- Remote Host: 127.0.0.1
- Character Encoding: null
- Content Length: -1
- Content Type: null
- Auth Type: null
- HTTP Method: GET
- Path Info: null
- Path Trans: null
- Query String: null
- Remote User: null
- Session Id: E2C384C095E34AD355684EB554517FB1
- Request URI: /LoginTest/index.jsp
- Servlet Path: /index.jsp
- Accept: */*
- Host: 127.0.0.1:8080
- Referer : null
- Accept-Language : en-us
- Accept-Encoding : gzip, deflate
- User-Agent : Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.3; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E)
- Connection : Keep-Alive
- Cookie : JSESSIONID=E2C384C095E34AD355684EB554517FB1
- Created : 1322294859981
- LastAccessed : 1322294859981
3. Next, loading http://127.0.0.1:8080/LoginTest/index.jsp with HttpClient, without setting any header information, returns the following:
Output
- Protocol: HTTP/1.1
- Scheme: http
- Server Name: 127.0.0.1
- Server Port: 8080
- Protocol: HTTP/1.1
- Server Info: Apache Tomcat/6.0.18
- Remote Addr: 127.0.0.1
- Remote Host: 127.0.0.1
- Character Encoding: null
- Content Length: -1
- Content Type: null
- Auth Type: null
- HTTP Method: GET
- Path Info: null
- Path Trans: null
- Query String: null
- Remote User: null
- Session Id: null
- Request URI: /LoginTest/index.jsp
- Servlet Path: /index.jsp
- Accept: null
- Host: 127.0.0.1:8080
- Referer : null
- Accept-Language : null
- Accept-Encoding : null
- User-Agent : Apache-HttpClient/4.1.1 (java 1.5)
- Connection : Keep-Alive
- Cookie : null
- Created : 1322293843369
- LastAccessed : 1322293843369
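Analysis: this request simply loads the page and does not use a CookieStore to manage cookies automatically, so no Cookie or Session Id information appears above. Compared with the browser, the User-Agent differs, and the Cookie, Session Id, Accept, Accept-Language, and Accept-Encoding values were never set.
When crawling a site with HttpClient, the Host, Referer, User-Agent, Connection, and Cookie settings all need care, as do the crawl frequency and the entry URL.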
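The missing Cookie and Session Id come down to whether the client keeps a cookie store between requests. As a rough illustration, the sketch below uses the JDK's java.net.CookieManager instead of HttpClient's own CookieStore (a deliberate substitution so the example needs no extra library). The class name SessionCookieSketch, the helper cookiesAfter, and the Path=/LoginTest attribute are assumptions for illustration; the JSESSIONID value is the one from the IE output above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo: shows a cookie store turning Set-Cookie into a later Cookie header.
class SessionCookieSketch {
    /**
     * Records one Set-Cookie response header for the given URL, then returns
     * the Cookie request header value(s) the store would send back to that URL.
     */
    static List<String> cookiesAfter(String url, String setCookie) {
        CookieManager manager = new CookieManager(); // backed by an in-memory cookie store
        URI uri = URI.create(url);
        Map<String, List<String>> response = new HashMap<String, List<String>>();
        response.put("Set-Cookie", Collections.singletonList(setCookie));
        try {
            // What a client does after the first response arrives.
            manager.put(uri, response);
            // What a client does before sending the second request.
            Map<String, List<String>> request =
                    manager.get(uri, new HashMap<String, List<String>>());
            List<String> cookies = request.get("Cookie");
            return cookies == null ? new ArrayList<String>() : cookies;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookiesAfter(
                "http://127.0.0.1:8080/LoginTest/index.jsp",
                "JSESSIONID=E2C384C095E34AD355684EB554517FB1; Path=/LoginTest"));
    }
}
```

Without such a store every request looks like a first visit, which matches the null Cookie and Session Id lines in the HttpClient output.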
4. Code for setting the HttpClient header information:
Java code
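- // Needs java.util.HashMap, java.util.Map and org.apache.http.client.methods.HttpRequestBase
- HashMap<String, String> headers = new HashMap<String, String>();
- // p.url comes from the surrounding crawler code, which is not shown here
- headers.put("Referer", p.url);
- headers.put("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 Greatwqs");
- headers.put("Accept","text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
- headers.put("Accept-Language","zh-cn,zh;q=0.5");
- headers.put("Host","www.yourdomain.com");
- headers.put("Accept-Charset","ISO-8859-1,utf-8;q=0.7,*;q=0.7");
- // A map holds one value per key, so this replaces the Referer set above
- headers.put("Referer", "http://www.yourdomian.com/xxxAction.html");
- HttpRequestBase httpget = ......
- // setHeaders() expects a Header[] rather than a Map, so copy the entries one by one
- for (Map.Entry<String, String> entry : headers.entrySet()) {
-     httpget.setHeader(entry.getKey(), entry.getValue());
- }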
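The same map-to-request idea can be tried without any third-party library. The following is a minimal sketch, assuming the JDK's HttpURLConnection in place of Apache HttpClient; the class name BrowserHeaders and the helpers browserLike and prepare are hypothetical. Host is left out on purpose: HttpURLConnection treats it as a restricted header by default, and the output in step 5 shows that overriding it changes the reported server name and port.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds browser-like headers and copies them onto a request.
class BrowserHeaders {
    /** Header values copied from the listing; the referer is passed in by the caller. */
    static Map<String, String> browserLike(String referer) {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.6) "
                + "Gecko/20100625 Firefox/3.6.6 Greatwqs");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-cn,zh;q=0.5");
        headers.put("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
        headers.put("Referer", referer);
        return headers;
    }

    /** Opens a connection object (no network traffic yet) and sets every header on it. */
    static HttpURLConnection prepare(String url, Map<String, String> headers) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            for (Map.Entry<String, String> entry : headers.entrySet()) {
                conn.setRequestProperty(entry.getKey(), entry.getValue());
            }
            return conn;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = browserLike("http://www.yourdomian.com/xxxAction.html");
        HttpURLConnection conn = prepare("http://127.0.0.1:8080/LoginTest/index.jsp", headers);
        // Print the headers that will be sent once the connection is actually used.
        for (String name : headers.keySet()) {
            System.out.println(name + ": " + conn.getRequestProperty(name));
        }
    }
}
```

Keeping the header values in one map makes it easy to reuse the same browser profile across requests and to vary only the Referer per page.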
5. Requesting http://127.0.0.1:8080/LoginTest/index.jsp with the newly configured HttpClient returns the following HTML:
Output
- Protocol: HTTP/1.1
- Scheme: http
- Server Name: www.yourdomain.com
- Server Port: 80
- Protocol: HTTP/1.1
- Server Info: Apache Tomcat/6.0.18
- Remote Addr: 127.0.0.1
- Remote Host: 127.0.0.1
- Character Encoding: null
- Content Length: -1
- Content Type: null
- Auth Type: null
- HTTP Method: GET
- Path Info: null
- Path Trans: null
- Query String: null
- Remote User: null
- Session Id: null
- Request URI: /LoginTest/index.jsp
- Servlet Path: /index.jsp
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Host: www.yourdomain.com
- Referer : http://www.yourdomian.com/xxxAction.html
- Accept-Language : zh-cn,zh;q=0.5
- Accept-Encoding : null
- User-Agent : Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 Greatwqs
- Connection : Keep-Alive
- Cookie : null
- Created : 1322294148709
- LastAccessed : 1322294148709