Web crawler: simulating a logged-in state with the cookie from an existing login, and keeping the session for follow-up operations
阿新 • Published: 2018-11-28
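At first I planned to have a Java program log in to the site directly and then carry out the follow-up operations. It turned out that some sites go through too many redirects for that to be practical,
so I switched to reusing the cookie of an already logged-in browser session to keep the session alive.
It is simple to use: log in to the target site in the browser, copy the cookie value, and put it into the program. The program's requests then run inside that logged-in session.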
1. Add the Maven dependencies (the code in step 3 needs only httpclient; httpclient-cache and httpmime are not used by it)
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient-cache</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpmime</artifactId>
    <version>4.1.2</version>
</dependency>
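2. Open the site in the browser and log in with your username and password, then press F12 (or right-click an empty area of the page -> Inspect).
The browser's developer tools open. Go to Network --> select a request to the site --> Headers --> the Cookie entry under the request headers is the value we need. Copy it.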
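The copied value is a single string of `name=value` pairs separated by `;`. Before pasting it into the program, it can help to check that the copy is complete and still contains the session cookie. The following is a minimal sketch using only the JDK; the class name `CookieStringInspector` is made up for illustration, and the sample string is a shortened form of the cookie used in step 3. Which cookie actually carries the session depends on the site (for a PHP site it is often `PHPSESSID`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original program.
final class CookieStringInspector {

    // Splits a "name=value; name2=value2" string into ordered name/value pairs.
    static Map<String, String> parse(String cookieHeader) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : cookieHeader.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and pairs without a name
            }
            cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Shortened sample; paste the value copied from the browser instead.
        String copied = "PHPSESSID=eemc0vbbmbr57d6s9rhvh6rav7; DedeUserID=6745";
        Map<String, String> cookies = parse(copied);
        System.out.println("cookie names: " + cookies.keySet());
        System.out.println("has session id: " + cookies.containsKey("PHPSESSID"));
    }
}
```

If the session cookie is missing from the output, the value was probably copied from a request made before logging in.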
3. The code: add the cookie value you just copied as a request header
package com.html;

import java.util.HashMap;
import java.util.Map;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicHeader;
import org.apache.http.util.EntityUtils;

public class HtmlRequest {

    public static void main(String[] args) {
        // URL of the page to crawl
        String url = "http://www.baidu.com";
        Map<String, String> header = new HashMap<String, String>();
        // Paste the cookie copied from the browser here
        header.put("Cookie", "Hm_lvt_9b04b6953ffb9872983f02eee2929d23=1536065531; Hm_lpvt_9b04b6953ffb9872983f02eee2929d23=1536066013; PHPSESSID=eemc0vbbmbr57d6s9rhvh6rav7; DedeUserID=6745; DedeUserID__ckMd5=ac9f8bd4bd2227be; DedeLoginTime=1536068126; DedeLoginTime__ckMd5=81e100997d01fca8");
        System.out.println(httpGet(url, null, header));
    }

    /**
     * Sends a GET request.
     *
     * @param url     URL to request
     * @param encode  charset used to decode the response body; utf-8 when null
     * @param headers request headers to send, e.g. the Cookie header
     * @return the response body, or null if the request failed
     */
    public static String httpGet(String url, String encode, Map<String, String> headers) {
        if (encode == null) {
            encode = "utf-8";
        }
        String content = null;
        DefaultHttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // Set the request headers
        Header[] requestHeaders = buildHeader(headers);
        if (requestHeaders != null && requestHeaders.length > 0) {
            httpGet.setHeaders(requestHeaders);
        }
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            // A response may carry no body, so guard against a null entity
            if (entity != null) {
                content = EntityUtils.toString(entity, encode);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the connection. HttpGet.releaseConnection() only exists from
            // HttpClient 4.2 on, so with 4.1.2 shut down the connection manager instead.
            httpclient.getConnectionManager().shutdown();
        }
        return content;
    }

    /**
     * Builds the request headers.
     *
     * @param params header names mapped to header values
     * @return the headers as an array, or null if params is null or empty
     */
    public static Header[] buildHeader(Map<String, String> params) {
        Header[] headers = null;
        if (params != null && params.size() > 0) {
            headers = new BasicHeader[params.size()];
            int i = 0;
            for (Map.Entry<String, String> entry : params.entrySet()) {
                headers[i] = new BasicHeader(entry.getKey(), entry.getValue());
                i++;
            }
        }
        return headers;
    }
}
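The same idea does not depend on Apache HttpClient: all that matters is that the request carries the copied string in its `Cookie` header. As a rough sketch, assuming Java 11 or later is available, the JDK's built-in `java.net.http` client can send the same request with no Maven dependency. This is an alternative to the program above and not what it uses; the class name `CookieRequestSketch` and the placeholder cookie value are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical alternative using the JDK 11+ HTTP client.
final class CookieRequestSketch {

    // Builds a GET request that carries the copied cookie string unchanged.
    static HttpRequest buildRequest(String url, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookie)
                .GET()
                .build();
    }

    // Sends the request and returns the response body as text.
    static String fetch(String url, String cookie) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url, cookie), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder; paste the value copied from the browser here.
        String cookie = "PHPSESSID=your-session-id";
        System.out.println(fetch("http://www.baidu.com", cookie));
    }
}
```

Keeping `buildRequest` separate from `fetch` makes it possible to check the headers that will be sent without touching the network.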
4. Run the main method directly. It prints the HTML of the page as the site serves it to the logged-in account.
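A copied cookie only works while the server still accepts that session, so it is worth checking that the returned HTML really is the logged-in version of the page. One simple heuristic is to look for text that only a signed-in user would see. The sketch below assumes such a marker exists; the class name `LoginStateCheck`, the sample markup, and the "Log out" marker are all made up and need to be replaced with something specific to the target site.

```java
// Hypothetical helper, not part of the original program.
final class LoginStateCheck {

    // Heuristic: the page counts as "logged in" when it contains text that
    // only a signed-in user would see. The marker is site-specific.
    static boolean looksLoggedIn(String html, String loggedInMarker) {
        return html != null && html.contains(loggedInMarker);
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by httpGet(...); the markup is invented.
        String html = "<html><body><a href=\"/logout\">Log out</a></body></html>";
        if (looksLoggedIn(html, "Log out")) {
            System.out.println("cookie still valid");
        } else {
            System.out.println("cookie expired or rejected; copy a fresh one from the browser");
        }
    }
}
```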
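Summary: I am still working out how to do the login itself automatically in Java and then carry on with the follow-up operations. A little progress every day, and the RMB is waving at you.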
A plug for my personal site: www.huashuku.top