Java Web Crawler Basics: Simulating Login
阿新 • Published: 2018-12-30
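When designing a crawler, a first look around the target site often reveals that some pages require a login; without it, the subpages holding the data we want cannot be reached. In that case a simulated-login step has to be added on top of what was covered earlier.
Simulating a login is basically the same as the HttpClient usage described earlier. The difference is that site logins nowadays mostly use the POST method and carry the parameters needed for login, such as the account and password. So we only need to mimic the real operation and set the parameters the target site expects on the HttpPost. The following uses a simulated Zhihu login as the example: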
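1. Determine the parameters the login request must carry:
First determine which parameters the login request has to carry. This again uses Fiddler, the packet-capture tool mentioned earlier. Capturing the login traffic showed that the following parameters must be sent:
Next, a look at each of these parameters:
- _xsrf: At first I was puzzled about where this parameter came from, but searching the source of the login page showed that it is returned in the page source on every request to www.zhihu.com, and that its value is different each time (as shown in the figure).
- captcha: Obviously, the verification code.
- captcha_type: The type of the captcha.
- email: The account.
- password: The password.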
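To make it concrete what "carrying parameters in a POST" means, here is a minimal sketch of how such name/value pairs end up in an application/x-www-form-urlencoded request body. It uses only the JDK; the class name LoginFormSketch, the method encodeForm, and all values are placeholders I made up for illustration, not anything captured from Zhihu. In the real code later on, HttpClient's UrlEncodedFormEntity does this work.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only
class LoginFormSketch {

    /** Encodes name/value pairs as an application/x-www-form-urlencoded body. */
    static String encodeForm(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&'); // pairs are separated by '&'
            }
            body.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return body.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is supported by every JVM, so this should never happen
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the body is predictable
        Map<String, String> params = new LinkedHashMap<>();
        params.put("_xsrf", "token-from-page");   // placeholder value
        params.put("captcha", "ab12");            // placeholder value
        params.put("email", "user@example.com");  // placeholder value
        params.put("password", "p&ss word");      // placeholder value
        System.out.println(encodeForm(params));
    }
}
```

Special characters in a value, such as `&` or a space, are escaped so that they cannot be mistaken for separators in the body.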
2. Obtain the parameters:
- _xsrf: Send a request to https://www.zhihu.com, download the login page, and read the value directly from the page. The extraction code:
/**
 * Get _xsrf.
 * getPageHtml() is the method that downloads a page.
 */
public String get_xsrf(String url) {
    String page = getPageHtml(url);
    Document doc = Jsoup.parse(page);
    // The token sits in the element whose name attribute is "_xsrf"
    Elements srfs = doc.getElementsByAttributeValue("name", "_xsrf");
    String xsrf = srfs.first().attr("value");
    return xsrf;
}
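- captcha: The verification code. When I first logged in, Zhihu was using a Chinese captcha in which you have to pick out the characters drawn upside down. The parameter sent to the server is the position of the mouse clicks, and the server decides whether the answer is correct by checking those positions against the positions of the upside-down characters in the image. Analyzing those parameters seemed like too much trouble (honestly, I'm just not good enough yet). After several experiments I found that Zhihu also has a captcha where you type digits and letters. That one is much simpler, and it carries one parameter fewer: captcha_type is not sent.
So all that is needed is to download the captcha image locally and then set the corresponding parameter. First, downloading the image: analyzing the page source turned up no image, so it had to be loaded through JS, and packet capture did indeed reveal the request that loads the captcha.
Knowing the source made things simple: just request that address the same way as for _xsrf. But looking at the address, I was stuck again: where does the r parameter come from? Repeated captures showed that r is different every time with no discernible pattern, so I guessed it might be a random number. Changing r to a random number in the browser and requesting the address did return a captcha image.
After that, downloading is routine. With the image in hand, the second step is to set the captcha parameter. My skills don't extend to automatic recognition, so the captcha is entered from the console: on first login the captcha is downloaded locally, read by a person, and typed into the console, and the value is then set in the request parameters. The download code is as follows (the console input is handled in login() in the full listing in step 3):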
/**
 * Download the captcha image to a local file.
 * @param url         page to inspect for a captcha image
 * @param desFileName path of the local file to write
 * @return true if a captcha image was downloaded, false otherwise
 * @throws MalformedURLException if the captcha URL is malformed
 */
public boolean downloaderCapter(String url, String desFileName) throws MalformedURLException {
    boolean flag = false;
    String page = getPageHtml(url);
    Document doc = Jsoup.parse(page);
    Elements capchas = doc.select("img.Captcha-image");
    System.out.println(capchas.size());
    if (capchas.size() == 0) {
        System.out.println("No captcha required");
    } else {
        // Generate a random 13-digit number for the r parameter
        Random rnd = new Random();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 13; i++) {
            sb.append((char) ('0' + rnd.nextInt(10)));
        }
        String id = sb.toString();
        System.out.println(id);
        String caurl = "https://www.zhihu.com/captcha.gif?r=" + id + "&type=login";
        // Download the captcha image
        URL captcha_url = new URL(caurl);
        System.out.println(captcha_url);
        File file = new File(desFileName);
        if (file.exists()) {
            file.delete();
        }
        // try-with-resources closes both streams even if the copy fails
        try (InputStream is = captcha_url.openConnection().getInputStream();
             OutputStream os = new FileOutputStream(file)) {
            byte[] bs = new byte[1024]; // 1 KB data buffer
            int len;                    // number of bytes read
            while ((len = is.read(bs)) != -1) {
                os.write(bs, 0, len);
            }
            flag = true;
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return flag;
}
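- email and password: Simply send the corresponding account and password in plain text.
Relatively speaking, Zhihu is a site with a fairly simple login: it only adds a string that can be found in the page, plus a captcha. More demanding sites also encrypt the password and send it to the server as ciphertext; in that case you have to work out the encryption method and encrypt your own parameter before sending it. In short, give the server whatever parameters it asks for.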
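As a rough illustration of the "ciphertext password" case, the sketch below hashes a password with SHA-256 and hex-encodes it before it would be put into the form. This is purely an assumption for the sake of an example: Zhihu, as described here, takes the password in plain text, and a real site might use MD5, a salted hash, RSA, or something else entirely, which you would have to read out of its JavaScript. The class name PasswordDigestSketch and the method sha256Hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only
class PasswordDigestSketch {

    /** Returns the lowercase hex SHA-256 digest of the given text (UTF-8 bytes). */
    static String sha256Hex(String plain) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(plain.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The transformed value, not the plain text, would go into the "password" parameter
        System.out.println(sha256Hex("abc"));
    }
}
```

Whatever the algorithm turns out to be, the idea is the same: reproduce in Java the transformation the site's own page applies before submitting the form.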
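3. Implement the login:
Once all the parameters are in hand it is simple (or so I thought): just simulate the request with HttpClient. The code is as follows: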
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import javax.net.ssl.SSLContext;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class ZhiHu3 {
    private String mainurl = "https://www.zhihu.com";
    private String email = "";
    private String password = "";
    private String _xsrf = "";
    // Set to true to send the login request through the local proxy configured below
    boolean daili = false;
    HttpClientBuilder httpClientBuilder = HttpClientBuilder.create();
    //CloseableHttpClient httpClient = httpClientBuilder.build();
    CloseableHttpClient httpClient = createSSLClientDefault();
    private HttpHost proxy = new HttpHost("127.0.0.1", 8888, "http");
    private RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
    private String useage = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
    private RequestConfig configtime = RequestConfig.custom().setCircularRedirectsAllowed(true).setSocketTimeout(10000).setConnectTimeout(10000).build();

    public ZhiHu3() {
    }

    public ZhiHu3(String email, String password) {
        this.email = email;
        this.password = password;
    }

    // Client helper: trust every certificate the other side (https) presents.
    // Convenient for experiments, but it disables certificate verification.
    private CloseableHttpClient createSSLClientDefault() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust all certificates
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslFactory = new SSLConnectionSocketFactory(sslContext);
            return HttpClients.custom().setSSLSocketFactory(sslFactory).build();
        } catch (Exception e) {
            // Report the failure instead of swallowing it, then fall back to the default client
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }
    public String getPageHtml(String url) {
        String html = "";
        HttpGet httpget = new HttpGet(url);
        httpget.addHeader("User-Agent", useage);
        httpget.setConfig(configtime);
        // Closing the response releases the connection, even when reading fails
        try (CloseableHttpResponse response = httpClient.execute(httpget)) {
            HttpEntity entity = response.getEntity();
            html = EntityUtils.toString(entity, "utf-8");
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
    /**
     * Download the captcha image to a local file.
     * @param url         page to inspect for a captcha image
     * @param desFileName path of the local file to write
     * @return true if captcha data was written to the file, false otherwise
     * @throws MalformedURLException declared for compatibility with the earlier version
     */
    public boolean downloaderCapter(String url, String desFileName) throws MalformedURLException {
        boolean flag = false;
        String page = getPageHtml(url);
        Document doc = Jsoup.parse(page);
        Elements capchas = doc.select("img.Captcha-image");
        System.out.println(capchas.size());
        if (capchas.size() == 0) {
            System.out.println("No captcha required");
        } else {
            // Generate a random 13-digit number for the r parameter
            Random rnd = new Random();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 13; i++) {
                sb.append((char) ('0' + rnd.nextInt(10)));
            }
            String id = sb.toString();
            String caurl = "https://www.zhihu.com/captcha.gif?r=" + id + "&type=login";
            // Download the captcha image
            File file = new File(desFileName);
            if (file.exists()) {
                file.delete();
            }
            HttpGet getCaptcha = new HttpGet(caurl);
            // Fetch the content stream once and read from it, rather than calling
            // getContent() on every loop iteration; all resources are closed automatically
            try (CloseableHttpResponse imageResponse = httpClient.execute(getCaptcha);
                 InputStream is = imageResponse.getEntity().getContent();
                 OutputStream os = new FileOutputStream(file)) {
                byte[] bs = new byte[1024];
                int len;
                while ((len = is.read(bs)) != -1) {
                    os.write(bs, 0, len);
                    flag = true;
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return flag;
    }
    /**
     * Get _xsrf from the page at the given URL.
     */
    public String get_xsrf(String url) {
        String page = getPageHtml(url);
        Document doc = Jsoup.parse(page);
        Elements srfs = doc.getElementsByAttributeValue("name", "_xsrf");
        String xsrf = srfs.first().attr("value");
        return xsrf;
    }
    public void login() throws IOException {
        List<NameValuePair> para = new ArrayList<NameValuePair>();
        _xsrf = get_xsrf(mainurl);
        System.out.println(_xsrf);
        para.add(new BasicNameValuePair("_xsrf", _xsrf));
        Map<String, String> header = new HashMap<String, String>();
        header.put("Content-Type", "application/x-www-form-urlencoded");
        header.put("Referer", "https://www.zhihu.com/");
        header.put("User-Agent", useage);
        header.put("X-Requested-With", "XMLHttpRequest");
        header.put("Host", "www.zhihu.com");
        header.put("Origin", "https://www.zhihu.com");
        // Download the captcha (if any) and have a person type it into the console
        boolean flag = downloaderCapter(mainurl, "D:\\image.png");
        if (flag) {
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            System.out.println("Please enter the captcha:");
            String captcha = br.readLine();
            para.add(new BasicNameValuePair("captcha", captcha));
        }
        para.add(new BasicNameValuePair("email", email));
        para.add(new BasicNameValuePair("password", password));
        para.add(new BasicNameValuePair("rememberme", "true"));
        HttpPost httppost = new HttpPost("https://www.zhihu.com/login/email");
        for (Map.Entry<String, String> entry : header.entrySet()) {
            httppost.addHeader(entry.getKey(), entry.getValue());
        }
        httppost.addHeader("X-Xsrftoken", _xsrf);
        if (daili) {
            httppost.setConfig(config);
        }
        httppost.setEntity(new UrlEncodedFormEntity(para, "utf-8"));
        // Closing the response releases the connection
        try (CloseableHttpResponse res = httpClient.execute(httppost)) {
            int statuts_code = res.getStatusLine().getStatusCode();
            System.out.println(statuts_code);
            System.out.println(EntityUtils.toString(res.getEntity(), "utf-8"));
        }
    }
    public static void main(String[] args) {
        // Replace the two arguments with your own account and password
        ZhiHu3 zhihu = new ZhiHu3("[email protected]","xxxxxxxx");
        try {
            zhihu.login();
            String html = zhihu.getPageHtml("https://www.zhihu.com/question/following");
            //System.out.println(html);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
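Originally the HttpClient was created with httpClientBuilder.build(), but in practice this occasionally raised SSL certificate errors, so the helper function was added. After a successful login you can crawl whatever data you want; here I visit my "following" page after logging in. The status code printed is 200, indicating that the login succeeded.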
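One caveat worth a sketch: a 200 status only says that the HTTP request itself succeeded, and a server can also answer 200 with an error message in the body. A slightly more careful check looks at the content of a page that needs login. The marker below (a password input still present in the page) is my own assumption about what an anonymous visitor would see, not something verified against Zhihu, and LoginCheckSketch and looksLoggedIn are hypothetical names.

```java
// Hypothetical helper class, for illustration only
class LoginCheckSketch {

    // Assumption: a page served to a visitor who is NOT logged in still contains a password input
    private static final String LOGIN_FORM_MARKER = "name=\"password\"";

    /** Heuristic: status 200 and no login form left in the returned page. */
    static boolean looksLoggedIn(int statusCode, String html) {
        if (statusCode != 200 || html == null) {
            return false;
        }
        return !html.contains(LOGIN_FORM_MARKER);
    }

    public static void main(String[] args) {
        String anonymousPage = "<form><input name=\"password\" type=\"password\"></form>";
        String memberPage = "<div class=\"following\">questions I follow</div>";
        System.out.println(looksLoggedIn(200, anonymousPage));
        System.out.println(looksLoggedIn(200, memberPage));
    }
}
```

In the listing above, the html string returned by getPageHtml for the "following" page could be passed to a check like this instead of relying on the status code alone.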