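Python Web Scraping: Cracking a JavaScript-Encrypted Cookie
Original source: jhao104
Many websites now deploy anti-scraping measures to keep crawlers out. For developers, every such measure invites a countermeasure, so today we work through one of these protections. This is for learning and reference only.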
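Preface
I maintain a proxy pool project on GitHub whose proxies are scraped from free proxy-listing sites. This morning someone told me that one of the scraping endpoints had stopped working and was returning status 521. I ran the code to help out and found that this was indeed the case.
Comparing the captured traffic in Fiddler made it fairly clear that a cookie generated by obfuscated JavaScript was the reason the original request now returned 521.
Finding the Problem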
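Open Fiddler, then load the target site (http://www.kuaidaili.com/proxylist/2/) in a browser and request the page twice.
[Screenshot: the two requests]
Comparing the two responses carefully shows the following:
[Screenshot: first request]
[Screenshot: second request]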
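1. The Cookie header of the second request carries one extra entry: _ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971
2. The first response contains a block of complicated, unreadable JS code, while the second response contains the correct page content.
This is a common anti-scraping technique. The flow is roughly as follows: on the first request, the server returns dynamically generated, obfuscated JS whose job is to add a new entry to the cookie for server-side verification, and the status code of that response is 521. The browser then repeats the request with the new cookie, the server validates it, and the data is returned. This is also why the scraper could not get any data: it never executed the JS.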
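As a minimal sketch of how a script might recognize this first-stage response, the check below assumes that a 521 status combined with an inline script calling setTimeout is a sufficient signal. The helper name `looks_like_js_challenge` and the shortened sample page are hypothetical and not part of the original code.

```python
def looks_like_js_challenge(status_code, body):
    """Heuristically detect the cookie-setting challenge page.

    Assumption: the challenge is served with status 521 and consists of an
    inline <script> that schedules the obfuscated function via setTimeout.
    """
    lowered = body.lower()
    return status_code == 521 and "<script" in lowered and "settimeout(" in lowered


# Shortened, made-up stand-in for the first response
challenge = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lu(158)", 200); </script></body></html>'
)
print(looks_like_js_challenge(521, challenge))
print(looks_like_js_challenge(200, "<html>proxy list</html>"))
```

A check like this lets the scraper decide whether it needs to solve the challenge at all, instead of assuming every response is a challenge page.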
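Solving the Problem
Since the site generates the cookie with JS, one option is to translate the JS function into Python. The original JS looks like this: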
function lq(VA) { var qo, mo = "", no = "", oo = [0x8c, 0xcd, 0x4c, 0xf9, 0xd7, 0x4d, 0x25, 0xba, 0x3c, 0x16, 0x96, 0x44, 0x8d, 0x0b, 0x90, 0x1e, 0xa3, 0x39, 0xc9, 0x86, 0x23, 0x61, 0x2f, 0xc8, 0x30, 0xdd, 0x57, 0xec, 0x92, 0x84, 0xc4, 0x6a, 0xeb, 0x99, 0x37, 0xeb, 0x25, 0x0e, 0xbb, 0xb0, 0x95, 0x76, 0x45, 0xde, 0x80, 0x59, 0xf6, 0x9c, 0x58, 0x39, 0x12, 0xc7, 0x9c, 0x8d, 0x18, 0xe0, 0xc5, 0x77, 0x50, 0x39, 0x01, 0xed, 0x93, 0x39, 0x02, 0x7e, 0x72, 0x4f, 0x24, 0x01, 0xe9, 0x66, 0x75, 0x4e, 0x2b, 0xd8, 0x6e, 0xe2, 0xfa, 0xc7, 0xa4, 0x85, 0x4e, 0xc2, 0xa5, 0x96, 0x6b, 0x58, 0x39, 0xd2, 0x7f, 0x44, 0xe5, 0x7b, 0x48, 0x2d, 0xf6, 0xdf, 0xbc, 0x31, 0x1e, 0xf6, 0xbf, 0x84, 0x6d, 0x5e, 0x33, 0x0c, 0x97, 0x5c, 0x39, 0x26, 0xf2, 0x9b, 0x77, 0x0d, 0xd6, 0xc0, 0x46, 0x38, 0x5f, 0xf4, 0xe2, 0x9f, 0xf1, 0x7b, 0xe8, 0xbe, 0x37, 0xdf, 0xd0, 0xbd, 0xb9, 0x36, 0x2c, 0xd1, 0xc3, 0x40, 0xe7, 0xcc, 0xa9, 0x52, 0x3b, 0x20, 0x40, 0x09, 0xe1, 0xd2, 0xa3, 0x80, 0x25, 0x0a, 0xb2, 0xd8, 0xce, 0x21, 0x69, 0x3e, 0xe6, 0x80, 0xfd, 0x73, 0xab, 0x51, 0xde, 0x60, 0x15, 0x95, 0x07, 0x94, 0x6a, 0x18, 0x9d, 0x37, 0x31, 0xde, 0x64, 0xdd, 0x63, 0xe3, 0x57, 0x05, 0x82, 0xff, 0xcc, 0x75, 0x79, 0x63, 0x09, 0xe2, 0x6c, 0x21, 0x5c, 0xe0, 0x7d, 0x4a, 0xf2, 0xd8, 0x9c, 0x22, 0xa3, 0x3d, 0xba, 0xa0, 0xaf, 0x30, 0xc1, 0x47, 0xf4, 0xca, 0xee, 0x64, 0xf9, 0x7b, 0x55, 0xd5, 0xd2, 0x4c, 0xc9, 0x7f, 0x25, 0xfe, 0x48, 0xcd, 0x4b, 0xcc, 0x81, 0x1b, 0x05, 0x82, 0x38, 0x0e, 0x83, 0x19, 0xe3, 0x65, 0x3f, 0xbf, 0x16, 0x88, 0x93, 0xdd, 0x3b]; qo = "qo=241; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>3)|((oo[qo]<<5)&0xff))-70)&0xff;} while(--qo>=2);"; eval(qo); qo = 240; do { oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff; } while (--qo >= 3); qo = 1; for (; ;) { if (qo > 240) break; oo[qo] = ((((((oo[qo] + 2) & 0xff) + 76) & 0xff) << 1) & 0xff) | (((((oo[qo] + 2) & 0xff) + 76) & 0xff) >> 7); qo++; } po = ""; for (qo = 1; qo < oo.length - 1; qo++) if (qo % 6) po += String.fromCharCode(oo[qo] ^ VA); eval("qo=eval;qo(po);"); }
Translated into Python, it looks like this (for reference only):
def la(va):
    mo, no, oo = "", "0", [0x8c, 0xcd, 0x4c, 0xf9, 0xd7, 0x4d, 0x25, 0xba, 0x3c, 0x16, 0x96, 0x44, 0x8d, 0x0b, 0x90, 0x1e, 0xa3, 0x39, 0xc9, 0x86, 0x23, 0x61, 0x2f, 0xc8, 0x30, 0xdd, 0x57, 0xec, 0x92, 0x84, 0xc4, 0x6a, 0xeb, 0x99, 0x37, 0xeb, 0x25, 0x0e, 0xbb, 0xb0, 0x95, 0x76, 0x45, 0xde, 0x80, 0x59, 0xf6, 0x9c, 0x58, 0x39, 0x12, 0xc7, 0x9c, 0x8d, 0x18, 0xe0, 0xc5, 0x77, 0x50, 0x39, 0x01, 0xed, 0x93, 0x39, 0x02, 0x7e, 0x72, 0x4f, 0x24, 0x01, 0xe9, 0x66, 0x75, 0x4e, 0x2b, 0xd8, 0x6e, 0xe2, 0xfa, 0xc7, 0xa4, 0x85, 0x4e, 0xc2, 0xa5, 0x96, 0x6b, 0x58, 0x39, 0xd2, 0x7f, 0x44, 0xe5, 0x7b, 0x48, 0x2d, 0xf6, 0xdf, 0xbc, 0x31, 0x1e, 0xf6, 0xbf, 0x84, 0x6d, 0x5e, 0x33, 0x0c, 0x97, 0x5c, 0x39, 0x26, 0xf2, 0x9b, 0x77, 0x0d, 0xd6, 0xc0, 0x46, 0x38, 0x5f, 0xf4, 0xe2, 0x9f, 0xf1, 0x7b, 0xe8, 0xbe, 0x37, 0xdf, 0xd0, 0xbd, 0xb9, 0x36, 0x2c, 0xd1, 0xc3, 0x40, 0xe7, 0xcc, 0xa9, 0x52, 0x3b, 0x20, 0x40, 0x09, 0xe1, 0xd2, 0xa3, 0x80, 0x25, 0x0a, 0xb2, 0xd8, 0xce, 0x21, 0x69, 0x3e, 0xe6, 0x80, 0xfd, 0x73, 0xab, 0x51, 0xde, 0x60, 0x15, 0x95, 0x07, 0x94, 0x6a, 0x18, 0x9d, 0x37, 0x31, 0xde, 0x64, 0xdd, 0x63, 0xe3, 0x57, 0x05, 0x82, 0xff, 0xcc, 0x75, 0x79, 0x63, 0x09, 0xe2, 0x6c, 0x21, 0x5c, 0xe0, 0x7d, 0x4a, 0xf2, 0xd8, 0x9c, 0x22, 0xa3, 0x3d, 0xba, 0xa0, 0xaf, 0x30, 0xc1, 0x47, 0xf4, 0xca, 0xee, 0x64, 0xf9, 0x7b, 0x55, 0xd5, 0xd2, 0x4c, 0xc9, 0x7f, 0x25, 0xfe, 0x48, 0xcd, 0x4b, 0xcc, 0x81, 0x1b, 0x05, 0x82, 0x38, 0x0e, 0x83, 0x19, 0xe3, 0x65, 0x3f, 0xbf, 0x16, 0x88, 0x93, 0xdd, 0x3b]
    # Pass 1 (JS do-while over indices 241 down to 2):
    # negate each byte, rotate it right by 3 bits, then subtract 70.
    for qo in range(241, 1, -1):
        oo[qo] = (-oo[qo]) & 0xff
        oo[qo] = (((oo[qo] >> 3) | ((oo[qo] << 5) & 0xff)) - 70) & 0xff
    # Pass 2 (indices 240 down to 3): subtract the preceding byte.
    for qo in range(240, 2, -1):
        oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff
    # Pass 3 (indices 1 up to 240): add 2, add 76, then rotate left by 1 bit.
    for qo in range(1, 241):
        oo[qo] = ((((((oo[qo] + 2) & 0xff) + 76) & 0xff) << 1) & 0xff) | (((((oo[qo] + 2) & 0xff) + 76) & 0xff) >> 7)
    # Decode: the JS loop stops before the last element (qo < oo.length - 1)
    # and skips every index divisible by 6; XOR the rest with the argument.
    po = ""
    for qo in range(1, len(oo) - 1):
        if qo % 6:
            po += chr(oo[qo] ^ va)
    return po

print(la(82))
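The shift-and-mask expressions in the translation are easier to read once you notice that they are 8-bit rotations. The sketch below introduces two hypothetical helpers, `rotr8` and `rotl8` (names of my choosing, not from the original), and checks them against the raw expressions for every byte value.

```python
def rotr8(x, n):
    """Rotate the 8-bit value x right by n bits (0 < n < 8)."""
    return ((x >> n) | (x << (8 - n))) & 0xff


def rotl8(x, n):
    """Rotate the 8-bit value x left by n bits (0 < n < 8)."""
    return ((x << n) | (x >> (8 - n))) & 0xff


# Compare with the expressions used in the obfuscated function
for x in range(256):
    assert rotr8(x, 3) == ((x >> 3) | ((x << 5) & 0xff))
    assert rotl8(x, 1) == (((x << 1) & 0xff) | (x >> 7))

print(hex(rotr8(0x01, 3)))  # the lowest bit wraps around to bit 5
print(hex(rotl8(0x80, 1)))  # the highest bit wraps around to bit 0
```

Because the shift amounts and constants differ between responses (the sample page later in this article uses different ones), a hand translation like this only fits one particular response.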
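If translating the function by hand is not practical, there is another way: use the browser's JS debugger. Create a new html file, paste in the raw html of the first response, save it, and open it in a browser. Set a breakpoint just before the eval and you will see output like this:
[Screenshot: debugging the obfuscated JS code]
The variable po turns out to be: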
document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; window.document.location=document.URL
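Below it there is also eval("qo=eval;qo(po);"). JavaScript's eval works much like Python's: this statement assigns the eval function to qo and then uses it to evaluate the string po. The first half of po adds the cookie to the browser, and the second half, window.document.location=document.URL, reloads the current page.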
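To make the structure of po concrete, here is a small sketch that splits the string shown above into the cookie name, its value, and its attributes. The function name `parse_set_cookie_js` is hypothetical, and the parsing assumes po always has the `document.cookie='...'` shape seen in this one capture.

```python
def parse_set_cookie_js(po):
    """Split "document.cookie='name=value; attr=...'; ..." into its parts."""
    prefix = "document.cookie='"
    start = po.index(prefix) + len(prefix)
    end = po.index("'", start)  # closing quote of the cookie string
    parts = [p.strip() for p in po[start:end].split(";") if p.strip()]
    name, value = parts[0].split("=", 1)
    attrs = {}
    for part in parts[1:]:
        key, _, val = part.partition("=")
        attrs[key.strip().lower()] = val.strip()
    return name, value, attrs


po = ("document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-"
      "2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; "
      "domain=.kuaidaili.com; path=/'; window.document.location=document.URL")
name, value, attrs = parse_set_cookie_js(po)
print(name)
print(value)
print(attrs)
```

Only the name and value need to be sent back to the server; the expires, domain, and path attributes tell the browser how to store the cookie.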
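This confirms the explanation above: the first request carries no cookie, so the server returns JS code that generates the cookie and then reloads the page automatically. A browser can execute that code and request the data again with the new cookie, whereas a Python script that receives it is stuck at the first step.
So how can Python execute this JS as well? One answer is PyV8. V8 is the JavaScript engine built into Chromium and is reputed to be the fastest. PyV8 is a Python wrapper around V8's external API, which lets Python code interact with JavaScript directly. Installation instructions for PyV8 are easy to find online.
Code
With the analysis done, let's get to the code.
First, request the page normally, which returns html containing the obfuscated JS function: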
import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


# First request: fetch the dynamically generated, obfuscated JS
first_html = getHtml(TARGET_URL)
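Because the response is html rather than a bare JS function, regular expressions are needed to extract the JS function and the argument it is called with.
[Screenshot: content of the first response]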
# Extract the obfuscated JS function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print('get js func:\n' + js_func)
# Extract the argument the JS function is called with
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print('get js arg:\n' + js_arg)
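To see what the two patterns capture, the sketch below runs them against a shortened, made-up stand-in for the first response. The real page has the same shape but a much longer function body.

```python
import re

# Hypothetical, abbreviated version of the challenge page
sample_html = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lu(158)", 200); '
    'function lu(OE) { var po = "demo"; eval("qo=eval;qo(po);"); } '
    '</script> </body></html>'
)

# Everything from the first "function " up to the closing script tag
js_func = ''.join(re.findall(r'(function .*?)</script>', sample_html))
# The digits passed to the function inside the setTimeout string
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', sample_html))

print(js_func)
print(js_arg)
```

Note that `.` does not match newlines by default, so the first pattern relies on the whole script being on one line. If the page ever spreads the function over several lines, passing `re.DOTALL` would be one way to adapt it.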
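One more thing to note: the JS function does not return the cookie. It sets the cookie directly in the browser, so eval("qo=eval;qo(po);") has to be replaced with return po. The function then returns the content of po.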
def executeJS(js_func_string, arg):
    ctxt = PyV8.JSContext()
    ctxt.enter()
    # Wrap the function source in parentheses so eval yields a callable
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


# Patch the JS function so that it returns the cookie string
js_func = js_func.replace('eval("qo=eval;qo(po);")', 'return po')
# Execute the JS to obtain the cookie
cookie_str = executeJS(js_func, js_arg)
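`str.replace` returns the string unchanged when the target text is absent, so if the site alters the final eval call the patch would silently do nothing and the JS would try to set `document.cookie` in an engine that has no `document`. A minimal sketch of a guard, using the hypothetical names `EVAL_CALL` and `patch_js`:

```python
EVAL_CALL = 'eval("qo=eval;qo(po);")'


def patch_js(js_func):
    """Replace the final eval call with 'return po', failing loudly if it is missing."""
    if EVAL_CALL not in js_func:
        raise ValueError("expected eval call not found; the page's JS may have changed")
    return js_func.replace(EVAL_CALL, 'return po')


# Made-up, shortened function body for illustration
demo_func = 'function lu(OE) { var po = "demo"; eval("qo=eval;qo(po);"); }'
print(patch_js(demo_func))
```

Failing early here gives a clearer error than whatever exception the JS engine would raise later.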
The cookie comes back as a string, but requests.get() expects a dictionary, so convert it:
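def parseCookie(string):
    string = string.replace("document.cookie='", "")
    # Keep only the leading name=value pair; drop expires, domain and path
    clearance = string.split(';')[0]
    # Split on the first '=' only, in case the value itself contains '='
    name, value = clearance.split('=', 1)
    return {name: value}


# Convert the cookie string into a dictionary
cookie = parseCookie(cookie_str)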
Finally, request the page again with the parsed cookie, and the data comes back:
# Request the url again with the cookie to get the real content
print(getHtml(TARGET_URL, cookie)[0:500])
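The whole exchange can be summarized as a request, a solve step, and a retry. The sketch below is an illustration rather than the article's code: `fetch_with_clearance`, `fetch`, and `solve_challenge` are hypothetical names, and the network and the JS engine are passed in as callables so the flow can be exercised with a fake server.

```python
def fetch_with_clearance(fetch, solve_challenge, url):
    """Fetch url, solving the 521 cookie challenge once if it appears.

    fetch(url, cookies) -> (status_code, body)
    solve_challenge(body) -> dict of cookies to send on the retry
    """
    status, body = fetch(url, None)
    if status != 521:
        return body  # no challenge, the first response is the data
    cookies = solve_challenge(body)
    status, body = fetch(url, cookies)
    if status == 521:
        raise RuntimeError("clearance cookie was rejected")
    return body


def fake_fetch(url, cookies):
    """Stand-in for the server: returns data only when the cookie is present."""
    if cookies and cookies.get("_ydclearance") == "demo":
        return 200, "<html>proxy list</html>"
    return 521, "<script>...</script>"


result = fetch_with_clearance(
    fake_fetch, lambda body: {"_ydclearance": "demo"}, "http://example.invalid/"
)
print(result)
```

In a real scraper, `fetch` would wrap the HTTP request and `solve_challenge` would combine the extraction, patching, JS execution, and cookie parsing steps described above.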
The complete code follows:
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
Description : Python web scraping: cracking a JavaScript-encrypted cookie
              Example site: Kuaidaili, http://www.kuaidaili.com/proxylist/1/
-------------------------------------------------
"""
import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


def executeJS(js_func_string, arg):
    ctxt = PyV8.JSContext()
    ctxt.enter()
    # Wrap the function source in parentheses so eval yields a callable
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


def parseCookie(string):
    string = string.replace("document.cookie='", "")
    # Keep only the leading name=value pair; drop expires, domain and path
    clearance = string.split(';')[0]
    # Split on the first '=' only, in case the value itself contains '='
    name, value = clearance.split('=', 1)
    return {name: value}


# First request: fetch the dynamically generated, obfuscated JS
first_html = getHtml(TARGET_URL)
# first_html = """
# <html><body><script language="javascript"> window.onload=setTimeout("lu(158)", 200); function lu(OE) {var qo, mo="", no="", oo = [0x64,0xaa,0x98,0x3d,0x56,0x64,0x8b,0xb0,0x88,0xe1,0x0d,0xf4,0x99,0x31,0xd8,0xb6,0x5d,0x73,0x98,0xc3,0xc4,0x7a,0x1e,0x38,0x9d,0xe8,0x8d,0xe4,0x0a,0x2e,0x6c,0x45,0x69,0x41,0xe5,0xd0,0xe5,0x11,0x0b,0x35,0x7b,0xe4,0x09,0xb1,0x2b,0x6d,0x82,0x7c,0x25,0xdd,0x70,0x5a,0xc4,0xaa,0xd3,0x74,0x98,0x42,0x3c,0x60,0x2d,0x42,0x66,0xe0,0x0a,0x2e,0x96,0xbb,0xe2,0x1d,0x38,0xdc,0xb1,0xd6,0x0e,0x0d,0x76,0xae,0xc3,0xa9,0x3b,0x62,0x47,0x40,0x15,0x93,0xb7,0xee,0xc3,0x3e,0xfd,0xd3,0x0d,0xf6,0x61,0xdc,0xf1,0x2c,0x54,0x8c,0x90,0xfa,0x24,0x5b,0x83,0x0c,0x75,0xaf,0x18,0x01,0x7e,0x68,0xe0,0x0a,0x72,0x1e,0x88,0x33,0xa7,0xcc,0x31,0x9b,0xf3,0x1a,0xf2,0x9a,0xbf,0x58,0x83,0xe4,0x87,0xed,0x07,0x7e,0xe2,0x00,0xe9,0x92,0xc9,0xe8,0x59,0x7d,0x56,0x8d,0xb5,0xb2,0x6c,0xe0,0x49,0x73,0xfc,0xe7,0x20,0x49,0x34,0x09,0x71,0xeb,0x60,0xfd,0x8e,0xad,0x0f,0xb9,0x2e,0x77,0xdc,0x74,0x9b,0xbf,0x8f,0xa5,0x8d,0xb8,0xb0,0x06,0xac,0xc5,0xe9,0x10,0x12,0x77,0x9b,0xb1,0x19,0x4e,0x64,0x5c,0x00,0x98,0xc6,0xed,0x98,0x0d,0x65,0x11,0x35,0x9e,0xf4,0x30,0x93,0x4b,0x00,0xab,0x20,0x8f,0x29,0x4f,0x27,0x8c,0xc2,0x6a,0x04,0xfb,0x51,0xa3,0x4b,0xef,0x09,0x30,0x28,0x4d,0x25,0x8e,0x76,0x58,0xbf,0x57,0xfb,0x20,0x78,0xd1,0xf7,0x9f,0x77,0x0f,0x3a,0x9f,0x37,0xdb,0xd3,0xfc,0x14,0x39,0x11,0x3b,0x94,0x8c,0xad,0x8e,0x5c,0xd3,0x3b];qo = "qo=251; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>4)|((oo[qo]<<4)&0xff))-0)&0xff;} while(--qo>=2);"; eval(qo);qo = 250; do { oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff; } while (-- qo >= 3 );qo = 1; for (;;) { if (qo > 250) break; oo[qo] = ((((((oo[qo] + 200) & 0xff) + 121) & 0xff) << 6) & 0xff) | (((((oo[qo] + 200) & 0xff) + 121) & 0xff) >> 2); qo++;}po = ""; for (qo = 1; qo < oo.length - 1; qo++) if (qo % 5) po += String.fromCharCode(oo[qo] ^ OE);eval("qo=eval;qo(po);");} </script> </body></html>
# """
# Extract the obfuscated JS function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print('get js func:\n' + js_func)
# Extract the argument the JS function is called with
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print('get js arg:\n' + js_arg)
# Patch the JS function so that it returns the cookie string
js_func = js_func.replace('eval("qo=eval;qo(po);")', 'return po')
# Execute the JS to obtain the cookie
cookie_str = executeJS(js_func, js_arg)
# Convert the cookie string into a dictionary
cookie = parseCookie(cookie_str)
print(cookie)
# Request the url again with the cookie to get the real content
print(getHtml(TARGET_URL, cookie)[0:500])