Python: downloading the page returned by a URL via GET/POST and handling gzip-garbled output. With Accept-Encoding set to gzip,deflate, the returned page is garbled
阿新 • Published: 2019-01-07
1. The script
# -*- coding: utf-8 -*-
import urllib
import urllib2
import ssl


def getpicyanzhengma():
    # Request the page from the server and save the response body to disk
    urlget = "https://xianzhi.aliyun.com/forum/topic/1805/"
    #ctl = {"ctl":"code"}
    #ctldata = urllib.urlencode(ctl)
    #reqget = urllib2.Request(urlget+'?'+ctldata)  # build the GET request with query parameters
    reqget = urllib2.Request(urlget)  # build the GET request
    # Add the GET request headers
    reqget.add_header("Host", "xianzhi.aliyun.com")
    reqget.add_header("Cache-Control", "max-age=0")
    reqget.add_header("Upgrade-Insecure-Requests", "1")
    reqget.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
    reqget.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
    reqget.add_header("Accept-Language", "zh-CN,zh;q=0.8")
    reqget.add_header("Accept-Encoding", "gzip, deflate, sdch, br")
    reqget.add_header("Cookie", "cnz=X6ejEtcUBVMCAfJ77XgdkdPH; cna=YKejEpKOFU0CAXjte/LuiuWB; UM_distinctid=16000343ca4183-0e8093cc5e7b3-64191279-15f900-16000343ca575a; _uab_collina=151183659981086744617448; _ga=GA1.2.668866163.1511831906; aliyun_country=CN; aliyun_site=CN; isg=ApmZtNphJydPxfuAkp4Fb9c1qIWzjqX8QOIT1rtOAUA_wrlUA3adqAfSsrFO; _umdata=ED82BDCEC1AA6EB94F984760A4C6465E6DD138CC3777AF0CB131A783FCB0E006227E021A199C6A8DCD43AD3E795C914C3303D9E6CB380052D470743247B79D15; acw_tc=AQAAAJMuFXttQgkA8nvteBqARscCdcug; csrftoken=CkpJbhBYBvg6oTBvrwTrsrYcsF1SJXC4mdv0A0k1BmX6mDFT0K2izVlfJkaZI4zx; CNZZDATA1260716569=1195371503-1511830276-https%253A%252F%252Fwww.baidu.com%252F%7C1515457887")
    reqget.add_header("Connection", "keep-alive")
    # Route traffic through a local proxy to capture and inspect the packets
    #proxy_handler = urllib2.ProxyHandler({'http': '192.168.40.36:4455'})
    #opener = urllib2.build_opener(proxy_handler)
    #urllib2.install_opener(opener)
    # SSL context that skips certificate verification; remove this line for plain HTTP
    context = ssl._create_unverified_context()
    # Open the request with the SSL context; for plain HTTP, drop the context argument
    resget = urllib2.urlopen(reqget, context=context)
    resgetdata = resget.read()
    print resgetdata
    # Save the response body of the GET request as an HTML file
    with open("e:/pic/downloadxianzhi.html", "wb") as f:
        f.write(resgetdata)


getpicyanzhengma()
2. Running the script produces garbled output
When Python requests the page with GET, the returned page content comes back garbled:
��<鵶壑�?�3�? �4挧OQW$礩'鄶蚽移澗懍�(�%+mf鐲謒蓫,_�!踨':\'柆��%@顓�� 奛獫dv9嘟飣魺 x踅脀櫖憮N翎F鏀窿R"�餑�賤r揉!薸:2�##胿�z螑 榗妍+邇嫣N_�;釞琾9��.hR迱T%�猙 鄖鐍�7C氹撴鬲5U礀6瑭菮糰 嶄U蛨�3翦�慏#�/[email protected],鵴JR$C鈊V8�'ⅠS98�+浼G閣uG :胰O巼#婈�.K��!�?" 槩瑔2龖XF� 箻np�$釀橷�茻Qx�0苃P梤� 姖g蒐洸譟杫1�1*#漚Yz個FZ匴UC74.偄偖G(^T!肶崇\ L$J囉Esb噘縭⒒@Sx擣�7b� ��%醜pa觵@€溼��肏摴褟餚�楚i斀*尲\�4OFy鮸燔_ H�:�=b|e�?�)3Ja礌挘ガ嗶吉枰0jΠ甎麵�0瞾橑辝��<�{�&尞 龖琣鋥c1AQ�&VPs6輑"欻DSd眘€p_孨u颫Hヌ�搒謡w�<�⒊淕瓜q�=鴫>�;�'M�籵淚D� �憅ZU�$撮L靠h溳 絬窶^)6錮I聖]�)
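The output above is not a character-encoding problem: the body is still compressed, and printing the compressed bytes as text yields garbage. A minimal Python 3 sketch of how to check this, assuming the server chose gzip (it could equally have picked another encoding the request advertised). The helper `looks_gzipped` is a hypothetical name, and the response is simulated locally rather than fetched:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(body):
    """Return True if the raw response bytes start with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


# Simulate a compressed response body locally instead of hitting the network.
html = "<!DOCTYPE html><title>demo</title>".encode("utf-8")
compressed = gzip.compress(html)

print(looks_gzipped(compressed))            # compressed bytes carry the gzip header
print(looks_gzipped(html))                  # plain HTML does not
print(gzip.decompress(compressed) == html)  # decompressing restores the page
```

If the check fails on a real response, the Content-Encoding response header is the more reliable place to look.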
3. Comment out the Accept-Encoding header:
#reqget.add_header("Accept-Encoding","gzip, deflate, sdch, br")
Result after running the script again:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>先知社群</title>
<meta name="description" content="先知社群,先知安全技術社群">
<meta name="viewport"
content="width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no">
<link rel="icon" href="/forum/static/icon/favicon.ico" type="image/x-icon">
<!-- Le styles -->
....................
.....................
.....................
......................
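Commenting out the header works because Python's HTTP client sends `Accept-Encoding: identity` when the caller sets none, so the server replies uncompressed. A middle ground is to advertise only encodings the client can actually undo. The sketch below is Python 3 (`urllib.request` instead of `urllib2`); `build_request` and `DECODABLE` are hypothetical names, and the function only builds the request without sending it:

```python
import urllib.request

# Encodings the standard library can decompress (gzip and zlib modules).
# "br" and "sdch" are left out on purpose: the stdlib has no decoder for them.
DECODABLE = ("gzip", "deflate")


def build_request(url, encodings=DECODABLE):
    """Build a GET request that advertises only the listed content encodings."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0")
    if encodings:
        req.add_header("Accept-Encoding", ", ".join(encodings))
    return req


req = build_request("https://xianzhi.aliyun.com/forum/topic/1805/")
# urllib stores header names capitalized as "Accept-encoding".
print(req.get_header("Accept-encoding"))
```

Passing `encodings=()` leaves the header unset, which matches the commented-out variant of the script.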
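4. Ways to fix the garbled HTML response:
There are two ways to get the correct page content instead of garbage (the snippets here use C#'s HttpWebRequest; the same idea applies in Python):
1. Do not set the Accept-Encoding header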
//req.Headers.Add("Accept-Encoding", "gzip,deflate");
2. Set the Accept-Encoding header and also enable the matching automatic decompression modes
req.Headers["Accept-Encoding"] = "gzip,deflate";
req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
Which approach to use depends on your needs.
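urllib2 has no counterpart to AutomaticDecompression, so in Python the second approach means decompressing the body yourself. A minimal Python 3 sketch, assuming the body has already been read as bytes and the Content-Encoding response header value is at hand. `decode_body` is a hypothetical helper, not part of any library:

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress raw response bytes according to the Content-Encoding value."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # HTTP "deflate" is specified as zlib-wrapped data.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send raw deflate without the zlib wrapper.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding == "identity":
        return raw
    # "br", "sdch" and others need third-party decoders; fail loudly instead.
    raise ValueError("unsupported Content-Encoding: %r" % encoding)


# Local round trip in place of a real response.
page = "<title>demo</title>".encode("utf-8")
print(decode_body(gzip.compress(page), "gzip") == page)
print(decode_body(zlib.compress(page), "deflate") == page)
```

With a real Python 3 response object, the header value would typically come from `resp.headers.get("Content-Encoding")`. Since this sketch handles only gzip and deflate, the request should not advertise `sdch` or `br`.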