Web Crawler - Day 02 - Fetching and Parsing
阿新 · Published: 2018-05-09
###Page fetching###
1. urllib3
A powerful, easy-to-use HTTP client that fills gaps left by the Python standard library.
Install: pip install urllib3
Usage:
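import urllib3

# A PoolManager handles connection pooling for every request made through it
http = urllib3.PoolManager()
response = http.request('GET', 'http://news.qq.com')
print(response.headers)
# response.data is raw bytes; the author decodes it as GBK
# (adjust if the page declares a different charset)
result = response.data.decode('gbk')
print(result)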
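The usage example hard-codes `gbk` as the page encoding. As a minimal sketch, the charset can instead be read from the `Content-Type` response header. The helper name `decode_body` and the sample header values are assumptions made for illustration; they are not part of urllib3.

```python
from email.message import Message


def decode_body(data, content_type, default='utf-8'):
    """Decode raw bytes using the charset named in a Content-Type header."""
    msg = Message()
    msg['Content-Type'] = content_type or 'text/html'
    # get_content_charset() returns None when no charset parameter is present
    charset = msg.get_content_charset() or default
    try:
        return data.decode(charset, errors='replace')
    except LookupError:
        # Unknown charset name: fall back to the default encoding
        return data.decode(default, errors='replace')


# Local demonstration: these bytes stand in for response.data
raw = '新闻'.encode('gbk')
print(decode_body(raw, 'text/html; charset=GBK'))
```

With urllib3 this would be called as `decode_body(response.data, response.headers.get('Content-Type'))`.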
Sending requests over HTTPS
Install the dependency: pip install certifi
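import certifi
import urllib3

# Verify the server certificate against the CA bundle shipped with certifi
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
# Use an https:// URL so that the certificate check actually takes place
resp = http.request('GET', 'https://news.baidu.com/')
print(resp.data.decode('utf-8'))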
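`cert_reqs='CERT_REQUIRED'` asks for the server certificate to be validated against a CA bundle. What that setting means can be seen in the standard library's `ssl` module. This is a minimal offline sketch and makes no network request; the variable name `ctx` is only illustrative.

```python
import ssl

# The default client-side context already enforces certificate validation
ctx = ssl.create_default_context()

# True when the peer certificate must validate against a trusted CA
print(ctx.verify_mode == ssl.CERT_REQUIRED)
# True when the host name must also match the certificate
print(ctx.check_hostname)
```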
####With parameters
import urllib3
from urllib.parse import urlencode

http = urllib3.PoolManager()
args = {'wd': '人民幣'}
# url = 'http://www.baidu.com/s?%s' % (args)  # wrong: inserts the dict's repr
url = 'http://www.baidu.com/s?%s' % (urlencode(args))
print(url)
# resp = http.request('GET', url)
# print(resp.data.decode('utf-8'))

headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    # 'br' is left out: urllib3 decodes Brotli only when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.baidu.com',
    # Header values cannot carry raw non-ASCII text, so the query is percent-encoded
    'Referer': 'https://www.baidu.com/s?' + urlencode(args),
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
}
# For GET requests, urllib3 encodes `fields` into the query string
resp = http.request('GET', 'http://www.baidu.com/s', fields=args, headers=headers)
print(resp.data.decode('utf-8'))
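A query string with non-ASCII values has to be percent-encoded before it goes into a URL, which is what `urlencode` does. The following minimal offline sketch reuses the `wd` parameter from the example and checks that the encoding round-trips; the variable names `query` and `decoded` are illustrative only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

args = {'wd': '人民幣'}
# urlencode percent-encodes the UTF-8 bytes of each value
query = urlencode(args)
url = 'http://www.baidu.com/s?' + query
print(url)

# Parsing the query string recovers the original parameter
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Formatting the dict directly into the URL would insert its Python repr instead, which is why the example builds the URL with `urlencode(args)`.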