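Web Scraping with the requests Module
Introduction
Before learning web scraping, it helps to have a rough understanding of the HTTP protocol.
HTTP protocol: https://www.cnblogs.com/peng104/p/9846613.html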
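The basic workflow of a crawler
Overview
Requests is an HTTP library written in Python on top of urllib3 and released under the Apache2 License. It is much more convenient than the standard library's urllib and saves a great deal of work. In short, requests is about the simplest and most user-friendly HTTP library available for Python, and it is the recommended choice for web scraping. A default Python installation does not include the requests module; it has to be installed separately with pip.
Installation: pip install requests
Source code: https://github.com/kennethreitz/requests
Chinese documentation and API reference: http://docs.python-requests.org/zh_CN/latest/index.html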
Basic syntax
Request methods supported by the requests module:
    import requests

    requests.get("http://httpbin.org/get")
    requests.post("http://httpbin.org/post")
    requests.put("http://httpbin.org/put")
    requests.delete("http://httpbin.org/delete")
    requests.head("http://httpbin.org/get")
    requests.options("http://httpbin.org/get")
GET requests
1. Basic request
    import requests

    response = requests.get('https://www.jd.com/')
    # Save the raw response bytes to a local file
    with open("jd.html", "wb") as f:
        f.write(response.content)
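2. Request with parameters
    import requests

    # Option 1: write the query string directly into the URL
    response = requests.get('https://s.taobao.com/search?q=手機')
    # Option 2: pass a params dict and let requests build and encode the query string
    response = requests.get('https://s.taobao.com/search', params={"q": "三只松鼠"})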
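To see what the params argument does without touching the network, a request can be prepared locally and its final URL inspected. The following is a minimal sketch that assumes only the public requests.Request and prepare() API; the URL and query value are the ones from the example above, and nothing is actually sent.

```python
import requests
from urllib.parse import quote

# Build the request locally; prepare() encodes params into the URL but sends nothing.
prepared = requests.Request(
    "GET",
    "https://s.taobao.com/search",
    params={"q": "三只松鼠"},
).prepare()

# Non-ASCII values are percent-encoded as UTF-8 bytes.
print(prepared.url)

# The same encoding produced by hand with the standard library.
expected_url = "https://s.taobao.com/search?q=" + quote("三只松鼠")
print(prepared.url == expected_url)
```

Passing params is usually preferable to hand-writing the query string, because the encoding is done consistently for you.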
3. Request with headers
    import requests

    response = requests.get(
        'https://dig.chouti.com/',
        headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        },
    )
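One reason to set the User-Agent header is that requests identifies itself as python-requests by default, which some sites may treat differently from a browser. The sketch below, which assumes the requests.utils.default_headers() helper and Session.prepare_request(), compares the default with a browser-like value without sending a request.

```python
import requests

# What requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)

# A browser-like User-Agent, taken from the example above.
browser_ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36")

# prepare_request() merges the session defaults with the per-request headers
# without sending anything; per-request values override the defaults.
session = requests.Session()
request = requests.Request("GET", "https://dig.chouti.com/",
                           headers={"User-Agent": browser_ua})
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```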
4. Request with cookies
    import uuid
    import requests

    url = 'http://httpbin.org/cookies'
    # Send a randomly generated cookie value
    cookies = dict(sbid=str(uuid.uuid4()))
    res = requests.get(url, cookies=cookies)
    print(res.text)
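5. requests.session()
    import requests

    session = requests.session()
    res1 = session.get("https://www.zhihu.com/explore")
    # Cookies set by the server are stored on the session
    print(session.cookies.get_dict())
    # Cookies passed to a single call are sent with that request only
    res2 = session.get("https://www.zhihu.com/question/30565354/answer/463324517", cookies={"abs": "123"})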
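The difference between cookies kept on the session and cookies passed to a single call can be checked offline. This is a rough sketch, assuming Session.prepare_request() and a made-up sbid value standing in for a cookie that a server would normally set; no request is sent.

```python
import requests

session = requests.Session()
# Stand-in for a cookie the server would set on an earlier response (hypothetical value).
session.cookies.set("sbid", "abc123")

# Per-request cookies are merged with the session's cookies for this request only.
request = requests.Request("GET", "http://httpbin.org/cookies", cookies={"abs": "123"})
prepared = session.prepare_request(request)

# Both cookies end up in the outgoing Cookie header.
sent_cookies = set(prepared.headers["Cookie"].split("; "))
print(sent_cookies)

# The per-request cookie was not stored on the session.
print(session.cookies.get_dict())
```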
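POST requests
1. The data parameter
requests.post() is called in the same way as requests.get(). What it adds in practice is the data parameter, which carries the request body.
    import requests

    # params goes into the URL query string; data goes into the request body
    response = requests.post("http://httpbin.org/post", params={"a": "10"}, data={"name": "peng"})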
2. Sending JSON data
    import requests

    # No Content-Type is specified; with data= the default is application/x-www-form-urlencoded
    res1 = requests.post(url='http://httpbin.org/post', data={'name': 'yuan'})
    print(res1.json())

    # With json= the default Content-Type is application/json
    res2 = requests.post(url='http://httpbin.org/post', json={'age': "22"})
    print(res2.json())
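The effect of data= versus json= on the outgoing request can be inspected without contacting httpbin. The sketch below assumes the requests.Request and prepare() API and simply prints the Content-Type header and body that would be sent.

```python
import json
import requests

url = "http://httpbin.org/post"

# data= with a dict produces a form-encoded body.
form_request = requests.Request("POST", url, data={"name": "yuan"}).prepare()
print(form_request.headers["Content-Type"])
print(form_request.body)

# json= serializes the dict to JSON and sets the matching Content-Type.
json_request = requests.Request("POST", url, json={"age": "22"}).prepare()
print(json_request.headers["Content-Type"])
print(json.loads(json_request.body))
```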
The response object
1. Common attributes
    import requests

    response = requests.get('https://sh.lianjia.com/ershoufang/')
    # Attributes of the response object
    print(response.text)  # body decoded to str
    print(response.content)  # raw body bytes
    print(response.status_code)
    print(response.headers)
    print(response.cookies)
    print(response.cookies.get_dict())
    print(response.cookies.items())
    print(response.url)
    print(response.history)
    print(response.encoding)
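2. Encoding issues
    import requests

    response = requests.get('http://www.autohome.com/news')
    # The Autohome page is encoded as gb2312. When a text response declares no charset,
    # requests falls back to ISO-8859-1, so the Chinese text is garbled unless the
    # encoding is set to gbk by uncommenting the next line.
    # response.encoding = 'gbk'
    with open("res.html", "w") as f:
        f.write(response.text)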
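The garbling described above can be reproduced offline. The sketch below assumes the requests.utils.get_encoding_from_headers() helper to show the fallback, and uses a short sample string encoded by hand as a stand-in for response.content; the headers are made up for illustration.

```python
from requests.utils import get_encoding_from_headers

# Fallback chosen for a text response whose header declares no charset.
fallback = get_encoding_from_headers({"content-type": "text/html"})
# An explicit charset in the header takes precedence.
declared = get_encoding_from_headers({"content-type": "text/html; charset=gb2312"})
print(fallback, declared)

# Stand-in for response.content: GBK-encoded bytes as a server might send them.
original = "汽车之家"
raw_bytes = original.encode("gbk")

garbled = raw_bytes.decode("ISO-8859-1")  # decoding with the fallback encoding
fixed = raw_bytes.decode("gbk")           # decoding after setting the encoding to gbk
print(garbled == original, fixed == original)
```

Setting response.encoding before reading response.text has the same effect as choosing the right codec in the last step.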
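3. Downloading binary files (images, video, audio)
    import requests

    # stream=True defers downloading the body until it is iterated
    response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg', stream=True)
    with open("res.jpg", "wb") as f:
        # f.write(response.content)
        # For a very large download (say a 100 GB video), reading response.content
        # and writing it in one go is unreasonable, so write it chunk by chunk
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)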
4. Parsing JSON data
    import requests
    import json

    response = requests.get('http://httpbin.org/get')
    res1 = json.loads(response.text)  # works, but verbose
    res2 = response.json()  # parse the JSON body directly
    print(res1 == res2)
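5. Redirection and History
By default, Requests handles all redirects automatically, except for HEAD requests. You can use the history attribute of the response object to trace redirects. Response.history is a list of the Response objects that were created in order to complete the request, sorted from the oldest response to the most recent.
    >>> r = requests.get('http://github.com')
    >>> r.url
    'https://github.com/'
    >>> r.status_code
    200
    >>> r.history
    [<Response [301]>]
You can also disable redirect handling with the allow_redirects parameter:
    >>> r = requests.get('http://github.com', allow_redirects=False)
    >>> r.status_code
    301
    >>> r.history
    []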
Advanced usage
Proxies
Free proxies
If you need to use a proxy, you can configure an individual request by passing the proxies parameter to any request method:
    import requests

    # Choose a proxy according to the protocol of the target URL
    proxies = {
        "http": "http://12.34.56.79:9527",
        "https": "http://12.34.56.79:9527",
    }
    response = requests.get("http://www.baidu.com", proxies=proxies)
    print(response.text)
Proxies can also be configured through the local environment variables HTTP_PROXY and HTTPS_PROXY:
    export HTTP_PROXY="http://12.34.56.79:9527"
    export HTTPS_PROXY="https://12.34.56.79:9527"
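How the keys of a proxies dict are matched against a target URL can be illustrated with the select_proxy() helper from requests.utils. This is a small sketch under the assumption that the helper behaves as in current requests releases; the proxy address is the placeholder from the example above, and nothing is sent.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "http://12.34.56.79:9527",
}

# The key is matched against the scheme of the URL being requested.
http_proxy = select_proxy("http://www.baidu.com", proxies)
https_proxy = select_proxy("https://www.baidu.com", proxies)
# There is no entry for this scheme, so no proxy is selected.
ftp_proxy = select_proxy("ftp://www.baidu.com", proxies)
print(http_proxy, https_proxy, ftp_proxy)
```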
Private proxies
    import requests

    # If the proxy requires HTTP Basic Auth, use the following format (user:password@host:port):
    proxy = {"http": "mr_mao_hacker:[email protected]:16816"}
    response = requests.get("http://www.baidu.com", proxies=proxy)
    print(response.text)
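When the proxy password contains characters such as @ or :, writing it straight into the URL makes the address ambiguous. One possible approach, sketched below, is to percent-encode the credentials first. The helper build_proxy_url and all of the values passed to it are hypothetical and exist only for illustration.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return scheme://user:password@host:port with the credentials percent-encoded."""
    return "{}://{}:{}@{}:{}".format(
        scheme, quote(user, safe=""), quote(password, safe=""), host, port
    )


# Hypothetical credentials and address, for illustration only.
proxy_url = build_proxy_url("demo_user", "p@ss:word", "10.0.0.1", 8080)
print(proxy_url)

# The resulting URL can then be used as the value in a proxies dict.
proxies = {"http": proxy_url, "https": proxy_url}
```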
Web client authentication
For web client authentication, pass auth=(username, password):
    import requests

    auth = ('test', '123456')
    response = requests.get('http://192.168.199.107', auth=auth)
    print(response.text)
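Passing a two-element tuple as auth makes requests add an HTTP Basic Authorization header. The sketch below prepares the request from the example above locally, without sending it, and decodes the header to show what it contains.

```python
import base64
import requests

# Prepare the request locally; nothing is sent to the server.
prepared = requests.Request(
    "GET", "http://192.168.199.107", auth=("test", "123456")
).prepare()

auth_header = prepared.headers["Authorization"]
print(auth_header)

# The header is the word "Basic" followed by base64("username:password").
scheme, encoded = auth_header.split(" ", 1)
decoded = base64.b64decode(encoded).decode("latin1")
print(scheme, decoded)
```

Because base64 is an encoding and not encryption, the credentials are only protected in transit when the request goes over HTTPS.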
Two examples
1. Simulating a GitHub login and retrieving login information
    import requests
    import re

    # Request 1: load the login page
    r1 = requests.get('https://github.com/login')
    r1_cookie = r1.cookies.get_dict()  # initial cookie (not yet authorized)
    # Extract the CSRF token from the page
    authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]
    print("authenticity_token", authenticity_token)

    # Request 2: POST to the session endpoint with the initial cookie, the token,
    # and the account credentials
    data = {
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': authenticity_token,
        'login': 'your GitHub username',
        'password': 'your password',
    }
    r2 = requests.post('https://github.com/session',
                       data=data,
                       cookies=r1_cookie,
                       # allow_redirects=False
                       )
    print(r2.status_code)  # 200
    print(r2.url)  # the page after the redirect: https://github.com/
    print(r2.history)  # the responses before the redirect: [<Response [302]>]
    print(r2.history[0].text)  # body of the pre-redirect response
    with open("result.html", "wb") as f:
        f.write(r2.content)
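The token extraction step can be tried on its own before involving the network. The sketch below runs the same regular expression against a made-up form fragment; the HTML is hypothetical and the markup of the real login page may differ.

```python
import re

# Hypothetical fragment of a login form; the real page markup may differ.
sample_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==" />
  <input type="text" name="login" />
</form>
'''

tokens = re.findall(r'name="authenticity_token".*?value="(.*?)"', sample_html)
# Guard against a missing token instead of indexing [0] blindly.
authenticity_token = tokens[0] if tokens else None
print(authenticity_token)
```

Checking for an empty result gives a clearer failure than an IndexError if the page layout changes.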
2. Scraping Douban movie information
    import requests
    import re
    import json
    import time
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(50)

    def getPage(url):
        # Fetch one page and return its HTML
        response = requests.get(url)
        return response.text

    def parsePage(res):
        # Match one movie entry at a time using named groups
        com = re.compile(
            r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
            r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)評價</span>', re.S)
        iter_result = com.finditer(res)
        return iter_result

    def gen_movie_info(iter_result):
        # Turn each match into a dict
        for i in iter_result:
            yield {
                "id": i.group("id"),
                "title": i.group("title"),
                "rating_num": i.group("rating_num"),
                "comment_num": i.group("comment_num"),
            }

    def stored(gen):
        # Append one JSON object per line
        with open("move_info.txt", "a", encoding="utf8") as f:
            for line in gen:
                data = json.dumps(line, ensure_ascii=False)
                f.write(data + "\n")

    def spider_movie_info(url):
        res = getPage(url)
        iter_result = parsePage(res)
        gen = gen_movie_info(iter_result)
        stored(gen)

    def main(num):
        url = 'https://movie.douban.com/top250?start=%s&filter=' % num
        pool.submit(spider_movie_info, url)
        # spider_movie_info(url)

    if __name__ == '__main__':
        before = time.time()
        count = 0
        for i in range(10):
            main(count)
            count += 25
        # Wait for all submitted tasks to finish so the timing covers the actual work
        pool.shutdown(wait=True)
        after = time.time()
        print("Total time elapsed:", after - before)
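The parsing stage of this example can be exercised without requesting any pages. The sketch below applies the same regular expression to a small, made-up HTML fragment shaped like the entries the pattern expects; the fragment and its values are hypothetical and much simpler than the real page.

```python
import re

# Same pattern as in the example above, split across lines for readability.
ITEM_PATTERN = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
    r'<span class="title">(?P<title>.*?)</span>.*?'
    r'<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?'
    r'<span>(?P<comment_num>.*?)評價</span>',
    re.S,
)

# Hypothetical markup for two entries.
sample_html = '''
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人評價</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>6789人評價</span>
</div>
'''

# Each match becomes a dict keyed by the named groups.
movies = [match.groupdict() for match in ITEM_PATTERN.finditer(sample_html)]
for movie in movies:
    print(movie)
```

Testing the pattern on a saved fragment like this makes it easier to notice when a change in the page layout breaks the scraper.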