A Detailed Guide to requests, Python's Web-Scraping Request Module
requests
Compared with urllib, the third-party requests library is simpler and more user-friendly, and it is one of the libraries most commonly used in scraping work.
Installing requests
Beginner scraping mostly relies on the requests module.
To install it:
On Windows, in cmd:
pip install requests
On macOS, in the terminal:
pip3 install requests
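A quick import is enough to confirm the installation worked (this check is an addition for convenience, not part of the original walkthrough):

```python
import requests

# If the import succeeds, the installation worked; the version string
# tells you which release pip picked.
print(requests.__version__)
```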
Basic usage of the requests library

```python
import requests

url = 'https://www.csdn.net/'
response = requests.get(url)
# response.text returns the data as a Unicode string (str)
print(response.text)
```
Methods and attributes of the response object
response.text — returns the data as a Unicode string (str)
response.content — returns the raw byte stream (binary)
response.content.decode('utf-8') — decodes the bytes manually
response.url — returns the URL
response.encoding = 'encoding' — sets the encoding used when decoding response.text
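As a minimal offline sketch of how these attributes relate, we can build a Response by hand (the private `_content` attribute is used here purely for illustration; in real scraping the object comes back from `requests.get()`):

```python
import requests

# Hand-built Response, used only to show .content vs .text offline;
# normally this object is returned by requests.get()
response = requests.models.Response()
response._content = '爬蟲'.encode('utf-8')  # raw byte stream (bytes)
response.encoding = 'utf-8'

print(response.content)                   # b'\xe7\x88\xac\xe8\x9f\xb2'
print(response.text)                      # 爬蟲 -- decoded via response.encoding
print(response.content.decode('utf-8'))   # 爬蟲 -- manual decode, same result
```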
Status codes
response.status_code: check the status code of the response
For example:
200: request succeeded
301: permanent redirect
302: temporary redirect
403: the server refused the request
404: request failed (the server could not find the resource (web page) the client asked for)
```python
# Import requests
import requests

# Call requests.get() to send a request to the server; the url argument
# is the address we want to visit, and the response we get back is kept
# in the variable response
url = 'https://www.csdn.net/'  # the CSDN homepage
response = requests.get(url)

# response.status_code: check the status code of the response
print(response.status_code)
```

200
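Instead of comparing status_code by hand, requests can raise an exception on error codes via response.raise_for_status(). A small offline sketch, again using a hand-built Response purely for demonstration:

```python
import requests

response = requests.models.Response()
response.status_code = 404   # simulate a "not found" response

print(response.ok)            # False for any 4xx/5xx status
try:
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx
except requests.HTTPError as err:
    print('request failed:', err)
```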
Request methods
The request methods requests supports:

```python
p = requests.get(url)
p = requests.post(url)
p = requests.put(url, data={'key': 'value'})
p = requests.delete(url)
p = requests.head(url)
p = requests.options(url)
```
GET requests
GET is HTTP's default request method.
* It has no request body
* The data it carries is limited to roughly 1 KB (in practice, browsers cap the URL length)
* GET data is exposed in the browser's address bar
Common situations that produce GET requests:
1. Typing a URL directly into the browser's address bar always sends a GET request
2. Clicking a hyperlink on a page also always sends a GET request
3. Submitting a form sends a GET request by default, though the form can be set to use POST
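Because GET data travels in the URL, requests exposes it through the params argument. The sketch below uses Request.prepare() to show the resulting URL without sending anything over the network (the query values are made up for illustration):

```python
import requests

# params becomes the URL's query string; prepare() builds the request
# locally so we can inspect the URL without hitting the network
req = requests.Request('GET', 'https://www.csdn.net/',
                       params={'q': '爬蟲', 'page': 1})
prepared = req.prepare()
print(prepared.url)   # the data is appended (percent-encoded) to the URL
```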
POST requests
(1) The data does not appear in the address bar
(2) There is no fixed limit on the size of the data
(3) It has a request body
(4) Non-ASCII characters (such as Chinese) in the request body are URL-encoded
requests.post() is used exactly like requests.get(); the difference is that requests.post() takes a data parameter that holds the request body.
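A sketch of the same idea for POST, again inspected via prepare() so nothing is actually sent; note how the data dict ends up form-encoded in the request body rather than in the URL (the field values are made up):

```python
import requests

# data becomes the form-encoded request body of the POST request
req = requests.Request('POST', 'https://www.csdn.net/',
                       data={'key': 'value', 'name': '爬蟲'})
prepared = req.prepare()
print(prepared.body)                      # the data lives in the request body
print(prepared.headers['Content-Type'])   # application/x-www-form-urlencoded
```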
Request headers
When we open a web page, the browser sends HTTP request headers to the site's server, and the server uses them to generate the content for that request and send it back to the browser.
We can set the request headers manually:

```python
import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
url = 'https://www.csdn.net/'
response = requests.get(url, headers=header)
# Print the response as text
print(response.text)
```
Setting a proxy with requests
To use a proxy with requests, just pass the proxies parameter to the request method (get/post).
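A minimal sketch, assuming you have a proxy of your own; the address below is a placeholder, so the actual request line is commented out:

```python
import requests

# Placeholder proxy address -- replace with a real proxy before use
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}
url = 'https://www.csdn.net/'
# resp = requests.get(url, proxies=proxies)  # traffic is routed through the proxy
```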
Cookies
A cookie identifies a user through information recorded on the client side.
HTTP is a connectionless protocol: the interaction between client and server is limited to a single request/response cycle, after which the connection is closed. On the next request the server would treat the client as a brand-new one. To maintain the relationship between them, and let the server know that a request comes from the same user as before, the client's information has to be stored somewhere.
Working with cookies in requests is simple: just pass the cookies parameter.
```python
import requests

# This cookie string was copied from the browser's developer console on the CSDN site
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'cookie': 'uuid_tt_dd=10_30835064740-1583844255125-466273; dc_session_id=10_1583844255125.696601; __gads=ID=23811027bd34da29:T=1583844256:S=ALNI_MY6f7VlmNJKxrkHd2WKUIBQ34Bbnw; UserName=xdc1812547560; UserInfo=708aa833b2064ba9bb8ab0be63866b58; UserToken=708aa833b2064ba9bb8ab0be63866b58; UserNick=xdc1812547560; AU=F85; UN=xdc1812547560; BT=1590317415705; p_uid=U000000; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_30835064740-1583844255125-466273!5744*1*xdc1812547560; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22xdc1812547560%22%2C%22scope%22%3A1%7D%7D; log_Id_click=1; Hm_lvt_feacd7cde2017fd3b499802fc6a6dbb4=1595575203; Hm_up_feacd7cde2017fd3b499802fc6a6dbb4=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22xdc1812547560%22%2C%22scope%22%3A1%7D%7D; Hm_ct_feacd7cde2017fd3b499802fc6a6dbb4=5744*1*xdc1812547560!6525*1*10_30835064740-1583844255125-466273; Hm_up_facf15707d34a73694bf5c0d571a4a72=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22xdc1812547560%22%2C%22scope%22%3A1%7D%7D; Hm_ct_facf15707d34a73694bf5c0d571a4a72=5744*1*xdc1812547560!6525*1*10_30835064740-1583844255125-466273; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Flive.csdn.net%252Froom%252Fyzkskaka%252Fats4dBdZ%253Futm_source%253D908346557%2522%252C%2522announcementCount%2522%253A0%257D; Hm_lvt_facf15707d34a73694bf5c0d571a4a72=1596946584,1597134917,1597155835,1597206739; searchHistoryArray=%255B%2522%25E8%258F%259C%25E9%25B8%259FIT%25E5%25A5%25B3%2522%252C%2522%25E5%25AE%25A2%25E6%259C%258D%2522%255D; log_Id_pv=7; log_Id_view=8; dc_sid=c0efd34d6da090a1fccd033091e0dc53; TY_SESSION_ID=7d77f76f-a4b1-43ef-9bb5-0aebee8ee475; c_ref=https%3A//www.baidu.com/link; c_first_ref=www.baidu.com; c_first_page=https%3A//www.csdn.net/; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1597245305,1597254589,1597290418,1597378513; c_segment=1; dc_tos=qf1jz2; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1597387359'
}
url = 'https://www.csdn.net/'
response = requests.get(url, headers=header)
# Print the response as text
print(response.text)
```
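Rather than pasting the whole Cookie header by hand, the cookies parameter mentioned above takes a plain dict. A sketch with made-up values, inspected via prepare() instead of a live request, shows requests folding the dict into a Cookie header for us:

```python
import requests

# Made-up cookie values, for illustration only
cookies = {'UserName': 'xdc1812547560', 'UN': 'xdc1812547560'}
req = requests.Request('GET', 'https://www.csdn.net/', cookies=cookies)
prepared = req.prepare()
print(prepared.headers.get('Cookie'))   # the dict becomes the Cookie header
```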
session
A session identifies a user through information recorded on the server side.
Here, "session" refers to a conversation between the client and the server.
The Session object is a more advanced feature that keeps certain parameters across requests, such as cookies shared by all requests made from the same Session instance. Like a browser, we do not need to attach the cookie to every request ourselves: the Session automatically sends previously received cookies with subsequent requests. This is especially convenient when making consecutive requests to the same site.
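The behavior described above can be sketched offline: we plant a cookie in a Session by hand (as if a login response had set it) and use prepare_request() to show that the Session attaches it to the next request automatically, without touching the network:

```python
import requests

session = requests.Session()
session.cookies.set('token', 'abc123')   # pretend the server set this earlier

# prepare_request() merges the Session's stored state into the request,
# letting us inspect the outgoing headers without sending anything
prepared = session.prepare_request(requests.Request('GET', 'https://www.csdn.net/'))
print(prepared.headers.get('Cookie'))    # token=abc123 is sent automatically
```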
Handling untrusted SSL certificates
What is an SSL certificate?
An SSL certificate is a kind of digital certificate, similar to an electronic copy of a driver's licence, passport, or business licence.
Because it is installed on a server, it is also called an SSL server certificate. An SSL certificate follows the SSL protocol and is issued by a trusted certificate authority (CA) after the server's identity has been verified; it provides server authentication and encrypts the data in transit.
Let's scrape a site whose certificate is not quite in order:
```python
import requests

url = 'https://inv-veri.chinatax.gov.cn/'
resp = requests.get(url)
print(resp.text)
```

This raises an SSL certificate verification error.
Let's adjust the code:
```python
import requests

url = 'https://inv-veri.chinatax.gov.cn/'
# verify=False tells requests to skip certificate verification
resp = requests.get(url, verify=False)
print(resp.text)
```
With this change the code scrapes the page successfully again.
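One caveat: with verify=False, urllib3 (the library underneath requests) still emits an InsecureRequestWarning on every request. It can be silenced as sketched below (the request line itself is commented out):

```python
import requests
import urllib3

# Silence the InsecureRequestWarning that verify=False would otherwise
# trigger on every request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://inv-veri.chinatax.gov.cn/'
# resp = requests.get(url, verify=False)   # no warning is printed now
```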
That concludes this article on Python's requests scraping module. For more on using requests for scraping, search our earlier articles or browse the related articles below. We hope you will keep supporting us!