Python Web Scraping from Beginner to Giving Up (4): Basic Usage of the Requests Library

阿新 • Published: 2019-01-27


What is Requests

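Requests is an HTTP library written in Python on top of urllib3 and released under the Apache2 License.
If you read the previous article on the urllib library, you will have noticed that urllib is rather awkward to use. Requests is far more convenient and saves a great deal of work (once you have used requests, you will rarely want to go back to urllib). In short, requests is the simplest, easiest-to-use HTTP library for Python, and it is the recommended choice for web scraping.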

Requests is not part of a default Python installation; install it separately with pip (pip install requests).

Requests features in detail

A quick overview

import requests

response  = requests.get("https://www.baidu.com")
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
print(response.content)
print(response.content.decode("utf-8"))

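As you can see, the response object is very convenient to work with. One thing to watch out for:
On many sites, reading response.text directly yields garbled characters, so response.content is used here instead.
response.content returns the raw bytes of the body; calling decode("utf-8") on them produces correctly decoded text, which avoids the garbling seen with response.text.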

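After a request is sent, Requests makes an educated guess about the response's encoding based on the HTTP headers. When you access response.text, Requests uses that guessed encoding. You can check which encoding Requests is using, and change it, through the response.encoding attribute. For example: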

response =requests.get("http://www.baidu.com")
response.encoding="utf-8"
print(response.text)

Either approach, response.content.decode("utf-8") or response.encoding = "utf-8", avoids the garbled text, provided the page really is encoded as UTF-8.
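A minimal sketch of why the garbling happens, assuming the server sends UTF-8 bytes with a Content-Type header that carries no charset. It calls requests.utils.get_encoding_from_headers, the helper Requests uses to derive response.encoding from the headers. The sample body and header values are made up for illustration, and no network request is made.

```python
from requests.utils import get_encoding_from_headers

# Bytes as a UTF-8 server would send them (sample text, chosen for illustration).
body = "百度一下".encode("utf-8")

# Encoding Requests derives when the header has no charset ...
guessed = get_encoding_from_headers({"content-type": "text/html"})
# ... and when the header declares one.
declared = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})

garbled = body.decode(guessed)   # what response.text would look like
correct = body.decode(declared)  # what response.content.decode("utf-8") gives

print(guessed, declared)
print(garbled == correct)
```

For text/* responses without a charset, the guessed encoding falls back to ISO-8859-1, which is why setting response.encoding or decoding response.content by hand fixes the output.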

Request methods

requests provides a function for each HTTP request method:

import requests
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

Making requests

Basic GET request

import requests

response = requests.get("http://httpbin.org/get")
print(response.text)

GET request with parameters, example 1

import requests

response = requests.get("http://httpbin.org/get?name=zhaofan&age=23")
print(response.text)

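To pass data in the URL query string, you would normally write it by hand, as in httpbin.org/get?key=val. Requests lets you pass these parameters as a dictionary through the params keyword argument instead. For example: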

import requests
data = {
    "name": "zhaofan",
    "age": 23
}
response = requests.get("http://httpbin.org/get",params=data)
print(response.url)
print(response.text)

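Both approaches give the same result: the dictionary passed through params is used to build the URL for you.
Note: with the dictionary approach, any key whose value is None is not added to the URL.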
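The URL construction described above can be checked without sending anything, by preparing the request and inspecting its url. This is a small sketch; the "nickname" key is a hypothetical extra parameter added only to show that None values are dropped.

```python
import requests

# "nickname" is a hypothetical key; its None value should not appear in the URL.
params = {"name": "zhaofan", "age": 23, "nickname": None}

# Request(...).prepare() builds the final URL without touching the network.
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()

print(prepared.url)  # http://httpbin.org/get?name=zhaofan&age=23
```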

Parsing JSON

import requests
import json

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))

The output shows that the json() method built into requests essentially runs json.loads() on the response body; both give the same result.

Fetching binary data

As mentioned above, response.content returns the body as raw bytes, so the same attribute can be used to download images and video files.
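A hedged sketch of such a download. The function name download, the chunk size and the 10-second timeout are choices made for this example, not something from the original article. It streams the body with iter_content instead of reading response.content in one go, which keeps memory use low for large videos.

```python
import requests


def download(url, path, chunk_size=8192):
    """Save the body of `url` to `path` as raw bytes and return the path."""
    # stream=True defers reading the body until iter_content is called.
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()  # fail early on 4xx/5xx responses
        with open(path, "wb") as fh:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return path
```

For a small image, writing response.content to a file opened in "wb" mode works just as well.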

Adding headers
As with urllib in the previous article, you can customize the request headers. For example, a plain requests call to the Zhihu website is refused by default:

import requests
response =requests.get("https://www.zhihu.com")
print(response.text)

This returns an error response:

[Image: screenshot of the error response]

import requests
headers = {

    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response =requests.get("https://www.zhihu.com",headers=headers)

print(response.text)

With a browser User-Agent header set, Zhihu can be accessed normally.

Basic POST request

To send form data with a POST request, pass a data argument. It can be built as a dictionary, which makes sending POST requests very convenient.

import requests

data = {
    "name":"zhaofan",
    "age":23
}
response = requests.post("http://httpbin.org/post",data=data)
print(response.text)

As with GET requests, a POST request can also take a dictionary of headers through the headers parameter.
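To see what data and headers actually turn into, the request can be prepared and inspected without sending it. This is a sketch; the User-Agent value "my-crawler/0.1" is a hypothetical placeholder.

```python
import requests

headers = {"User-Agent": "my-crawler/0.1"}  # hypothetical User-Agent value
data = {"name": "zhaofan", "age": 23}

# prepare() encodes the form body and merges the headers; nothing is sent.
prepared = requests.Request(
    "POST", "http://httpbin.org/post", data=data, headers=headers
).prepare()

print(prepared.body)                     # name=zhaofan&age=23
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.headers["User-Agent"])    # my-crawler/0.1
```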

Responses

A response object exposes many attributes, for example:

import requests

response = requests.get("http://www.baidu.com")
print(type(response.status_code),response.status_code)
print(type(response.headers),response.headers)
print(type(response.cookies),response.cookies)
print(type(response.url),response.url)
print(type(response.history),response.history)

The output looks like this:

[Image: output of the code above]

Checking status codes
Requests also ships with a built-in status code lookup object, requests.codes.
Its main entries are:

# Informational.
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),

The example below tests it (although comparing against the numeric status code directly is usually more convenient):

import requests

response= requests.get("http://www.baidu.com")
if response.status_code == requests.codes.ok:
    print("request succeeded")
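A small offline sketch of the lookup object, alongside raise_for_status() as an alternative way to check a response. The Response objects are built by hand here only to avoid a network call (in real code they come from requests.get), and the helper name is_success is made up for this example.

```python
import requests

# The named entries are plain integers.
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404


def is_success(status_code):
    """Return False if raise_for_status() would raise for this status code."""
    response = requests.Response()  # hand-built stand-in for a real response
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx
    except requests.exceptions.HTTPError:
        return False
    return True


print(is_success(200), is_success(404), is_success(500))
```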

Advanced usage of requests

File upload

This works like the other parameters: build a dictionary and pass it through the files parameter.

import requests
files= {"files":open("git.jpeg","rb")}
response = requests.post("http://httpbin.org/post",files=files)
print(response.text)

The output looks like this:

[Image: output of the upload request]
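A sketch of what the files parameter produces, prepared offline so that nothing is sent. It passes a (filename, content, content type) tuple instead of an open file; the bytes below are a placeholder, not a real JPEG.

```python
import requests

# Placeholder bytes standing in for the contents of git.jpeg.
fake_image = b"\xff\xd8\xff\xe0 placeholder image data"

# A (filename, content, content_type) tuple may be used in place of a file object.
files = {"files": ("git.jpeg", fake_image, "image/jpeg")}

prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

# The body is multipart/form-data with a generated boundary.
print(prepared.headers["Content-Type"])
print(b'filename="git.jpeg"' in prepared.body)
```

When uploading a real file, opening it inside a with block ensures the file handle is closed once the request has been sent.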

Getting cookies

import requests

response = requests.get("http://www.baidu.com")
print(response.cookies)

for key,value in response.cookies.items():
    print(key+"="+value)

Session persistence

One use of cookies is to simulate a login and keep a session alive across requests.

import requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456")
response = s.get("http://httpbin.org/cookies")
print(response.text)

That is the correct way to do it; the version below does not work:

import requests

requests.get("http://httpbin.org/cookies/set/number/123456")
response = requests.get("http://httpbin.org/cookies")
print(response.text)

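In the second version the two requests.get calls are independent of each other, so the cookie set by the first is not sent with the second. The first version creates a Session object and sends both requests through it, so the cookie is kept.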
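The difference can also be seen offline by looking at the headers each style would send. In this sketch the cookie is placed in the session's jar by hand, standing in for the one httpbin would set, and the requests are only prepared, not sent.

```python
import requests

URL = "http://httpbin.org/cookies"

session = requests.Session()
# Stand-in for the cookie that /cookies/set/number/123456 would store.
session.cookies.set("number", "123456")

# A request prepared through the session picks up the session's cookies ...
with_session = session.prepare_request(requests.Request("GET", URL))
# ... while a standalone request starts with an empty cookie jar.
standalone = requests.Request("GET", URL).prepare()

session_cookie = with_session.headers.get("Cookie")
standalone_cookie = standalone.headers.get("Cookie")

print(session_cookie)     # number=123456
print(standalone_cookie)  # None
```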

Certificate verification

Many websites are now served over HTTPS, which brings certificates into play.

import requests

response = requests.get("https://www.12306.cn")
print(response.status_code)

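At the time of writing, the certificate of the 12306 website was not trusted by default, so the request fails with the following error:

[Image: SSL certificate error]

To avoid this, you can pass verify=False, which turns off certificate checking.
The page then loads, but a warning is printed:
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning)

To silence the warning: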

import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get("https://www.12306.cn",verify=False)
print(response.status_code)

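The warning is no longer shown. Alternatively, keep verification on and pass verify the path to a CA bundle; the cert parameter takes the path to a client-side certificate.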

Proxy settings

import requests

proxies= {
    "http":"http://127.0.0.1:9999",
    "https":"http://127.0.0.1:8888"
}
response  = requests.get("https://www.baidu.com",proxies=proxies)
print(response.text)

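If the proxy requires a username and password, change the dictionary as follows (password and host are placeholders):
proxies = {
    "http":"http://user:password@host:9999"
}
If the proxy uses SOCKS, install the extra dependency first with pip install "requests[socks]":
proxies= {
    "http":"socks5://127.0.0.1:9999",
    "https":"socks5://127.0.0.1:8888"
}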

Timeout settings

The timeout parameter sets how long, in seconds, to wait before giving up on a request.
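A minimal sketch of the timeout parameter in use. The helper name fetch_status and the default values are choices made for this example. timeout accepts either a single number or a (connect, read) tuple, and both kinds of timeout are subclasses of requests.exceptions.Timeout.

```python
import requests


def fetch_status(url, connect_timeout=3.05, read_timeout=10):
    """Return the HTTP status code, or None if the request timed out."""
    try:
        # (connect, read): seconds to wait for the connection to open,
        # then seconds to wait for the server to send data.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.Timeout:
        return None
    return response.status_code
```

Calling fetch_status("http://httpbin.org/get") returns the status code when the server answers in time. Without a timeout argument, requests waits indefinitely.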

Authentication

For sites that require authentication, use the requests.auth module:

import requests

from requests.auth import HTTPBasicAuth

response = requests.get("http://120.27.34.24:9001/",auth=HTTPBasicAuth("user","123"))
print(response.status_code)

There is also a shorter form:

import requests

response = requests.get("http://120.27.34.24:9001/",auth=("user","123"))
print(response.status_code)

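Exception handling

The exceptions raised by requests are documented in detail here:
http://www.python-requests.org/en/master/api/#exceptions
All of them are defined in requests.exceptions.

[Image: exceptions defined in requests.exceptions]

The source code shows that RequestException inherits from IOError;
HTTPError, ConnectionError and Timeout inherit from RequestException;
ProxyError and SSLError inherit from ConnectionError;
ReadTimeout inherits from Timeout.
These are some of the commonly used exception relationships; for the full list see:
http://cn.python-requests.org/zh_CN/latest/_modules/requests/exceptions.html#RequestException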
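The inheritance chain above can be confirmed from the classes themselves. The helper ancestry is a made-up name for this sketch; it lists each exception's requests-level base classes, most specific first.

```python
from requests.exceptions import (
    ConnectionError,
    HTTPError,
    ProxyError,
    ReadTimeout,
    RequestException,
    SSLError,
    Timeout,
)


def ancestry(exc_class):
    """Names of exc_class and its bases, up to and including RequestException."""
    return [c.__name__ for c in exc_class.__mro__ if issubclass(c, RequestException)]


for exc_class in (HTTPError, ProxyError, SSLError, ReadTimeout):
    print(" -> ".join(ancestry(exc_class)))

# RequestException itself derives from IOError.
print(issubclass(RequestException, IOError))
```

Because of this hierarchy, except clauses should go from the most specific class to the most general one. A clause for RequestException placed first would catch everything.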

A simple demonstration:

import requests

from requests.exceptions import ReadTimeout,ConnectionError,RequestException


try:
    response = requests.get("http://httpbin.org/get",timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print("timeout")
except ConnectionError:
    print("connection Error")
except RequestException:
    print("error")

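Running this shows that the timeout exception is caught first. If you disconnect from the network, ConnectionError is caught instead. Anything not caught by the earlier clauses is still caught by the final RequestException clause.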
