Python Web Scraping: Basic Usage of requests

阿新 • Published: 2019-01-04

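Introduction

Requests is an HTTP library written in Python on top of urllib3 and released under the Apache 2.0 open-source license. It is far more convenient than the standard-library urllib and saves a great deal of work.

I. Installation

Quick install with pip: pip install requests

II. Usage

1. Start with some code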

import requests

response  = requests.get("https://www.baidu.com")
print(type(response))
print(response.status_code)
print(type(response.text))
# Get the response headers

print(response.headers)
print(response.headers['content-type'])
# The request headers are available this way
print(response.request.headers)

response.encoding = "utf-8"
print(response.text)

print(response.cookies)

print(response.content)
print(response.content.decode("utf-8"))

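response.text returns a Unicode string, decoded with the encoding requests infers from the response. When that guess is wrong the text comes out garbled, so for a UTF-8 page you usually have to set the encoding to utf-8 first. response.content is the raw bytes, which is what you want when downloading videos and similar files; to read it as text, decode it as utf-8.

Either response.content.decode("utf-8") or response.encoding = "utf-8" avoids the garbled-text problem.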
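
The two approaches can be compared offline with a small sketch. To avoid a network call it builds a Response by hand and fills the private _content attribute; that shortcut is an assumption made only for illustration, since in real code the bytes come from requests.get.

```python
import requests

# Hand-built Response so the example runs offline. Setting the private
# _content attribute is an illustration-only shortcut (assumption).
response = requests.Response()
response.status_code = 200
response._content = "中文".encode("utf-8")

# A wrong encoding guess produces garbled text
response.encoding = "ISO-8859-1"
garbled = response.text

# Approach 1: tell requests the real encoding, then read .text
response.encoding = "utf-8"
fixed_via_encoding = response.text

# Approach 2: decode the raw bytes yourself
fixed_via_decode = response.content.decode("utf-8")

print(repr(garbled))
print(fixed_via_encoding, fixed_via_decode)
```

Both approaches give the same string; setting response.encoding is handy when you keep using response.text afterwards.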

2. The many request methods

import requests
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

Basic GET:

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)
print(response.text)

GET request with parameters:

The example below sends the contents of data to this address as query-string parameters.

import requests

url = 'http://httpbin.org/get'
data = {
    'name':'zhangsan',
    'age':'25'
}
response = requests.get(url,params=data)
print(response.url)
print(response.text)

JSON data:

From the output of the code below we can conclude:

1. The response.json() method in requests is equivalent to json.loads(response.text)

import requests
import json

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))
# httpbin's /get response has no 'data' key, so read 'url' instead
print(response.json()['url'])

Getting binary data

As mentioned above, response.content returns the data as raw bytes; the same attribute can be used to download images and video files.
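
A minimal sketch of such a download follows. The helper name download_file and its parameters are hypothetical, and it streams the body with iter_content instead of reading response.content in one go, which is a swapped-in variant that keeps memory use low for large videos.

```python
import requests


def download_file(url, path, chunk_size=8192):
    """Save the binary body of `url` to `path` (hypothetical helper)."""
    # stream=True defers the body download until iter_content is consumed
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
    return path
```

Usage would look like download_file("https://example.com/picture.jpg", "picture.jpg"), where the URL is a placeholder. For small files, writing response.content to a file opened in "wb" mode works just as well.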

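Adding headers

First, why add headers at all? As an example, let's try to visit the Zhihu login page (as everyone knows, you cannot see Zhihu's content without logging in) and see what error we get without any header information.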

import requests

url = 'https://www.zhihu.com/'
response = requests.get(url)
response.encoding = "utf-8"
print(response.text)

Result:

The server reports an internal error (in other words, you cannot even download the HTML of the Zhihu login page).

<html><body><h1>500 Server Error</h1>
An internal server error occured.
</body></html>

To access the page, you have to add headers.

import requests

url = 'https://www.zhihu.com/'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}
response = requests.get(url,headers=headers)
print(response.text)
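
When many requests need the same User-Agent, one option is to set it once on a Session instead of passing headers to every call. A minimal sketch, prepared locally so that nothing is actually sent; the constant name USER_AGENT is an assumption for illustration.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
)

session = requests.Session()
# Headers set on the session are merged into every request it sends
session.headers.update({"User-Agent": USER_AGENT})

# prepare_request shows the merged headers without contacting the server
prepared = session.prepare_request(
    requests.Request("GET", "https://www.zhihu.com/")
)
print(prepared.headers["User-Agent"])
```

A real call would then be session.get(url), with no headers argument needed.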

Basic POST request:

POST submits data to the URL; passing a dictionary is equivalent to submitting the fields of an HTML form.

import requests

url = 'http://httpbin.org/post'
data = {
    'name':'jack',
    'age':'23'
    }
response = requests.post(url,data=data)
print(response.text)

Result:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "23",
    "name": "jack"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "16",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.13.0"
  },
  "json": null,
  "origin": "118.144.137.95",
  "url": "http://httpbin.org/post"
}
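
The output above shows the dictionary arriving under "form" with a form-urlencoded Content-Type. As a hedged comparison, the sketch below prepares the same payload once with data= and once with json= (without sending either) to show how the body and Content-Type differ.

```python
import json

import requests

url = "http://httpbin.org/post"
payload = {"name": "jack", "age": "23"}

# data= encodes the dictionary as an HTML form body
form_request = requests.Request("POST", url, data=payload).prepare()
# json= serialises the dictionary as a JSON document instead
json_request = requests.Request("POST", url, json=payload).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json_request.body)

# The JSON body round-trips back to the original dictionary
print(json.loads(json_request.body) == payload)
```

Which one to use depends on what the server expects: HTML form handlers usually want data=, JSON APIs usually want json=.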

Response:

import requests

# allow_redirects=False disables redirects; with True they are followed
response = requests.get("http://www.baidu.com",allow_redirects=False)
# Print the status code of the response
print(type(response.status_code),response.status_code)
# Print all response headers
print(type(response.headers),response.headers)
# Print the cookies
print(type(response.cookies),response.cookies)
# Print the URL
print(type(response.url),response.url)
# Print the request history (shown as a list)
print(type(response.history),response.history)

Built-in status codes:

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),

import requests
response = requests.get('http://www.jianshu.com/404.html')
# Use requests' built-in status names to check the status code

# If the status code is anything other than OK, print '404'
if response.status_code != requests.codes.ok:
    print('404')

# If the page returns status code 200, print the line below
response = requests.get('http://www.jianshu.com')
if response.status_code == 200:
    print('200')
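
Besides comparing status_code by hand, a response can be asked to raise on error codes with raise_for_status(). A minimal offline sketch; the Response is built by hand purely for illustration, whereas in real code it comes from requests.get.

```python
import requests

# Names from the table above resolve to numeric codes
print(requests.codes.ok, requests.codes.not_found)

# Hand-built Response (illustration only) so no network is needed
response = requests.Response()
response.status_code = 404

try:
    # Raises HTTPError for 4xx and 5xx status codes, does nothing otherwise
    response.raise_for_status()
    raised = False
except requests.exceptions.HTTPError as error:
    raised = True
    print("HTTP error:", error)
```

This keeps the happy path free of if-checks: call raise_for_status() right after the request and handle HTTPError in one place.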

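Advanced usage of requests

File upload

import requests
url = "http://httpbin.org/post"
# Open the file in a with block so it is closed after the upload
with open("test.jpg","rb") as f:
    files = {"files":f}
    response = requests.post(url,files=files)
print(response.text)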


Getting cookies

import requests
response = requests.get('https://www.baidu.com')
print(response.cookies)
for key,value in response.cookies.items():
    print(key,'==',value)

Session persistence

One use of cookies is to simulate a login and keep a session alive.

import requests
session = requests.session()
session.get('http://httpbin.org/cookies/set/number/12456')
response = session.get('http://httpbin.org/cookies')
print(response.text)
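
What the session does between the two calls can be sketched offline: a cookie stored in the session's cookie jar is attached to later requests for the matching domain. Here the cookie is set by hand (an assumption for illustration) instead of by the httpbin server, and the request is only prepared, not sent.

```python
import requests

session = requests.Session()

# Store a cookie in the session's jar, as the /cookies/set call would
session.cookies.set("number", "12456", domain="httpbin.org", path="/")

# A later request to the same domain carries the cookie automatically
prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
print(prepared.headers["Cookie"])
```

Plain requests.get calls do not share a cookie jar, which is why a login flow needs a Session.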

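Certificate verification

1. Access without a trusted certificate

import requests
response = requests.get('https://www.12306.cn')
# For https requests, requests verifies the certificate and raises an exception if verification fails
print(response.status_code)

Error: when verification fails, requests raises requests.exceptions.SSLError.

Disabling certificate verification

import requests
# Verification is turned off, but a certificate warning is still emitted
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)

Passing verify=False avoids the exception and lets you fetch the page, but an InsecureRequestWarning is still printed.

Suppressing the certificate warning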

from requests.packages import urllib3
import requests

urllib3.disable_warnings()
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)

Specifying a certificate manually

import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

Proxy settings

1. Setting a plain proxy

import requests

proxies = {
  "http": "http://127.0.0.1:9743",
  "https": "https://127.0.0.1:9743",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

2. Setting a proxy with a username and password
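
A minimal sketch for this case, assuming the proxy accepts credentials embedded in its URL as user:password@host:port. The credentials, the address, and the helper name fetch_via_proxy are all hypothetical.

```python
import requests
from requests.utils import get_auth_from_url

# Hypothetical credentials and proxy address, for illustration only
proxy_url = "http://user:password@127.0.0.1:9743/"
proxies = {"http": proxy_url, "https": proxy_url}

# requests extracts the credentials from the proxy URL like this
print(get_auth_from_url(proxy_url))


def fetch_via_proxy(url):
    """Send a GET request through the authenticated proxy (hypothetical helper)."""
    response = requests.get(url, proxies=proxies, timeout=10)
    return response.status_code
```

Calling fetch_via_proxy("https://www.taobao.com") only works when a proxy is really listening on that address. Special characters in the password need to be percent-encoded in the URL.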

Setting a SOCKS proxy

Install the SOCKS extra: pip3 install 'requests[socks]'

import requests

proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742'
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

Timeout settings

The timeout parameter sets how long to wait, in seconds.

import requests
from requests.exceptions import ReadTimeout

try:
    # The response must arrive within 500 ms, otherwise a ReadTimeout exception is raised
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
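
timeout also accepts a (connect, read) tuple, and catching the Timeout base class covers both ConnectTimeout and ReadTimeout. A minimal sketch; the helper name fetch_status and the default values are assumptions for illustration.

```python
import requests
from requests.exceptions import Timeout


def fetch_status(url, connect_timeout=3.05, read_timeout=10):
    """Return the status code, or None if the request timed out."""
    try:
        # First value: seconds to establish the connection.
        # Second value: seconds to wait for the server to send data.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except Timeout:
        # Timeout is the parent of both ConnectTimeout and ReadTimeout
        return None
    return response.status_code
```

For example, fetch_status("http://httpbin.org/get", read_timeout=0.5) returns None when the server takes longer than 500 ms to answer.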

Authentication settings

Sites that require authentication can be handled with the requests.auth module.

import requests
from requests.auth import HTTPBasicAuth

# Method 1
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
# Method 2
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
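
Both forms produce the same HTTP Basic Authorization header. The sketch below prepares the two requests locally, without sending them, to make that visible.

```python
import requests
from requests.auth import HTTPBasicAuth

url = "http://120.27.34.24:9001"

# Prepare both variants without sending them
explicit = requests.Request("GET", url, auth=HTTPBasicAuth("user", "123")).prepare()
shorthand = requests.Request("GET", url, auth=("user", "123")).prepare()

# Each carries "Basic " followed by base64("user:123")
print(explicit.headers["Authorization"])
print(shorthand.headers["Authorization"] == explicit.headers["Authorization"])
```

Basic auth only base64-encodes the credentials, so it should be combined with https on untrusted networks.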

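Exception handling

All exceptions are defined in requests.exceptions.

From the source code we can see that:
RequestException inherits from IOError,
HTTPError, ConnectionError and Timeout inherit from RequestException; ProxyError and SSLError inherit from ConnectionError,
ReadTimeout inherits from Timeout.

The example below gives a simple demonstration.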

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException
try:
    response = requests.get("http://httpbin.org/get", timeout = 0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')

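The first exception caught is the timeout; if the network is disconnected, ConnectionError is caught instead; and anything not caught by the earlier clauses is finally caught by RequestException.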
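
The inheritance chain described above can be checked directly with issubclass. A small sketch that needs no network access:

```python
from requests.exceptions import (
    ConnectionError,
    HTTPError,
    ProxyError,
    ReadTimeout,
    RequestException,
    SSLError,
    Timeout,
)

# Each entry maps "child -> parent" to the result of the subclass check
hierarchy = {
    "RequestException -> IOError": issubclass(RequestException, IOError),
    "HTTPError -> RequestException": issubclass(HTTPError, RequestException),
    "ConnectionError -> RequestException": issubclass(ConnectionError, RequestException),
    "Timeout -> RequestException": issubclass(Timeout, RequestException),
    "ProxyError -> ConnectionError": issubclass(ProxyError, ConnectionError),
    "SSLError -> ConnectionError": issubclass(SSLError, ConnectionError),
    "ReadTimeout -> Timeout": issubclass(ReadTimeout, Timeout),
}

for relation, holds in hierarchy.items():
    print(relation, holds)
```

Because of this chain, except clauses should go from the most specific exception to the most general one, as in the example above; putting RequestException first would swallow everything.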
