Python: The Difference Between urllib and urllib3

阿新 • Published: 2020-11-23

Reposted from: https://www.cnblogs.com/jun-1024/p/10546826.html

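The urllib library

urllib is a Python standard library package for handling network requests. It contains four modules:

urllib.request --- the request module, used to send network requests

urllib.parse --- the parsing module, used to parse URLs

urllib.error --- the exception module, used to handle exceptions raised by urllib.request

urllib.robotparser --- used to parse robots.txt files

The urllib.request module

The request module is mainly responsible for building and sending network requests, including adding headers, proxies and so on. With it you can imitate the way a browser sends requests. It covers:

Sending network requests
Handling cookies
Adding headers
Using proxies

Parameters of urllib.request.urlopen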

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

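urlopen is a simple way to send a network request. It accepts a URL as a string (or a Request object), sends a request to that URL and returns the response.

Here is a simple example: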

from urllib import request
response = request.urlopen(url='http://www.httpbin.org/get')
print(response.read().decode())

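By default urlopen sends a GET request; when the data argument is passed, it sends a POST request instead. data must be a bytes object, a file-like object, or an iterable of bytes.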

from urllib import request
response = request.urlopen(url='http://www.httpbin.org/post',
                           data=b'username=q123&password=123')
print(response.read().decode())
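
The effect of data on the request method can be checked without sending anything. The following is a minimal sketch: it builds the form body with parse.urlencode (the field values are the ones from the example above) and asks a Request object which method it would use. Nothing here touches the network.

```python
from urllib import parse, request

# Build the form body from a dict; urlopen expects bytes, so encode it.
form = {'username': 'q123', 'password': '123'}
body = parse.urlencode(form).encode('ascii')

# A Request without data is a GET; the same Request with data becomes a POST.
get_req = request.Request('http://www.httpbin.org/get')
post_req = request.Request('http://www.httpbin.org/post', data=body)

print(body)                   # the encoded form data
print(get_req.get_method())   # method used when no data is given
print(post_req.get_method())  # method used when data is given
```

Building the body from a dictionary avoids hand-written query strings and takes care of percent-encoding special characters.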

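You can also set a timeout; if the request takes longer than the given time, an exception is raised. If timeout is not specified, the global default timeout is used. timeout only applies to HTTP, HTTPS and FTP connections. It is given in seconds, so timeout=0.1 sets a timeout of 0.1 seconds.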

from urllib import request
response = request.urlopen(url='https://www.baidu.com/',timeout=0.1)
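
With such a short timeout the call above will often raise, so in practice it is usually wrapped in a try block. The sketch below is one possible way to do that; the helper name fetch is hypothetical and not part of urllib. It catches URLError (which wraps failures while connecting) as well as socket.timeout, because a timeout that occurs while waiting for the response can surface without being wrapped.

```python
import socket
from urllib import error, request


def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper written for this example.
    """
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except error.URLError as exc:
        # Failures while connecting (including connect timeouts) end up here.
        print('request failed:', exc.reason)
    except socket.timeout:
        # The connection was made but no response arrived in time.
        print('request timed out after', timeout, 'seconds')
    return None


if __name__ == '__main__':
    body = fetch('https://www.baidu.com/', timeout=0.1)
    print('got a response' if body is not None else 'no response')
```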

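The Request object

urlopen can send the most basic requests, but its few simple parameters are not enough to build a complete request. The more powerful Request object lets you build a fuller one.

1. Adding request headers

Requests sent through urllib carry a default header, "User-Agent": "Python-urllib/3.6" (the suffix is the Python version in use), which reveals that the request was sent by urllib. For sites that check the User-Agent, we therefore need to define our own headers to disguise the request.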

from urllib import request
headers ={
    'Referer': 'https://www.baidu.com/s?ie=utf-8&f=3&rsv_bp=1&tn=baidu&wd=python%20urllib%E5%BA%93&oq=python%2520urllib%25E5%25BA%2593&rsv_pq=947af0af001c94d0&rsv_t=66135egC273yN5Uj589q%2FvA844PvH9087sbPe9ZJsjA8JA10Z2b3%2BtWMpwo&rqlang=cn&rsv_enter=0&prefixsug=python%2520urllib%25E5%25BA%2593&rsp=0',
     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
# Build the Request with the custom headers, then pass it to urlopen
req = request.Request(url='https://www.baidu.com/', headers=headers)
response = request.urlopen(req)
print(response.read().decode())
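
Headers can also be added one at a time after the Request has been created. The sketch below inspects the result without sending anything; the User-Agent value is a made-up placeholder. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
from urllib import request

req = request.Request('https://www.baidu.com/')

# add_header() sets one header at a time; the values here are placeholders.
req.add_header('User-Agent', 'Mozilla/5.0 (example)')
req.add_header('Referer', 'https://www.baidu.com/')

# Header names are stored via str.capitalize(), e.g. 'User-agent'.
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(req.header_items())
```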

2. Handling cookies

Cookie handling is very important when writing crawlers. The following example shows how urllib handles cookies:

from urllib import request
from http import cookiejar
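# Create a CookieJar object to hold the cookies
cookie = cookiejar.CookieJar()

# Create a cookie processor (handler) around it
cookies = request.HTTPCookieProcessor(cookie)

# Build an opener object with the handler as its argument
opener = request.build_opener(cookies)
# Use this opener to send the request
res = opener.open('https://www.baidu.com/')

# Cookies set by the server are now stored in the jar
print(cookies.cookiejar)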
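
A CookieJar only lives in memory. If the cookies should survive between runs, the standard library's MozillaCookieJar can save them to a file and load them back. The following is a sketch under assumptions: the helper name make_cookie_opener and the file name are hypothetical, and the real request is left commented out so the example runs offline.

```python
import os
import tempfile
from http import cookiejar
from urllib import request


def make_cookie_opener(path):
    """Return (opener, jar); the jar reads and writes cookies at `path`.

    `make_cookie_opener` is a hypothetical helper written for this example.
    """
    jar = cookiejar.MozillaCookieJar(path)
    if os.path.exists(path):
        # Reuse cookies saved by an earlier run.
        jar.load(ignore_discard=True, ignore_expires=True)
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    return opener, jar


# A fresh temporary directory keeps the example from touching real files.
cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
opener, jar = make_cookie_opener(cookie_file)

# opener.open('https://www.baidu.com/')  # uncomment to collect real cookies

# ignore_discard=True also keeps session cookies, which are dropped by default.
jar.save(ignore_discard=True, ignore_expires=True)
print(len(jar), 'cookies saved to', cookie_file)
```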

3. Setting a proxy

When running a crawler, your IP often gets blocked; in that case an IP proxy is needed. urllib is configured to use a proxy as follows:

from urllib import request
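url = 'http://httpbin.org/ip'

# Proxy address (a placeholder; replace it with a proxy you can actually reach)
proxy = {'http': '172.0.0.1:3128'}

# Proxy handler: ProxyHandler takes the scheme-to-proxy mapping
proxies = request.ProxyHandler(proxy)

# Build the opener object
opener = request.build_opener(proxies)

res = opener.open(url)
print(res.read().decode())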

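Whichever class or function in urllib sends the request, a response object is returned (an http.client.HTTPResponse for HTTP and HTTPS URLs). It holds the data that came back and provides attributes and methods for working with the result:

read() reads the response body; the body can only be read once

readline() reads one line

info() returns the response headers

geturl() returns the URL that was actually retrieved

getcode() returns the status code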
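
The methods listed above can be tried without a network connection. The sketch below assumes a data: URL as a stand-in for a real site, which urlopen also understands; over HTTP the same calls work on the returned HTTPResponse. It shows in particular that the body can only be read once.

```python
from urllib import request

# A data: URL carries its own body, so no network access is needed.
url = 'data:text/plain;charset=utf-8,line%20one%0Aline%20two'

with request.urlopen(url) as response:
    first_line = response.readline()   # reads up to and including the newline
    rest = response.read()             # reads whatever is left
    again = response.read()            # the body is already consumed
    content_type = response.info().get('Content-Type')
    final_url = response.geturl()

print(first_line)
print(rest)
print(again)
print(content_type)
print(final_url == url)
```

If the body is needed more than once, store the result of read() in a variable instead of calling read() again.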

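The urllib.parse module

parse.urlencode(): requests often need to carry many parameters, and concatenating them by hand with string methods is tedious. parse.urlencode() builds the URL query string from them.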

from urllib import parse
params = {'wd': '测试', 'code': 1, 'height': 188}
res = parse.urlencode(params)
print(res)

This prints wd=%E6%B5%8B%E8%AF%95&code=1&height=188

parse.parse_qs() converts it back into a dictionary (each value is returned as a list):

print(parse.parse_qs('wd=%E6%B5%8B%E8%AF%95&code=1&height=188'))
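
Putting the two functions together, a full round trip might look like the sketch below. It appends the encoded query to the httpbin URL used earlier, splits the URL again with parse.urlparse, and decodes the query. Note that parse_qs returns every value as a list of strings, so the integers do not come back as integers.

```python
from urllib import parse

params = {'wd': '测试', 'code': 1, 'height': 188}

# Encode the parameters and append them to a base URL.
query = parse.urlencode(params)
url = 'http://www.httpbin.org/get?' + query

# Split the URL into its components, then decode the query part.
parts = parse.urlparse(url)
decoded = parse.parse_qs(parts.query)

print(url)
print(parts.netloc, parts.path)
print(decoded)
```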

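The urllib.error module

The error module is mainly responsible for exceptions. If a request goes wrong, it can be handled with the classes in this module, chiefly URLError and HTTPError.

URLError: the base class of the exceptions in the error module; exceptions raised by the request module can be handled with this class

HTTPError: a subclass of URLError with three main attributes

code: the HTTP status code of the response
reason: the reason for the error
headers: the response headers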

from urllib import request,error
try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)

else:
    print("request successful")
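
HTTPError is caught first in the example above because it is a subclass of URLError; with the order reversed, the URLError branch would handle every case. The sketch below illustrates the relationship offline by constructing the exceptions by hand. The describe helper and the simulated 404 are assumptions made for the illustration, not something urllib provides.

```python
from email.message import Message
from urllib import error


def describe(exc):
    """Summarise an exception; check the more specific class first.

    `describe` is a hypothetical helper written for this example.
    """
    if isinstance(exc, error.HTTPError):
        return 'HTTP %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: %s' % exc.reason
    return 'unexpected: %r' % (exc,)


# Simulated errors, built by hand so that no request is needed.
not_found = error.HTTPError('http://pythonsite.com/1111.html', 404,
                            'Not Found', Message(), None)
unreachable = error.URLError('connection refused')

print(describe(not_found))
print(describe(unreachable))
print(issubclass(error.HTTPError, error.URLError))
```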

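The urllib.robotparser module

The robotparser module is mainly responsible for parsing the crawler protocol file, robots.txt, for example https://www.taobao.com/robots.txt

The Robots protocol (also called the crawler protocol or robot protocol) is formally the Robots Exclusion Protocol. Through it a website tells search engines which pages may be crawled and which may not.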
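
As a minimal sketch of how the module is used, the example below feeds a small hand-written rule set to RobotFileParser.parse() and then asks whether two URLs may be fetched. The rules and the example.com URLs are made up for the illustration; for a real site you would call set_url() with the address of its robots.txt and then read().

```python
from urllib import robotparser

# A made-up robots.txt: everything is allowed except /private/.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch('*', 'https://example.com/index.html')
blocked = parser.can_fetch('*', 'https://example.com/private/data.html')

print(allowed)
print(blocked)
```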

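The urllib3 library

urllib3 is a powerful, user-friendly HTTP client for Python. More and more Python applications are adopting urllib3, which provides many important features that are missing from the Python standard library.

Installation: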

pip install urllib3

Building a request

import urllib3
# Create a PoolManager, which manages the connection pools
http = urllib3.PoolManager()
# Send the request
res = http.request('GET', 'https://www.baidu.com/')
# Status code
print(res.status)
# Returned data
print(res.data.decode())
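
Timeouts and retries are among the features urllib3 adds on top of the standard library, and both can be set once on the PoolManager. The sketch below assumes urllib3 is installed; the numbers are arbitrary example values, and the actual request is left commented out so the snippet runs offline.

```python
import urllib3

# Example values only: 2 s to connect, 5 s to wait for data, up to 3 retries.
timeout = urllib3.Timeout(connect=2.0, read=5.0)
retries = urllib3.Retry(total=3, backoff_factor=0.5)

# Every request made through this manager uses these defaults.
http = urllib3.PoolManager(timeout=timeout, retries=retries)

# res = http.request('GET', 'https://www.baidu.com/')  # uncomment to send

# The defaults are kept in the keyword arguments used for new pools.
print(http.connection_pool_kw['timeout'])
print(http.connection_pool_kw['retries'])
```

Individual calls can still override the defaults by passing timeout or retries to request().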

Sending a POST request

import urllib3
# Create a PoolManager
http = urllib3.PoolManager()
# Send the request; fields is sent as the form body
res = http.request('POST', 'https://www.baidu.com/', fields={'hello': 'word'})
# Status code
print(res.status)
# Returned data
print(res.data.decode())

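The HTTP response object provides attributes such as status, data and headers:

status--the status code

data--the returned data

headers--the response headers

JSON data in a response can be loaded into a dictionary with the json module.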

import json
import urllib3

http = urllib3.PoolManager()
data = {'attribute': 'value'}
encode_data = json.dumps(data).encode()

r = http.request('POST',
                 'http://httpbin.org/post',
                 body=encode_data,
                 headers={'Content-Type': 'application/json'}
                 )
# Parse the JSON response body into a dictionary
print(json.loads(r.data.decode('utf-8')))

Response data is always bytes. For large amounts of data it is better to process it as a stream:

import urllib3
http = urllib3.PoolManager()
r =http.request('GET','http://httpbin.org/bytes/1024',preload_content=False)
for chunk in r.stream(32):
    print(chunk)

The response can also be treated as a file object:

import urllib3
http = urllib3.PoolManager()
r =http.request('GET','http://httpbin.org/bytes/1024',preload_content=False)
for chunk in r:
    print(chunk)
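
A common use of streaming is saving a large response to disk without holding it all in memory. The following is a sketch of that idea, assuming urllib3 is installed; the function name download and its parameters are hypothetical. It also returns the connection to the pool when it is done, which matters whenever preload_content=False is used.

```python
import urllib3


def download(url, path, chunk_size=1024):
    """Stream the body of `url` into the file at `path`.

    `download` is a hypothetical helper; it returns the bytes written.
    """
    http = urllib3.PoolManager()
    response = http.request('GET', url, preload_content=False)
    written = 0
    try:
        with open(path, 'wb') as out:
            # stream() yields the body in pieces of up to chunk_size bytes.
            for chunk in response.stream(chunk_size):
                out.write(chunk)
                written += len(chunk)
    finally:
        # Hand the connection back to the pool so it can be reused.
        response.release_conn()
    return written


if __name__ == '__main__':
    size = download('http://httpbin.org/bytes/1024', 'bytes.bin')
    print(size, 'bytes written')
```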

urllib3 proxies (proxy IP)

import urllib3
proxy = urllib3.ProxyManager('http://172.0.0.1:3128')
res =proxy.request('GET','https://www.baidu.com/')
print(res.data)

urllib3 headers (adding request headers)

import urllib3
http = urllib3.PoolManager()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
res = http.request('GET','https://www.baidu.com/',headers=headers)
print(res.data)

JSON: to send JSON data, pass the encoded bytes as the body argument of request and specify the Content-Type request header:
import json
data={'attribute':'value'}
encode_data= json.dumps(data).encode()

r = http.request('POST',
                     'http://httpbin.org/post',
                     body=encode_data,
                     headers={'Content-Type':'application/json'}
                 )
print(r.data.decode('unicode_escape'))

To upload binary data, pass it as body and set the Content-Type request header.

# Upload a file with multipart/form-data encoding: pass it through fields, just like form data,
# with the file given as a tuple (file_name, file_data), optionally followed by a MIME type:
with open('1.txt','r+',encoding='UTF-8') as f:
    file_read = f.read()

r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield':('1.txt', file_read, 'text/plain')
                         })
print(r.data.decode('unicode_escape'))

# Binary file
with open('websocket.jpg','rb') as f2:
    binary_read = f2.read()

r = http.request('POST',
                 'http://httpbin.org/post',
                 body=binary_read,
                 headers={'Content-Type': 'image/jpeg'})
#
# print(json.loads(r.data.decode('utf-8'))['data'] )
print(r.data.decode('utf-8'))
