Basic Usage of the Python 3 urllib Library

阿新 • Published: 2018-10-31

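1. What Is urllib

urllib is Python's built-in HTTP request library. It contains the following modules:

urllib.request        request module
urllib.error          exception handling module
urllib.parse          URL parsing module
urllib.robotparser    robots.txt parsing module

The first three are used most often; the fourth is mentioned only for reference.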

2. urllib Methods

The explanations below follow the official urllib documentation. We start with the urllib.request module:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example 1:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Only the first argument, the URL, is used here. Because no data is attached, this is a GET request. read() returns the response body as a byte string, so decode() is called with UTF-8 to turn it into readable page source.
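The read-then-decode step can be tried without network access. The following is a minimal sketch, not one of the original examples: it opens a data: URL (an assumption chosen only so that the code runs offline) and takes the charset from the response headers, falling back to UTF-8. The same calls should apply to a response from an http URL.

```python
import urllib.request

# A data: URL carrying the UTF-8 bytes of "café"; chosen only so that
# the example runs without a network connection.
url = 'data:text/plain;charset=utf-8,caf%C3%A9'

with urllib.request.urlopen(url) as response:
    raw = response.read()  # the body is always bytes
    # Prefer the charset declared by the response, else assume UTF-8.
    charset = response.headers.get_content_charset() or 'utf-8'
    text = raw.decode(charset)

print(type(raw).__name__, text)
```

Using the response as a context manager closes it once the body has been read.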

Example 2:

import urllib.parse
import urllib.request
d = bytes(urllib.parse.urlencode({'name':'zhangsan'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=d)
print(response.read().decode('utf-8'))

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "zhangsan"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "13",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "183.209.153.56",
  "url": "http://httpbin.org/post"
}

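This example uses the second argument, data, which turns the call into a POST request; the URL is an HTTP testing site. urlopen requires data to be a bytes object, so the dictionary is URL-encoded and then converted to bytes.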
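How data switches the request method can be checked without sending anything. This sketch builds two Request objects by hand (the /get path is an assumption for illustration) and asks each for its method; the 13-byte body matches the Content-Length shown in the output above.

```python
from urllib import parse, request

form = {'name': 'zhangsan'}
# urlencode() returns str; encode() turns it into the bytes urlopen expects.
body = parse.urlencode(form).encode('utf-8')

get_req = request.Request('http://httpbin.org/get')               # no data
post_req = request.Request('http://httpbin.org/post', data=body)  # with data

print(body, len(body))
print(get_req.get_method())
print(post_req.get_method())
```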

Example 3:

# Set a timeout for the request
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    # e.reason holds the underlying cause of the failure
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

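The timeout argument is set to a value so small that no response can arrive in time, so the call raises an exception. Checking the type of the exception and printing a matching message is a common pattern; here, 'Time Out' is printed when the cause is a timeout.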
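One detail worth guarding against: a timeout that occurs while the response is being read may surface as a bare socket.timeout rather than wrapped in URLError. The sketch below handles both cases. The names is_timeout and fetch are hypothetical helpers, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc represents a timeout, wrapped or not."""
    if isinstance(exc, socket.timeout):
        return True
    return (isinstance(exc, urllib.error.URLError)
            and isinstance(exc.reason, socket.timeout))


def fetch(url, timeout):
    """Hypothetical helper: return the body, or None on a timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # anything else is a real error


# The classification can be checked without any network access.
print(is_timeout(urllib.error.URLError(socket.timeout('timed out'))))
print(is_timeout(urllib.error.URLError('connection refused')))
```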

Example 4:

# Useful attributes and methods of the response
import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)               # HTTP status code
print(response.getheaders())         # list of (name, value) tuples
print(response.getheader('Server'))  # value of a single header

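status is the status code, and getheaders() returns the response headers. When custom request headers have to be sent, however, urlopen called with a plain URL is not enough, so a different object is needed.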

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Example 1:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36',
    'Host': 'httpbin.org'
}
# Named form rather than dict, to avoid shadowing the built-in
form = {
    'name': 'zhangsan'
}

# data must be bytes; method is passed as a string
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

This sends a POST request through a Request object and attaches custom request headers.
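A Request object can be inspected before it is sent, which helps when checking what will go over the wire. The sketch below uses a shortened User-Agent value and an extra Accept header, both assumptions for illustration. Note that urllib stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent'.

```python
from urllib import request

req = request.Request(
    'http://httpbin.org/post',
    data=b'name=zhangsan',
    headers={'User-Agent': 'Mozilla/5.0'},
    method='POST',
)
# Headers can also be added after construction.
req.add_header('Accept', 'application/json')

print(req.get_method())
print(req.full_url)
print(req.header_items())  # list of (name, value) tuples
```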

urllib.request.build_opener([handler, ...])

Handlers are one of the most useful tools in urllib. When requests have to go through an IP proxy, or a crawler has to keep a session alive with cookies, the corresponding handler does the work. The cookie handler serves as the example here.

Example 2:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

# The jar now holds the cookies set by the response
for item in cookie:
    print(item.name, item.value)

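CookieJar() creates a cookie container, urllib.request.HTTPCookieProcessor() wraps it in a handler, and build_opener() builds an opener from that handler, which can then send HTTP requests. When a response sets cookies, the handler extracts them and stores them in the cookie object. This matters when a crawler has to maintain a session: the stored cookies can then be attached to later requests.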
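build_opener() accepts several handlers at once, so cookie handling and a proxy can be combined in one opener. The sketch below only builds the opener and lists its handlers; the proxy address is a hypothetical placeholder and no request is sent.

```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
# Hypothetical proxy address, for illustration only.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})

opener = urllib.request.build_opener(cookie_handler, proxy_handler)

# build_opener keeps the default handlers and adds the ones passed in.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Requests made with opener.open() would then pass through both handlers.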

Example 3:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Also keep session cookies and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)

MozillaCookieJar is a subclass of CookieJar that can write cookies to a local file.

Example 4:

import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()
# Keyword names are lowercase: ignore_discard, ignore_expires
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

The load() method of the cookie object reads cookies from a local file, so later requests can keep the session state.
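The save and load round trip can be tried offline. In the sketch below the cookie is built by hand instead of being received from a server; make_cookie, the cookie name and value, and the example.com domain are all assumptions for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand; normally the server sets it."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookie.txt')

    saved = http.cookiejar.MozillaCookieJar(filename)
    saved.set_cookie(make_cookie('sessionid', 'abc123', 'example.com'))
    # A session cookie has discard=True, so save() skips it unless
    # ignore_discard=True is passed.
    saved.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(filename, ignore_discard=True, ignore_expires=True)

pairs = [(c.name, c.value) for c in loaded]
print(pairs)
```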

Next is the urllib.error module.

urllib.error

Example 1:

from urllib import request, error

try:
    response = request.urlopen('http://bucunzai.com/index.html')
except error.HTTPError as e:
    # HTTPError is the subclass, so it is caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

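The official documentation shows that HTTPError is a subclass of URLError, so the subclass has to be caught first. Running the example confirms that HTTPError is caught. According to the documentation, HTTPError has three attributes: reason, code, and headers. In this run, code is 404. The next example shows a common pattern: checking which kind of error occurred.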
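The subclass relationship, and why the order of the checks matters, can be verified without a request. The exceptions in this sketch are constructed by hand; describe is a hypothetical helper and the URL and messages are made up for illustration.

```python
from email.message import Message
from urllib import error


def describe(exc):
    """Classify an exception, testing the subclass first."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP error {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: {}'.format(exc.reason)
    return 'other error'


# Exceptions built by hand so that no request is needed.
http_exc = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', Message(), None)
url_exc = error.URLError('name resolution failed')

print(issubclass(error.HTTPError, error.URLError))
print(describe(http_exc))
print(describe(url_exc))
```

If the URLError check came first, it would also match every HTTPError and the status code would never be reported.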

Example 2:

from urllib import request, error
import socket

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)
except error.URLError as e:
    # e.reason is the underlying exception; check whether it is a timeout
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

Finally, a look at the commonly used functions in urllib.parse.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Example 1:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

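The last line is the output. urlparse analyses the structure of the URL and splits it into a tuple of six components. The scheme argument only supplies a default: it is used when the URL carries no scheme, and ignored when the URL has one, as in this example.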
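A short sketch of when the default scheme takes effect. The three input URLs are variations made up for illustration. Note that without a leading '//' the host name is treated as part of the path, not as netloc.

```python
from urllib.parse import urlparse

# No scheme and no '//': the default applies, and the host lands in path.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
# Leading '//': the default applies, and the host is recognised as netloc.
with_slashes = urlparse('//www.baidu.com/index.html', scheme='https')
# Explicit scheme: the default is ignored.
explicit = urlparse('http://www.baidu.com/index.html', scheme='https')

print(no_scheme)
print(with_slashes)
print(explicit.scheme)
```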

Example 2:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user#comment', allow_fragments=False)
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user#comment', query='', fragment='')

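When allow_fragments is set to False, the fragment is not split off; it stays attached to the nearest preceding component that has content. For the meaning of each URL component, see the notes to this article.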
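Which component absorbs the fragment depends on what the URL contains. In this sketch, with input URLs made up for illustration, the fragment ends up in query when a query is present and in path when nothing else follows the path.

```python
from urllib.parse import urlparse

# With a query string, '#comment' stays inside query.
with_query = urlparse('http://www.baidu.com/index.html?id=5#comment',
                      allow_fragments=False)
# With neither params nor query, '#comment' stays inside path.
path_only = urlparse('http://www.baidu.com/index.html#comment',
                     allow_fragments=False)
# Default behaviour for comparison: the fragment is split off.
default = urlparse('http://www.baidu.com/index.html#comment')

print(with_query.query, with_query.fragment)
print(path_only.path, path_only.fragment)
print(default.path, default.fragment)
```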

urllib.parse.urlunparse(parts)

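Joins the URL components back together. The argument is an iterable of six parts, such as a list.

Example 1:

from urllib.parse import urlunparse

# Order: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

# http://www.baidu.com/index.html;user?a=6#comment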
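urlunparse is the inverse of urlparse, which makes it convenient for changing a single component of an existing URL. A minimal sketch, assuming the goal is to replace the query and drop the fragment:

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'
parts = urlparse(url)

# Round trip: parsing and unparsing gives back the original URL.
rebuilt = urlunparse(parts)

# ParseResult is a named tuple, so _replace() returns a modified copy.
changed = urlunparse(parts._replace(query='a=7', fragment=''))

print(rebuilt)
print(changed)
```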

urllib.parse.urljoin(base, url, allow_fragments=True)

Example 1:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'index.html'))
print(urljoin('http://www.baidu.com#comment', '?username="zhangsan"'))
print(urljoin('http://www.baidu.com', 'www.sohu.com'))

# http://www.baidu.com/index.html
# http://www.baidu.com?username="zhangsan"
# http://www.baidu.com/www.sohu.com

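The joining rules deserve attention: components that the second argument supplies and the first lacks are added, while components present in both are overwritten by the second. The second print shows a result to watch out for: joining replaces the lower-level components of the base, so the #comment fragment is lost. The third print is a counterexample. Many people assume parsing starts at the domain name, but it actually starts at '//'. The official documentation is explicit: "If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result." Once again, the official documentation is the best learning tool.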
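The following sketch contrasts the cases. The base URL and the second arguments are made up for illustration; only a second argument that starts with '//' or with a scheme replaces the host.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

relative = urljoin(base, 'c.html')                  # replaces the last path segment
root_relative = urljoin(base, '/c.html')            # replaces the whole path
scheme_relative = urljoin(base, '//www.sohu.com/index.html')  # replaces the host
absolute = urljoin(base, 'https://www.sohu.com')    # replaces everything

for result in (relative, root_relative, scheme_relative, absolute):
    print(result)
```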

urllib.parse.urlencode()

urlencode() converts a dictionary into a URL query string.

Example 1:

from urllib.parse import urlencode

params = {
    'name': 'zhangsan',
    'age': 22
}

base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

# http://www.baidu.com?name=zhangsan&age=22
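urlencode() also escapes characters that are not safe in a query string, and parse_qs() reverses the conversion. A brief sketch with made-up values; the doseq=True call shows one way to encode a key that carries several values.

```python
from urllib.parse import parse_qs, urlencode

params = {'name': 'zhang san', 'lang': 'C++'}
encoded = urlencode(params)  # spaces become '+', '+' becomes %2B
decoded = parse_qs(encoded)  # values come back as lists

# doseq=True expands a sequence into repeated key=value pairs.
multi = urlencode({'tag': ['a', 'b']}, doseq=True)

print(encoded)
print(decoded)
print(multi)
```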
