Using the urllib module in Python 3

阿新 • Published: 2018-12-03

Reposted from: https://www.cnblogs.com/php-linux/p/8365941.html

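1. Basic usage

`urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)`

- url: the URL to open

- data: the data to submit in a POST request

- timeout: how long to wait, in seconds, before the request times out

Calling urlopen() from the urllib.request module fetches the page directly. The data it returns is of type bytes, so it has to be decoded with decode() to get a str.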

from urllib import request
response = request.urlopen(r'http://python.org/')  # an http.client.HTTPResponse object
page = response.read()  # bytes
page = page.decode('utf-8')  # str

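The object returned by urlopen() provides these methods:

- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data

- info(): returns an HTTPMessage object holding the headers sent back by the remote server

- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found

- geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect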
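The response methods listed above can be tried without touching a real site. The following is only a sketch under a few assumptions: it starts a throwaway local server (`HelloHandler` is a made-up name, not part of urllib) and fetches one page from it. The request goes through an opener built with an empty `ProxyHandler` purely so that the loopback request ignores any proxy configured in the environment; `opener.open()` takes the same `url`, `data` and `timeout` arguments as `urlopen()`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: answers every GET with a short UTF-8 page."""

    def do_GET(self):
        body = "hello, urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler disables proxy auto-detection for this opener.
opener = request.build_opener(request.ProxyHandler({}))

# The response works as a context manager, so it is closed on exit.
with opener.open(url, timeout=5) as response:
    status = response.getcode()        # HTTP status code
    final_url = response.geturl()      # URL actually retrieved
    headers = response.info()          # header container
    # Use the charset the server declared instead of assuming utf-8.
    charset = headers.get_content_charset() or "utf-8"
    page = response.read().decode(charset)

server.shutdown()
server.server_close()

print(status, final_url, headers["Content-Type"], page)
```

Since Python 3.9 the documentation prefers the attributes `status`, `url` and `headers` over `getcode()`, `geturl()` and `info()`; the methods are kept here because they are the ones this article describes.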

2. Using Request

`urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)`

Wrap the request in a Request object, then pass it to urlopen() to fetch the page.

from urllib import request
url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)  # wrap the URL and headers in a Request
page = request.urlopen(req).read()
page = page.decode('utf-8')

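The headers used above:

- User-Agent: identifies the client; it can carry the browser name and version, the operating system name and version, and the default language

- Referer: the page the request came from. Sites use it for hotlink protection; when an image is replaced by a notice such as "source: http://***.com", the site has checked the Referer

- Connection: tells the server whether to keep the network connection open after the response (keep-alive) or to close it. It does not store session state; that is normally done with cookies. urllib.request sets this header to close on every request it sends, so the keep-alive value above is overridden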
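A Request object can be inspected before anything is sent, which is a convenient way to check the headers. The sketch below is illustrative only: the User-Agent string is shortened, and the `Accept-Language` header is an extra not used in the article. One detail worth knowing is that `Request` stores header names through `str.capitalize()`, so lookups have to use that spelling.

```python
from urllib import request

url = "http://www.lagou.com/zhaopin/Python/?labelWords=label"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64)",  # shortened for the example
    "Referer": url,
}
req = request.Request(url, headers=headers)

# Nothing has been sent yet; the Request only describes what will be sent.
method = req.get_method()   # "GET", because no data is attached
host = req.host             # "www.lagou.com"

# Header names are stored capitalized ("User-agent"), so look them up that way.
user_agent = req.get_header("User-agent")
missing = req.get_header("User-Agent")   # None: the stored key is "User-agent"

# Headers can also be added after construction.
req.add_header("Accept-Language", "zh-TW")
language = req.get_header("Accept-language")

print(method, host, user_agent, missing, language)
```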

3. Posting data

`urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)`

The data argument of urlopen() defaults to None. When data is not None, urlopen() sends the request as a POST.

from urllib import request, parse
url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')  # form-encode, then convert str to bytes
req = request.Request(url, headers=headers, data=data)  # data present, so this is a POST
page = request.urlopen(req).read()
page = page.decode('utf-8')
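The rule that attaching data turns the request into a POST can be checked without sending anything, because `Request.get_method()` reports the method that would be used. This is a small sketch; the `PUT` request is an extra illustration of the `method` parameter and does not appear in the article.

```python
from urllib import parse, request

url = "http://www.lagou.com/jobs/positionAjax.json"
body = parse.urlencode({"first": "true", "pn": 1, "kd": "Python"}).encode("utf-8")

get_req = request.Request(url)                 # no data: GET
post_req = request.Request(url, data=body)     # data attached: POST
# method= overrides the default choice.
put_req = request.Request(url, data=body, method="PUT")

methods = [get_req.get_method(), post_req.get_method(), put_req.get_method()]
print(methods)        # ['GET', 'POST', 'PUT']
print(post_req.data)  # b'first=true&pn=1&kd=Python'
```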

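`urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)`

urlencode() converts a dict (or a sequence of two-element tuples) into a percent-encoded query string, which is the form the submitted data must take.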

from urllib import parse
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')

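After urlencode(), the data is first=true&pn=1&kd=Python: the key=value pairs are joined with &, not ?. In the POST above this string is sent in the request body, so the URL itself does not change. Appended to the URL as a GET query string, it would give

http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python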
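The exact output of urlencode() is easy to verify in isolation. The sketch below reuses the article's dictionary; the `"Python 3"` and `["Python", "Go"]` values are made up here to show how spaces and list values are handled.

```python
from urllib import parse

data = {"first": "true", "pn": 1, "kd": "Python"}
query = parse.urlencode(data)
print(query)  # first=true&pn=1&kd=Python

# Spaces become "+" by default; quote_via=parse.quote uses %20 instead.
plus_form = parse.urlencode({"kd": "Python 3"})
percent_form = parse.urlencode({"kd": "Python 3"}, quote_via=parse.quote)

# doseq=True expands a sequence value into repeated keys.
repeated = parse.urlencode({"kd": ["Python", "Go"]}, doseq=True)

# For a GET request the query string is appended to the URL by hand.
get_url = "http://www.lagou.com/jobs/positionAjax.json?" + query

# parse_qs reverses the encoding; every value comes back as a list of str.
decoded = parse.parse_qs(query)

print(plus_form, percent_form, repeated)
print(get_url)
print(decoded)
```

Pairs are joined with `&`, and the integer `1` is converted to the string `"1"` along the way, which is why `parse_qs` returns it as text.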

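POST data must be bytes or an iterable of bytes, not str, which is why the result of urlencode() is passed through encode().

The data can also be passed straight to urlopen() through its data argument:

page = request.urlopen(req, data=data).read()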
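What happens when the encode() step is forgotten can be shown with a short sketch. It assumes CPython's behaviour of checking the body type while the request is being prepared, before any connection is opened; `example.invalid` is a placeholder host chosen for this example.

```python
from urllib import parse, request

url = "http://example.invalid/jobs"       # placeholder host for the example
text = parse.urlencode({"kd": "Python"})  # str: "kd=Python"

# Passing str as POST data is rejected with a TypeError.
try:
    request.urlopen(request.Request(url, data=text))
    outcome = "sent"
except TypeError as exc:
    outcome = "TypeError: %s" % exc

body = text.encode("utf-8")               # bytes: b"kd=Python"
req = request.Request(url, data=body)     # an acceptable POST body

print(outcome)
print(req.get_method(), req.data)
```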

4. Exception handling

from urllib import request, parse, error
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers)
    page = None  # returned as-is if the request fails
    try:
        page = request.urlopen(req, data=data).read()
        page = page.decode('utf-8')
    except error.HTTPError as e:
        print(e.code)  # code is an attribute, not a method
        print(e.read().decode('utf-8'))  # body of the error response
    return page
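The listing above catches only HTTPError, which covers responses that arrive with an error status. Failures where no response arrives at all (DNS errors, refused connections) raise URLError instead. The sketch below is one possible way to handle both, under stated assumptions: `DemoHandler` and `fetch` are names invented for this example, the server is a throwaway local one, and the empty `ProxyHandler` only keeps the loopback requests away from any proxy configured in the environment.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves "/" and answers 404 for anything else."""

    def do_GET(self):
        found = self.path == "/"
        body = b"ok" if found else b"no such page"
        self.send_response(200 if found else 404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def fetch(opener, url):
    """Return (status, text); status is None if no HTTP response arrived."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.getcode(), response.read().decode("utf-8")
    except error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return e.code, e.read().decode("utf-8")
    except error.URLError as e:
        # No HTTP response at all, e.g. DNS failure or refused connection.
        return None, str(e.reason)


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
opener = request.build_opener(request.ProxyHandler({}))  # bypass any proxy

ok_result = fetch(opener, base + "/")
missing_result = fetch(opener, base + "/missing")

server.shutdown()
server.server_close()
print(ok_result, missing_result)
```

Returning the status together with the text lets the caller tell a successful page from an error page without relying on printed output.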

5. Using a proxy

`urllib.request.ProxyHandler(proxies=None)`

When the site you want to scrape restricts access, you can fetch the data through a proxy instead.

from urllib import request, parse
# The original listing omitted the def line; get_page(url) is assumed from section 4.
def get_page(url):
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # configure the proxy
    opener = request.build_opener(proxy)  # build an opener that uses it
    request.install_opener(opener)  # make it the default opener for urlopen()
    data = parse.urlencode(data).encode('utf-8')
    page = opener.open(url, data).read()
    page = page.decode('utf-8')
    return page
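Whether requests really go through the configured proxy can be observed locally. The following is a rough sketch, not a real proxy: `RecordingProxy` is a made-up stand-in that records the request target it receives and answers by itself instead of forwarding, and `example.invalid` is a placeholder host that never resolves. It assumes the environment does not exempt that host from proxying (for example through a `no_proxy` setting).

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

seen_targets = []


class RecordingProxy(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a proxy: records the request target and
    answers by itself instead of forwarding the request."""

    def do_GET(self):
        seen_targets.append(self.path)
        body = b"answered by the proxy"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


proxy_server = HTTPServer(("127.0.0.1", 0), RecordingProxy)
threading.Thread(target=proxy_server.serve_forever, daemon=True).start()
proxy_url = "http://127.0.0.1:%d" % proxy_server.server_port

# Only http:// URLs are routed through this proxy.
opener = request.build_opener(request.ProxyHandler({"http": proxy_url}))

# example.invalid never resolves, so getting a reply shows the proxy was used.
with opener.open("http://example.invalid/jobs", timeout=5) as response:
    page = response.read().decode("utf-8")

proxy_server.shutdown()
proxy_server.server_close()
print(page, seen_targets)
```

The proxy receives the full URL as the request target rather than just the path. Calling `opener.open()` keeps the proxy local to that opener, whereas `install_opener(opener)` would make plain `urlopen()` calls use it as well.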
