Python Crawler Part 3: The urllib Library

阿新 • Published: 2018-11-30

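Introduction

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: request module
urllib.error: exception handling module
urllib.parse: URL parsing module
urllib.robotparser: robots.txt parsing module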
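Of the four modules, urllib.robotparser is the only one not demonstrated elsewhere in this post, so here is a minimal sketch of it. The robots.txt rules and the example.com URLs are made up for illustration, and the rules are parsed from memory rather than fetched from a site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the URL
allowed_public = parser.can_fetch('*', 'http://example.com/index.html')
allowed_private = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed_public, allowed_private)
```

In a real crawler you would more likely call set_url() and read() to load the site's actual robots.txt before checking URLs.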

urlopen

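Purpose: send a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameters
- data: the body sent with a POST request; when it is None, the request is a GET
- ca*: options related to CA certificates (deprecated since Python 3.6 in favor of context)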
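The effect of the data parameter can be checked without touching the network. The sketch below builds two Request objects and asks each one which HTTP method it would use; the URL is only a placeholder, and nothing is sent.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/anything'  # placeholder; no request is sent

# Without data the request defaults to GET
get_request = urllib.request.Request(url)

# With data (which must be bytes) the request defaults to POST
post_body = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
post_request = urllib.request.Request(url, data=post_body)

get_method = get_request.get_method()
post_method = post_request.get_method()
print(get_method, post_method)  # GET POST
```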

Fetching a page's source (GET)

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Requesting a page (POST)

import urllib.parse
import urllib.request
# urlopen() needs the POST body as bytes, so encode the form string
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

HTTP testing site: http://httpbin.org

Setting a timeout

import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

Exception handling

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A connection timeout arrives wrapped in URLError.reason
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')
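A timeout does not always arrive wrapped in URLError: one that happens while waiting for or reading the response can surface as a bare socket.timeout. The helper below is a sketch that covers both cases; is_timeout is a hypothetical name, and the exceptions are built by hand so that the example runs offline.

```python
import socket
import urllib.error

def is_timeout(exc):
    """Return True if exc looks like a timeout raised around urlopen()."""
    # Case 1: the timeout was raised directly
    if isinstance(exc, socket.timeout):
        return True
    # Case 2: the timeout is wrapped in URLError.reason
    return (isinstance(exc, urllib.error.URLError)
            and isinstance(exc.reason, socket.timeout))

# Hand-made exceptions standing in for real network failures
wrapped = urllib.error.URLError(socket.timeout('timed out'))
refused = urllib.error.URLError(ConnectionRefusedError(111, 'Connection refused'))

print(is_timeout(wrapped), is_timeout(refused))  # True False
```

In real code you would catch (urllib.error.URLError, socket.timeout) around both urlopen() and response.read(), then pass the exception to a helper like this.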

Response

Response type

import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(type(response))

Status code and response headers

import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Request

A Request object gives you more ways to pass in request details, such as headers and the HTTP method

import urllib.request
request = urllib.request.Request('http://www.python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form rather than dict, to avoid shadowing the built-in dict type
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

from urllib import request, parse
url = 'http://httpbin.org/post'
# Named form rather than dict, to avoid shadowing the built-in dict type
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
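Before sending a Request, you can inspect what it will carry. The sketch below reuses the values from the examples above and reads them back from the object without making a network call. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent'.

```python
from urllib import parse, request

form = {'name': 'Germey'}
req = request.Request(
    url='http://httpbin.org/post',
    data=parse.urlencode(form).encode('utf-8'),
    headers={'Host': 'httpbin.org'},
    method='POST',
)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

method = req.get_method()
# Header names are stored via str.capitalize(), hence 'User-agent'
user_agent = req.get_header('User-agent')
has_host = req.has_header('Host')
body = req.data

print(method, has_host, body)
print(user_agent)
```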

Handler

Proxy

A proxy lets you disguise your IP address

import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

Cookie

Cookies are stored on the client; they keep user information and maintain login state

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
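The ignore_discard=True flag in the examples above matters because session cookies are marked as discardable and are skipped when saving by default. The sketch below shows this offline: it builds one session cookie by hand (make_session_cookie is a hypothetical helper, and the name, value, and domain are made up), then saves and reloads it from a temporary file.

```python
import http.cookiejar
import os
import tempfile

def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (no expiry, discard=True)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')
    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(make_session_cookie('sid', 'abc123', 'example.com'))

    # Keep session cookies when saving and when loading
    jar.save(ignore_discard=True, ignore_expires=True)
    restored = http.cookiejar.LWPCookieJar()
    restored.load(path, ignore_discard=True, ignore_expires=True)
    restored_pairs = [(c.name, c.value) for c in restored]

    # With the defaults, the session cookie is not written to the file
    jar.save()
    strict = http.cookiejar.LWPCookieJar()
    strict.load(path)
    strict_count = len(strict)

print(restored_pairs, strict_count)
```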

Exception handling

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
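The order of the except clauses above is deliberate: HTTPError is a subclass of URLError, so it has to be checked first or the more general clause would catch it. The sketch below demonstrates this with exceptions built by hand; describe is a hypothetical helper, and the URL and messages are made up.

```python
import io
from email.message import Message
from urllib import error

def describe(exc):
    """Summarize a urllib exception, most specific type first."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP error %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: %s' % exc.reason
    return 'other error: %r' % (exc,)

# Hand-made exceptions standing in for real failures
not_found = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', Message(), io.BytesIO(b''))
unreachable = error.URLError('name resolution failed')

messages = [describe(not_found), describe(unreachable)]
is_subclass = issubclass(error.HTTPError, error.URLError)
print(messages, is_subclass)
```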

URL parsing

urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
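The urlparse calls above are easier to compare when their fields are read side by side. This sketch repeats three of them and pulls out the parts that change: scheme is only a default for URLs that lack one, and allow_fragments=False leaves the fragment text attached to the preceding component.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

full = urlparse(url)
# scheme= applies only when the URL itself has no scheme
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
keep_scheme = urlparse(url, scheme='https')
# With fragments disabled, '#comment' stays inside the query
no_fragment = urlparse(url, allow_fragments=False)

print(full.scheme, full.netloc, full.path, full.params, full.query, full.fragment)
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)
print(keep_scheme.scheme)
print(no_fragment.query, repr(no_fragment.fragment))
```

Note that without a leading scheme or '//', the host name ends up in path rather than netloc.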

urlunparse

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
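urlunparse is the inverse of urlparse and expects exactly six components. As a small sketch, the round trip below parses a URL, rebuilds it, and also changes a single field through the named tuple's _replace() method before turning it back into a string with geturl().

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and unparsing returns the original URL
rebuilt = urlunparse(parts)

# ParseResult is a named tuple, so one field can be swapped out
changed = parts._replace(query='id=6').geturl()

print(rebuilt)
print(changed)
```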

urljoin

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
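urljoin resolves its second argument against the first much as a browser resolves a link on a page: anything the second URL leaves out is taken from the base, and an absolute second URL replaces the base entirely. The sketch below shows the common cases a crawler meets when following links; the example.com base URL is made up for illustration.

```python
from urllib.parse import urljoin

base = 'http://example.com/docs/guide/intro.html'

joined = {
    # Relative file name: replaces the last path segment
    'sibling': urljoin(base, 'setup.html'),
    # '..' climbs one directory
    'parent': urljoin(base, '../api/index.html'),
    # Leading '/' restarts from the site root
    'root': urljoin(base, '/about'),
    # A full URL ignores the base
    'absolute': urljoin(base, 'https://cuiqingcai.com/FAQ.html'),
    # Query only: keeps the base path
    'query_only': urljoin(base, '?page=2'),
}

for name, value in joined.items():
    print(name, value)
```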

urlencode

from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
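urlencode also takes care of escaping, which matters once parameter values contain spaces or non-ASCII text. The sketch below encodes such values, decodes them again with parse_qs, and uses doseq=True to repeat a key for each item of a list; the parameter names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

params = {'q': 'python urllib', 'lang': '中文'}

# Spaces become '+', non-ASCII text is percent-encoded as UTF-8
query = urlencode(params)

# parse_qs reverses the encoding; every value comes back as a list
decoded = parse_qs(query)

# doseq=True emits one key=value pair per list item
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)

print(query)     # q=python+urllib&lang=%E4%B8%AD%E6%96%87
print(decoded)
print(repeated)  # tag=a&tag=b
```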

Official urllib documentation: https://docs.python.org/3/library/urllib.html
