Python Web Crawlers (Practice)

阿新 • Published: 2020-09-09

01 Fetching Web Pages Quickly

1.1 The urlopen() function

import urllib.request
file=urllib.request.urlopen("http://www.baidu.com")
data=file.read()
fhandle=open("./1.html","wb")
fhandle.write(data)
fhandle.close()

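There are three common ways to read the response content:
file.read() reads the entire content and returns it as a single bytes object (call .decode() on it to get a string)
file.readlines() reads the entire content and returns it as a list of lines
file.readline() reads one line of the content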
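
The difference between the three calls can be checked without a network connection. The following is only a sketch under one assumption: instead of a real site it opens a small temporary file through a file:// URL, which urlopen() also accepts. The names demo_read_methods and sample.html are hypothetical.

```python
import pathlib
import tempfile
import urllib.request


def demo_read_methods():
    """Return what read(), readlines() and readline() give for a 3-line file."""
    with tempfile.TemporaryDirectory() as tmp:
        page = pathlib.Path(tmp) / "sample.html"  # hypothetical sample file
        page.write_bytes(b"<html>\n<body>hi</body>\n</html>\n")
        url = page.as_uri()  # a file:// URL, so no network is needed

        # Each call consumes the response, so reopen the URL every time.
        with urllib.request.urlopen(url) as resp:
            whole = resp.read()        # the entire body as one bytes object
        with urllib.request.urlopen(url) as resp:
            lines = resp.readlines()   # the entire body as a list of bytes lines
        with urllib.request.urlopen(url) as resp:
            first = resp.readline()    # only the first line
    return whole, lines, first


whole, lines, first = demo_read_methods()
print(type(whole).__name__, len(lines), first)
```

All three results are bytes, so decode them (for example with whole.decode("utf-8")) before treating them as text.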

1.2 The urlretrieve() function

urlretrieve() writes the fetched content directly to a local file.

import urllib.request
# urlretrieve() returns a (local_filename, headers) tuple
filename, headers = urllib.request.urlretrieve("http://edu.51cto.com", filename="./1.html")
# urlretrieve() may leave temporary files behind; urlcleanup() removes them
urllib.request.urlcleanup()
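
To see what urlretrieve() hands back, here is a small offline sketch. It assumes a local file:// URL stands in for a real site, and the file names source.html and copy.html, as well as the helper demo_urlretrieve, are made up for the illustration.

```python
import pathlib
import tempfile
import urllib.request


def demo_urlretrieve():
    """Copy a local file with urlretrieve() and return what it reports."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "source.html"   # hypothetical source file
        dst = pathlib.Path(tmp) / "copy.html"     # hypothetical target file
        src.write_bytes(b"<html>demo</html>")

        # urlretrieve() returns a (local_filename, headers) tuple.
        local_name, headers = urllib.request.urlretrieve(
            src.as_uri(), filename=str(dst))
        copied = dst.read_bytes()
        urllib.request.urlcleanup()  # remove temporary files that may be left over
    return pathlib.Path(local_name).name, copied, headers


name, copied, headers = demo_urlretrieve()
print(name, len(copied), headers["Content-Length"])
```

The first element of the tuple is the path that was written, and the second is a headers object like the one returned by info().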

1.3 Other common urllib usage

import urllib.request
file = urllib.request.urlopen("http://www.baidu.com")
# Get the response headers and related metadata
print(file.info())

# Bdpagetype: 1
# Bdqid: 0xb36679e8000736c1
# Cache-Control: private
# Content-Type: text/html;charset=utf-8
# Date: Sun, 24 May 2020 10:53:30 GMT
# Expires: Sun, 24 May 2020 10:52:53 GMT
# P3p: CP=" OTI DSP COR IVA OUR IND COM "
# P3p: CP=" OTI DSP COR IVA OUR IND COM "
# Server: BWS/1.1
# Set-Cookie: BAIDUID=D5BBF02F4454CBA7D3962001F33E17C6:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: BIDUPSID=D5BBF02F4454CBA7D3962001F33E17C6; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: PSTM=1590317610; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: BAIDUID=D5BBF02F4454CBA7FDDF8A87AF5416A6:FG=1; max-age=31536000; expires=Mon, 24-May-21 10:53:30 GMT; domain=.baidu.com; path=/; version=1; comment=bd
# Set-Cookie: BDSVRTM=0; path=/
# Set-Cookie: BD_HOME=1; path=/
# Set-Cookie: H_PS_PSSID=31729_1436_21118_31592_31673_31464_31322_30824; path=/; domain=.baidu.com
# Traceid: 1590317610038396263412927153817753433793
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# X-Ua-Compatible: IE=Edge,chrome=1
# Connection: close
# Transfer-Encoding: chunked

# Get the HTTP status code of the fetched page
print(file.getcode())
# 200

# Get the URL that was actually fetched
print(file.geturl())
# http://www.baidu.com

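In general, the URL standard allows only a subset of ASCII characters, such as digits, letters, and certain symbols. Other characters, such as Chinese characters, are not valid in a URL and have to be URL-encoded (percent-encoded) first.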

import urllib.parse
# quote() and unquote() are documented in urllib.parse; urllib.request only imports them from there
print(urllib.parse.quote("http://www.baidu.com"))
# http%3A//www.baidu.com
print(urllib.parse.unquote("http%3A//www.baidu.com"))
# http://www.baidu.com
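
Since the motivation is non-ASCII text, a short sketch with Chinese characters may be clearer. It only uses urllib.parse; the sample strings are illustrative.

```python
import urllib.parse

keyword = "你好"
encoded = urllib.parse.quote(keyword)      # UTF-8 bytes, percent-encoded
decoded = urllib.parse.unquote(encoded)    # back to the original text

# "/" is left alone by default; pass safe="" to encode it as well.
path_default = urllib.parse.quote("a/b c")
path_strict = urllib.parse.quote("a/b c", safe="")

print(encoded, decoded, path_default, path_strict)
```

Encode only the part that carries data (a path segment or a query value), not the whole URL, otherwise the ":" after the scheme is encoded too, as the http%3A output above shows.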

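02 Simulating a Browser: Headers

To keep their content from being scraped maliciously, some sites apply anti-crawler measures, and a plain crawler request then fails with a 403 error.
Setting some header fields lets the crawler present itself as a browser when visiting these sites.
There are two ways to make a crawler look like a browser.

2.1 Modifying headers with build_opener()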

import urllib.request

url= "http://www.baidu.com"
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data=opener.open(url).read()
fhandle=open("./2.html","wb")
fhandle.write(data)
fhandle.close()

2.2 Adding headers with add_header()

import urllib.request

url= "http://www.baidu.com"
req=urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data=urllib.request.urlopen(req).read()
fhandle=open("./2.html","wb")
fhandle.write(data)
fhandle.close()
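
When several headers are needed, they can also be passed to Request in one go. The sketch below builds the request without sending it; the Accept-Language value is an arbitrary example, not something the sites above require.

```python
import urllib.request

# Example header values, chosen only for illustration.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.8",
}

# Passing headers= is equivalent to calling add_header() for each entry.
req = urllib.request.Request("http://www.baidu.com", headers=browser_headers)

# Request stores header names in capitalized form, e.g. "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())
```

The request object can then be opened with urllib.request.urlopen(req) exactly as in 2.2.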

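03 Timeout Settings

When a page does not respond for a long time, the request is considered timed out and the page cannot be opened. The timeout parameter of urlopen() controls how long to wait.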

import urllib.request

# timeout sets the timeout in seconds
file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
data = file.read()
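
A timeout surfaces as an exception, so in practice the call is usually wrapped. The helper below, fetch_with_timeout, is a hypothetical name and only one possible way to do it: it returns None instead of raising.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=1.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as exc:
        # Failures while connecting, including connect timeouts, arrive as URLError.
        print("request failed:", exc.reason)
    except socket.timeout:
        # A timeout while waiting for or reading the response is raised directly.
        print("timed out after", timeout, "seconds")
    return None
```

It could be called as data = fetch_with_timeout("http://yum.iqianyue.com", timeout=1), followed by a check for None before using the data.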

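04 Proxy Servers

When we crawl a site through a proxy server, the site sees the proxy's IP address rather than our real one. Even if the site blocks that address, it matters little, because we can switch to another proxy address and continue crawling.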

def use_proxy(proxy_addr,url):
    import urllib.request
    proxy= urllib.request.ProxyHandler({'http':proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data
proxy_addr="xxx.xx.xxx.xx:xxxx"
data=use_proxy(proxy_addr,"http://www.baidu.com")
print(len(data))

urllib.request.install_opener() installs the opener as the global default, so later calls to urlopen() use it as well.
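
If only some requests should go through the proxy, the opener can be used directly instead of being installed globally. This is a sketch of that variation, not the article's method; build_proxy_opener is a hypothetical helper and 127.0.0.1:8080 is a placeholder address.

```python
import urllib.request


def build_proxy_opener(proxy_addr):
    """Build an opener that routes HTTP and HTTPS through proxy_addr ("host:port")."""
    proxy = urllib.request.ProxyHandler({
        "http": "http://" + proxy_addr,
        "https": "http://" + proxy_addr,
    })
    # No install_opener() here, so urllib.request.urlopen() stays unaffected.
    return urllib.request.build_opener(proxy)


# 127.0.0.1:8080 is a placeholder address, not a real proxy.
opener = build_proxy_opener("127.0.0.1:8080")
# data = opener.open("http://www.baidu.com", timeout=5).read()  # needs a working proxy
```

Keeping one opener per proxy address makes it easy to rotate to another address when one gets blocked.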

05 Cookie

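HTTP by itself is stateless: after we log in to a site successfully, the login state is lost as soon as we visit another page of that site, and we would have to log in again. The session information, such as the fact that the login succeeded, therefore has to be saved somehow.
Two approaches are common:
1) Saving session information with cookies
2) Saving session information with sessions
Whichever approach is used for session control, cookies are involved most of the time.
A common procedure for handling cookies is:
1) Import the cookie handling module http.cookiejar.
2) Create a CookieJar object with http.cookiejar.CookieJar().
3) Create a cookie processor with HTTPCookieProcessor and build an opener object from it.
4) Install that opener as the global default.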

import urllib.request
import urllib.parse
import http.cookiejar
url = "http://xx.xx.xx/1.html"
postdata = urllib.parse.urlencode({
    "username":"xxxxxx",
    "password":"xxxxxx"
}).encode("utf-8")
req = urllib.request.Request(url,postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
# Create a CookieJar object with http.cookiejar.CookieJar()
cjar = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor and build an opener from it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# Install the opener as the global default
urllib.request.install_opener(opener)
file = opener.open(req)

data=file.read()
fhandle=open("./4.html","wb")
fhandle.write(data)
fhandle.close()

url1 = "http://xx.xx.xx/2.html"
data1= urllib.request.urlopen(url1).read()
fhandle1=open("./5.html","wb")
fhandle1.write(data1)
fhandle1.close()
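
Because the URLs above are placeholders, the effect of the CookieJar is hard to observe. The following sketch, under the assumption that a toy local server is an acceptable stand-in for a real login page, shows the cookie being stored after the first request and sent back on the second. CookieEchoHandler, the paths /login and /profile, and the cookie session=abc123 are all invented for the demonstration.

```python
import http.cookiejar
import http.server
import threading
import urllib.request


class CookieEchoHandler(http.server.BaseHTTPRequestHandler):
    """Toy server: sets a cookie on /login, echoes received cookies elsewhere."""

    def do_GET(self):
        if self.path == "/login":
            body = b"logged in"
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        else:
            body = self.headers.get("Cookie", "").encode("utf-8")
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def demo_cookie_roundtrip():
    """Log in once, then check that the cookie is sent on the next request."""
    server = http.server.HTTPServer(("127.0.0.1", 0), CookieEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        cjar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any system proxy for localhost
            urllib.request.HTTPCookieProcessor(cjar))
        opener.open(base + "/login").read()
        names = [cookie.name for cookie in cjar]   # cookies stored after login
        echoed = opener.open(base + "/profile").read().decode("utf-8")
        return names, echoed
    finally:
        server.shutdown()
        server.server_close()


names, echoed = demo_cookie_roundtrip()
print(names, echoed)
```

The second request carries the cookie only because it goes through the same opener; a plain urlopen() call would need install_opener(opener) first, which is what step 4 is for.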

06 DebugLog

Print debug log output while the program runs.

import urllib.request
httphd=urllib.request.HTTPHandler(debuglevel=1)
httpshd=urllib.request.HTTPSHandler(debuglevel=1)
opener=urllib.request.build_opener(httphd,httpshd)
urllib.request.install_opener(opener)
data=urllib.request.urlopen("http://www.baidu.com")

07 Exception Handling: URLError

import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://blog.baidusss.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)

Or

import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e,"code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
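
The order of the except clauses in the first version matters because HTTPError is a subclass of URLError. The sketch below illustrates this with hand-built exceptions so that it runs offline; describe_error is a hypothetical helper and the URL is a placeholder.

```python
import email.message
import io
import urllib.error


def describe_error(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so it must be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % exc.reason
    raise TypeError("not a urllib error: %r" % (exc,))


# Hand-built exceptions, used here only so the example runs without a network.
not_found = urllib.error.HTTPError(
    "http://example.invalid/missing", 404, "Not Found",
    email.message.Message(), io.BytesIO())
unreachable = urllib.error.URLError("host unreachable")

print(describe_error(not_found))
print(describe_error(unreachable))
```

An HTTPError has both code and reason, while a plain URLError has only reason, which is why the second version checks with hasattr().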

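08 HTTP Requests in Practice

HTTP requests fall into six main types, whose purposes are as follows:
1) GET: passes information in the URL. The information can be written directly into the URL or submitted through a form.
If a form is used, the form data is automatically converted into the query part of the URL and sent that way.
2) POST: submits data to the server in the request body. It is the mainstream way to send data and, since the data stays out of the URL, a comparatively safer one.
3) PUT: asks the server to store a resource, usually at a specified location.
4) DELETE: asks the server to delete a resource.
5) HEAD: requests only the HTTP headers of the response.
6) OPTIONS: retrieves the request types supported for the current URL.
In addition, there are TRACE and CONNECT requests; TRACE is mainly used for testing or diagnostics.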
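
In urllib the request type is a property of the Request object. The following sketch only constructs requests and prints their method, without sending anything; the URL is a placeholder.

```python
import urllib.request

url = "http://www.example.com/resource"  # placeholder URL

get_req = urllib.request.Request(url)                    # no data: GET
post_req = urllib.request.Request(url, data=b"k=v")      # data given: POST
head_req = urllib.request.Request(url, method="HEAD")    # explicit method
put_req = urllib.request.Request(url, data=b"{}", method="PUT")
delete_req = urllib.request.Request(url, method="DELETE")

methods = [r.get_method() for r in (get_req, post_req, head_req, put_req, delete_req)]
print(methods)
```

Without the method argument urllib chooses between GET and POST depending on whether data is present, which is the behaviour the two examples below rely on.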

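8.1 GET request example

A GET request takes the following steps:
1) Build the URL, which carries the field names and field values of the GET request.
GET request format: http://address?field1=value1&field2=value2
2) Build a Request object with that URL as its argument.
3) Open the Request object with urlopen().
4) Process the result as needed.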

import urllib.request

url="http://www.baidu.com/s?wd="
key="你好"
key_code=urllib.request.quote(key)
url_all=url+key_code
req=urllib.request.Request(url_all)
data=urllib.request.urlopen(req).read()
fh=open("./3.html","wb")
fh.write(data)
fh.close()
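
With more than one field, joining the pieces by hand gets error-prone. As an alternative sketch, urllib.parse.urlencode() can build the whole query string; the second field pn is an illustrative assumption and is not part of the example above.

```python
import urllib.parse

base = "http://www.baidu.com/s"
# "pn" is an illustrative second field; only "wd" appears in the example above.
params = {"wd": "你好", "pn": 10}

query = urllib.parse.urlencode(params)   # encodes every key and value
url_all = base + "?" + query
print(url_all)

# parse_qs() reverses the encoding, which is handy for checking the result.
parsed = urllib.parse.parse_qs(urllib.parse.urlsplit(url_all).query)
print(parsed)
```

The resulting url_all can be passed to urllib.request.Request() exactly like the hand-built URL.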

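8.2 POST request example

A POST request takes the following steps:
1) Set the URL.
2) Build the form data and encode it with urllib.parse.urlencode.
3) Create a Request object whose arguments include the URL and the data to send.
4) Add header information with add_header() to simulate a browser.
5) Open the Request object with urllib.request.urlopen() to send the data.
6) Process the result.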

import urllib.request
import urllib.parse

url = "http://www.xxxx.com/post/"
postdata =urllib.parse.urlencode({
"name":"[email protected]",
"pass":"xxxxxxx"
}).encode('utf-8') 
req = urllib.request.Request(url,postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data=urllib.request.urlopen(req).read()
fhandle=open("D:/Python35/myweb/part4/6.html","wb")
fhandle.write(data)
fhandle.close()
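
Steps 2 and 3 can be inspected before anything is sent. This is a minimal sketch with made-up form values and a placeholder URL; it shows what the encoded body looks like and that supplying data turns the request into a POST.

```python
import urllib.parse
import urllib.request

form = {"name": "alice", "pass": "secret"}                 # made-up form values
postdata = urllib.parse.urlencode(form).encode("utf-8")    # the body must be bytes

# Placeholder URL; nothing is sent until the request is opened.
req = urllib.request.Request("http://www.example.com/post/", postdata)

print(req.get_method())   # the method chosen by urllib
print(req.data)           # the encoded form body

# Decoding the body again recovers the original fields.
roundtrip = urllib.parse.parse_qs(req.data.decode("utf-8"))
```

Forgetting the .encode("utf-8") call is a common mistake: the body has to be bytes, not str.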
