The Python3 urllib and requests Libraries
1. Python3: Making Network Requests with the urllib Library
1.1 GET Requests with the urllib Library
Request the Baidu homepage www.baidu.com without adding any request header information:
import urllib.request


def get_page():
    url = 'http://www.baidu.com/'
    res = urllib.request.urlopen(url=url)
    page_source = res.read().decode('utf-8')
    print(page_source)


if __name__ == '__main__':
    get_page()
The output shows the source of the Baidu homepage. However, some sites have anti-crawler measures in place, and the code above may come back with a 40X response code because the site has identified the visitor as a crawler. In that case the crawler needs a disguise: set a headers (User-Agent) attribute so that it mimics a browser requesting the site.
1.2 Requesting a Site Disguised with a User-Agent
Since the urllib.request.urlopen() function does not accept a headers argument, a urllib.request.Request object has to be built to set the request headers:
import urllib.request


def get_page():
    url = 'http://www.baidu.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    res = urllib.request.urlopen(request)
    page_source = res.read().decode('utf-8')
    print(page_source)


if __name__ == '__main__':
    get_page()
Adding the headers argument makes the request mimic browser behavior.
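As a quick sanity check (a sketch using the httpbin.org echo service, not part of the original example), you can confirm which User-Agent the server actually received:

import json
import urllib.request


def show_user_agent():
    # httpbin.org/get echoes the request headers back as JSON,
    # so the User-Agent the server saw can be printed directly
    url = 'http://httpbin.org/get'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    res = urllib.request.urlopen(request)
    body = json.loads(res.read().decode('utf-8'))
    print(body['headers']['User-Agent'])


if __name__ == '__main__':
    show_user_agent()

Without the headers argument, the same request reports a default User-Agent along the lines of Python-urllib/3.x, which is exactly what anti-crawler checks look for.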
1.3 POST Requests with the urllib Library, Using Cookies to Maintain the Session
Log in to the ChinaUnix forum, fetch the homepage source, and then visit an article. First, see what happens without cookies:
import urllib.request
import urllib.parse


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=url, data=postdata, headers=headers)
    res = urllib.request.urlopen(req)
    page_source = res.read().decode('gbk')
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    res1 = urllib.request.urlopen(url=url1)
    page_source1 = res1.read().decode('gbk')
    print(page_source1)


if __name__ == '__main__':
    get_page()
Search the returned source for the username StrivePy: the login itself succeeds, but when the second article is requested the page shows guest status, so the session was not maintained. Now try it with cookies:
import urllib.request
import urllib.parse
import http.cookiejar


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=url, data=postdata, headers=headers)
    # Create a CookieJar object
    cjar = http.cookiejar.CookieJar()
    # Build a cookie handler around the CookieJar
    cookie = urllib.request.HTTPCookieProcessor(cjar)
    # Build an opener from the cookie handler
    opener = urllib.request.build_opener(cookie)
    # Install the opener globally, overriding urlopen();
    # alternatively, call opener.open() for one-off use
    urllib.request.install_opener(opener)
    res = urllib.request.urlopen(req)
    page_source = res.read().decode('gbk')
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    res1 = urllib.request.urlopen(url=url1)
    page_source1 = res1.read().decode('gbk')
    print(page_source1)


if __name__ == '__main__':
    get_page()
This time, after logging in, visits to other articles show a logged-in state. To keep the cookies for later use, a MozillaCookieJar object can save them to a file.
import urllib.request
import urllib.parse
import http.cookiejar


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=url, data=postdata, headers=headers)
    filename = 'cookies.txt'
    # Create a MozillaCookieJar bound to the file
    cjar = http.cookiejar.MozillaCookieJar(filename)
    # Build a cookie handler around the CookieJar
    cookie = urllib.request.HTTPCookieProcessor(cjar)
    # Build an opener from the cookie handler
    opener = urllib.request.build_opener(cookie)
    # Use the opener directly for this one request
    opener.open(req)
    # Save the cookies to the file
    cjar.save(ignore_discard=True, ignore_expires=True)


if __name__ == '__main__':
    get_page()
This creates a file named cookies.txt in the current working directory. The next time around (provided the cookies have not expired), the file can be read back to access the site without logging in. For example, visit an article directly without logging in (the article is accessible to guests as well; the point is whether the page header shows a logged-in state):
import urllib.request
import http.cookiejar


def get_page():
    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    filename = 'cookies.txt'
    cjar = http.cookiejar.MozillaCookieJar(filename)
    cjar.load(ignore_discard=True, ignore_expires=True)
    cookie = urllib.request.HTTPCookieProcessor(cjar)
    opener = urllib.request.build_opener(cookie)
    res1 = opener.open(url1)
    page_source1 = res1.read().decode('gbk')
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows the article is being viewed in a logged-in state.
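To see what actually got persisted, the saved jar can be loaded and iterated; a small sketch (not part of the original post):

import http.cookiejar


def list_cookies():
    cjar = http.cookiejar.MozillaCookieJar('cookies.txt')
    cjar.load(ignore_discard=True, ignore_expires=True)
    # Each entry is a Cookie object with name/domain/expiry attributes
    for cookie in cjar:
        print(cookie.name, cookie.domain, cookie.expires)


if __name__ == '__main__':
    list_cookies()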
1.4 Proxied Requests with the urllib Library
Using a proxy is an effective way to keep a crawler from being blocked.
import urllib.request


def proxy_test():
    url = 'http://myip.kkcha.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    proxy = {
        'http': '180.137.232.101:53281'
    }
    # Create a ProxyHandler object
    proxy_handler = urllib.request.ProxyHandler(proxy)
    # Build an opener from the handler
    opener = urllib.request.build_opener(proxy_handler)
    # Install the opener globally
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen(request)
    page_source = response.read().decode('utf-8')
    print(page_source)


if __name__ == '__main__':
    proxy_test()
The fetched page should display the proxy IP. For reasons unknown, it sometimes renders correctly and sometimes redirects to a Youdao Dictionary ad page; this needs further investigation.
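A more deterministic way to verify the proxy (a sketch, not from the original post) is to query http://httpbin.org/ip, which returns only the caller's IP as JSON, and to add a timeout and error handling, since public proxies fail frequently:

import urllib.request


def check_proxy_ip():
    # Sample proxy address from above; it may well be dead by now
    proxy = {'http': '180.137.232.101:53281'}
    proxy_handler = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(proxy_handler)
    try:
        res = opener.open('http://httpbin.org/ip', timeout=10)
        # A working proxy should print its own address, not yours
        print(res.read().decode('utf-8'))
    except OSError as err:  # urllib.error.URLError is a subclass of OSError
        print('proxy request failed:', err)


if __name__ == '__main__':
    check_proxy_ip()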
2. Python3: Accessing the Network with the requests Library
2.1 GET Requests with the requests Library
Send a GET request to the test site http://httpbin.org.
import requests


def request_test():
    url = 'http://httpbin.org/get'
    response = requests.get(url)
    print(type(response.text), response.text)
    print(type(response.content), response.content)


if __name__ == '__main__':
    request_test()
The response body is obtained directly:
<class 'str'> {"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"origin":"121.61.132.191","url":"http://httpbin.org/get"}

<class 'bytes'> b'{"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"origin":"121.61.132.191","url":"http://httpbin.org/get"}\n'
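Since httpbin returns JSON, requests can also decode the body into a dict directly via response.json(); a brief sketch (not in the original post):

import requests

if __name__ == '__main__':
    response = requests.get('http://httpbin.org/get')
    body = response.json()  # parses the JSON response body into a dict
    print(body['headers']['User-Agent'])  # e.g. python-requests/2.18.4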
Three ways to pass parameters in a GET request:
- Encode a dict of parameters into a query string with the urllib.parse.urlencode() function (see the sketch after this list):
http://httpbin.org/get?key1=value1&key2=value2
- Pass the params argument directly to the requests.get() function:
import requests

if __name__ == '__main__':
    payload = {
        'key1': 'value1',
        'key2': 'value2'
    }
    response = requests.get('http://httpbin.org/get', params=payload)
    print(response.url)
http://httpbin.org/get?key1=value1&key2=value2
- Include the parameters directly in the URL:
http://httpbin.org/get?key2=value2&key1=value1
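For the first approach, a minimal sketch of building the query string by hand with urllib.parse.urlencode(), which should end up equivalent to the params version above:

import urllib.parse

import requests

if __name__ == '__main__':
    payload = {
        'key1': 'value1',
        'key2': 'value2'
    }
    # urlencode() produces 'key1=value1&key2=value2'; append it after '?'
    query = urllib.parse.urlencode(payload)
    response = requests.get('http://httpbin.org/get?' + query)
    print(response.url)  # http://httpbin.org/get?key1=value1&key2=value2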
2.2 POST Requests with the requests Library, Using a Session to Maintain State
Log in to the ChinaUnix forum, fetch the homepage source, and then visit an article. First, see what happens without a session:
import requests


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    response = requests.post(url=url, data=data, headers=headers)
    page_source = response.text
    print(response.status_code)
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    response1 = requests.get(url=url1, headers=headers)
    page_source1 = response1.text
    print(response1.status_code)
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows guest mode when the other article is visited. Next, use a session to maintain the conversation:
import requests


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    session = requests.Session()
    response = session.post(url=url, data=data, headers=headers)
    page_source = response.text
    print(response.status_code)
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    response1 = session.get(url=url1, headers=headers)
    page_source1 = response1.text
    print(response1.status_code)
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows a logged-in state when visiting other articles: the session was maintained. Using a session is analogous to urllib's approach of using an opener, whether called directly or installed globally.
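One more convenience worth noting: headers set on the session object ride along on every request it makes, so the headers dict does not need to be repeated per call. A short sketch against httpbin (not from the original post):

import requests

if __name__ == '__main__':
    session = requests.Session()
    # Defaults set here are merged into every request made via the session
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    })
    response = session.get('http://httpbin.org/get')
    print(response.json()['headers']['User-Agent'])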
2.3 Proxied Requests with the requests Library
Using a proxy with the requests library:
import requests


def proxy_test():
    url = 'http://myip.kkcha.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    proxy = {
        'https': '61.135.217.7:80'
    }
    response = requests.get(url=url, headers=headers, proxies=proxy)
    print(response.text)


if __name__ == '__main__':
    proxy_test()
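Since public proxies die often, it is prudent to add a timeout and exception handling, and to supply both http and https entries; a sketch (the proxy address is just the sample from above and may no longer work):

import requests


def proxy_test_safe():
    proxies = {
        'http': 'http://61.135.217.7:80',
        'https': 'http://61.135.217.7:80'
    }
    try:
        response = requests.get('http://httpbin.org/ip',
                                proxies=proxies, timeout=10)
        print(response.text)  # should show the proxy IP, not your own
    except requests.exceptions.RequestException as err:
        print('proxy request failed:', err)


if __name__ == '__main__':
    proxy_test_safe()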