Python: URL Parsing and Web Scraping
阿新 • Published: 2019-02-05
1. Web Scraping
(1) What happens when you browse a web page

Browser (sends a request) -> you enter a URL (e.g. http://www.baidu.com/index.html, file:///mnt, ftp://172.25.254.250/pub) -> the protocol (http) and the domain to visit (www.baidu.com) are determined -> a DNS server resolves the domain to an IP address -> the page content to fetch is located -> the fetched page content is returned to the browser (the response).

(2) Fetching a web page

1). Basic method

from urllib import request
from urllib.error import URLError

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)
    content = response.read().decode('utf-8')
    print(content)
except URLError as e:
    print("Request timed out:", e.reason)

2). Using a Request object (extra header information can be added)

from urllib import request
from urllib.error import URLError

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}

try:
    # Instantiate a Request object; custom request headers can be set on it
    req = request.Request(url, headers=headers)
    # urlopen accepts a Request object as well as a plain URL string
    content = request.urlopen(req).read().decode('utf-8')
    print(content)
except URLError as e:
    print(e.reason)
else:
    print("success")

Output:

Headers can also be attached after the Request object is created:

from urllib import request
from urllib.error import URLError

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

try:
    # Instantiate a Request object, then add the header afterwards
    req = request.Request(url)
    req.add_header('User-Agent', user_agent)
    content = request.urlopen(req).read().decode('utf-8')
    print(content)
except URLError as e:
    print(e.reason)
else:
    print("success")

Output: (partial screenshot)

(3) Anti-scraping countermeasures

1). Impersonating a browser (as above). Some common User-Agent strings:

1. Android
Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

2. Firefox
Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0

3. Google Chrome
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

4. iOS
Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3

2). IP proxies

A scraper runs fast, so a single fixed IP ends up requesting the site at a very high frequency; if the site has anti-scraping measures in place, that IP gets banned. How to work around this?
- Add a delay: time.sleep(random.randint(1, 5))
- Use an IP proxy, so another IP makes the request in place of yours.

Where to get proxy IPs? http://www.xicidaili.com/

Implementation steps (the code below walks through them):
1). Call urllib.request.ProxyHandler(proxies=None) -- conceptually similar to a Request object
2). Build an opener -- similar to urlopen, but customized
3). Install the opener
4). The request goes out through the chosen proxy IP
from urllib import request
from urllib.error import URLError

url = 'https://httpbin.org/get'
proxy = {'https': '171.221.239.11:808', 'http': '218.14.115.211:3128'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

# 1) Create the ProxyHandler -- conceptually similar to a Request object
proxy_support = request.ProxyHandler(proxy)
# 2) Build the opener -- similar to urlopen, but customized
opener = request.build_opener(proxy_support)
# Impersonate a browser
opener.addheaders = [('User-Agent', user_agent)]
# 3) Install the opener
request.install_opener(opener)
# 4) The request now goes out through the chosen proxy IP
response = request.urlopen(url)
content = response.read().decode('utf-8')
print(content)
Output:
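Combining the two countermeasures, here is a minimal sketch (not from the original post) that rotates through a couple of the User-Agent strings listed above and sleeps a random interval between requests; the URLs are placeholders:

import random
import time
from urllib import request

# A few of the User-Agent strings listed above
user_agents = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
]

urls = ['http://www.baidu.com', 'http://httpbin.org/get']  # placeholder URLs
for url in urls:
    # Pick a random User-Agent for each request
    req = request.Request(url, headers={'User-Agent': random.choice(user_agents)})
    content = request.urlopen(req).read().decode('utf-8')
    print(len(content))
    # Random delay so requests do not arrive at a fixed, high frequency
    time.sleep(random.randint(1, 5))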
2. Saving Cookie Information
1> What is cookie information?

Some websites need to identify users: certain pages can only be visited after logging in. A cookie supports this kind of session tracking by saving user-related information (user name and so on) on the local client, for example:

"uuid_tt_dd=10_19652618930-1532935869256-858290; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1538533772,1538552856,1539399036; ADHOC_MEMBERSHIP_CLIENT_ID1.0=99bc6db6-14c6-e386-1bd0-89ff0447e258; smidV2=20180908162356b9a9b99821267a2d7b2fccd5f4d8129d00a43c058f4e41e20; UN=gf_lvah; BT=1539399071335; dc_session_id=10_1538533770767.652739; TY_SESSION_ID=85f3368a-4382-44cc-b49e-5c4e0d13080c; dc_tos=pginxg; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1539399076; UserName=gf_lvah; UserInfo=QAG2Pho%2B1xl0ZfBdwCCAFT9u9yaIgalLraXhrQ7UQ%2FKy9YH9nIraCT%2FwEBU%2BMDLiXwKXmBNQWmFDaRvwXkGFH%2FhbJ3q6ceS69ezJDfFxagisBJar1pLzXWsVJ4A1AqXX; UserNick=MyWestos; AU=516; UserToken=QAG2Pho%2B1xl0ZfBdwCCAFT9u9yaIgalLraXhrQ7UQ%2FKy9YH9nIraCT%2FwEBU%2BMDLiXwKXmBNQWmFDaRvwXkGFH%2FhbJ3q6ceS69ezJDfFxagj0%2BCh019zDdo5BtfMg44vykcNOlmP2fNOJZiwZg1%2B5egpvFxxqU4W42NpY7LWVgV4%3D"

2> Implementation steps: saving cookies into a variable

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# Cookies can be stored in a variable or in a file;
# the relevant class hierarchy is CookieJar ---> FileCookieJar ---> MozillaCookieJar
cookie = cookiejar.CookieJar()
# Build a cookie handler with urllib.request's HTTPCookieProcessor
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler (the default opener is urlopen)
opener = request.build_opener(handler)
# Open the URL
response = opener.open('http://www.baidu.com')
# Print the page's cookie information
print(cookie)
for item in cookie:
    print(item)

Output:

3> Saving cookies to a file in a specified format

# File to save the cookies to
cookieFilename = 'cookie.txt'
# Declare a MozillaCookieJar, which stores cookies and can write them to a file
cookie = cookiejar.MozillaCookieJar(filename=cookieFilename)
# Build a cookie handler with urllib.request's HTTPCookieProcessor
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler (the default opener is urlopen)
opener = request.build_opener(handler)
# Open the URL
response = opener.open('http://www.baidu.com')
# Print the cookies
print(cookie)
print(type(cookie))
cookie.save(ignore_discard=True, ignore_expires=True)

Output:

4> Reading cookies back from the file and using them for a request

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# Location of the existing cookie file
cookieFilename = 'cookie.txt'
# Declare a MozillaCookieJar, used here to read the cookie information back from the file
cookie = cookiejar.MozillaCookieJar()
# Load the cookie contents from the file
cookie.load(filename=cookieFilename)
# Build a cookie handler with urllib.request's HTTPCookieProcessor
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler (the default opener is urlopen)
opener = request.build_opener(handler)
# Open the URL
response = opener.open('http://www.baidu.com')
# Print the page content
print(response.read().decode('utf-8'))
Output: (partial screenshot)
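For reference, MozillaCookieJar saves in the classic Netscape cookie file format; a saved cookie.txt looks roughly like the sketch below (the entry is illustrative, not real output, and the fields are tab-separated in the actual file: domain, include-subdomains flag, path, secure flag, expiry timestamp, name, value):

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com    TRUE    /    FALSE    3686355434    BAIDUID    0FEA93A0B1C2D3E4:FG=1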
3. Common urllib Exception Handling
Exceptions:
exception urllib.error.URLError
exception urllib.error.HTTPError
exception urllib.error.ContentTooShortError(msg, content)
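HTTPError and URLError are exercised in the examples below. ContentTooShortError is raised by urllib.request.urlretrieve when the data actually downloaded is shorter than the Content-Length header promised; a minimal sketch (URL and output path are placeholders):

from urllib import request, error

try:
    # urlretrieve downloads a URL to a local file and raises
    # ContentTooShortError if the transfer is cut short
    request.urlretrieve('http://www.baidu.com', '/tmp/index.html')
except error.ContentTooShortError as e:
    print('download incomplete:', e)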
from urllib import request, error

try:
    url = 'https://www.baidu.com/hello.html'
    response = request.urlopen(url)
    print(response.read().decode('utf-8'))
except error.HTTPError as e:
    # HTTPError also carries the status code and response headers
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("success")
Output:
Timeout exception handling
from urllib import request, error
import socket

try:
    url = 'https://www.baidu.com'
    response = request.urlopen(url, timeout=0.01)
    print(response.read().decode('utf-8'))
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
    # A timeout surfaces as a URLError whose reason is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("timed out")
else:
    print("success")
Output:
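A timeout like this is often transient and worth retrying. Here is a minimal retry sketch built on the example above (the retry count, timeout, and pause are arbitrary choices):

import socket
import time
from urllib import request, error

def fetch(url, retries=3, timeout=0.5):
    for attempt in range(retries):
        try:
            return request.urlopen(url, timeout=timeout).read().decode('utf-8')
        except error.URLError as e:
            # Retry only on timeouts, and only while attempts remain
            if isinstance(e.reason, socket.timeout) and attempt < retries - 1:
                time.sleep(1)  # brief pause before retrying
                continue
            raise

content = fetch('https://www.baidu.com')
print(len(content))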
4. The requests Module
Introductory example

import requests

url = 'http://www.baidu.com'
response = requests.get(url)
print(response)
print(response.status_code)
print(response.cookies)
print(response.text)
print(type(response.text))

Output:

Common request methods

import requests

response = requests.post('http://httpbin.org/post', data={'name': 'fentiao', 'age': 10})
print(response.text)

response = requests.delete('http://httpbin.org/delete', data={'name': 'fentiao'})
print(response.text)

Output:

GET request with parameters

import requests

data = {
    'start': 20,
    'limit': 40,
    'sort': 'new_score',
    'status': 'P',
}
url = 'https://movie.douban.com/subject/4864908/comment'
response = requests.get(url, params=data)
print(response.url)

Output:

Parsing JSON

import requests

ip = input("Enter the IP to look up: ")
url = "http://ip.taobao.com/service/getIpInfo.php?ip=%s" % (ip)
response = requests.get(url)
content = response.json()
print(content)
print(type(content))

Output:

Fetching binary data

import requests

url = 'https://gss0.bdstatic.com/-4o3dSag_xI4khGkpoWK1HF6hhy/baike/w%3D268%3Bg%3D0/sign=4f7bf38ac3fc1e17fdbf8b3772ab913e/d4628535e5dde7119c3d076aabefce1b9c1661ba.jpg'
response = requests.get(url)
print(response.text)
# response.content holds the raw bytes, suitable for binary files
with open('github.png', 'wb') as f:
    f.write(response.content)

Downloading a video

import requests

url = "http://gslb.miaopai.com/stream/sJvqGN6gdTP-sWKjALzuItr7mWMiva-zduKwuw__.mp4"
response = requests.get(url)
with open('/tmp/learn.mp4', 'wb') as f:
    f.write(response.content)

Adding headers

import requests

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
headers = {'User-Agent': user_agent}
response = requests.get(url, headers=headers)
print(response.text)
print(response.status_code)

Output:

Working with the response

Common response attributes:

response = requests.get(url, headers=headers)
print(response.headers)
print(response.url)

Checking the status code:

response = requests.get(url, headers=headers)
exit() if response.status_code != 200 else print("request succeeded")

Output:

Advanced settings

Uploading a file

import requests

# The upload payload is stored in a dict
data = {'file': open('github.png', 'rb')}
response = requests.post('http://httpbin.org/post', files=data)
print(response.text)

Output:

Getting cookie information

import requests

response = requests.get('http://www.csdn.net')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + "=" + value)

Output:

Reusing existing cookies across requests (session persistence)

import requests

# Set a cookie (name='westos'), then read it back within the same session
s = requests.session()
response1 = s.get('http://httpbin.org/cookies/set/name/westos')
response2 = s.get('http://httpbin.org/cookies')
print(response2.text)

Output:

Ignoring certificate verification

import requests

url = 'https://www.12306.cn'
response = requests.get(url, verify=False)
print(response.status_code)
print(response.text)

Output: (partial screenshot)

Setting a proxy / timeout

import requests

proxy = {
    'https': '171.221.239.11:808',
    'http': '218.14.115.211:3128',
}
response = requests.get('http://httpbin.org/get', proxies=proxy, timeout=10)
print(response.text)

To sum up: urllib and requests fetch page content; re and bs4 (beautifulsoup4) are the usual modules for parsing it, as the sketch below illustrates.
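As a small illustration of using re on fetched content, a sketch that pulls the <title> out of a page (the pattern is deliberately naive and assumes a well-formed title tag; real pages are usually better handled with bs4):

import re
import requests

response = requests.get('http://www.baidu.com')
response.encoding = 'utf-8'
# Non-greedy match between the title tags
match = re.search(r'<title>(.*?)</title>', response.text)
if match:
    print(match.group(1))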
6. Fetching Blog Content
from bs4 import BeautifulSoup
import requests
import pdfkit

url = "https://blog.csdn.net/zcx1203/article/details/83030349"

def get_blog_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html5lib')
    # Grab the contents of the <head> tag
    head = soup.head
    # Grab the blog title
    title = soup.find_all(class_="title-article")[0].get_text()
    # Grab the blog body
    content = soup.find_all(class_="article_content")[0]
    # Write everything to a local file
    other = 'http://passport.csdn.net/account/login?from='  # (unused in this version)
    with open('westos.html', 'w') as f:
        f.write(str(head))
        f.write('<h1>%s</h1>\n\n' % (title))
        f.write(str(content))

get_blog_content(url)
pdfkit.from_file('westos.html', 'westos.pdf')
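Note that pdfkit is a wrapper around the wkhtmltopdf command-line tool, which must be installed separately; if the binary is not on PATH, pdfkit can be pointed at it explicitly (the path below is a placeholder):

import pdfkit

# Placeholder path; point this at wherever wkhtmltopdf is actually installed
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf')
pdfkit.from_file('westos.html', 'westos.pdf', configuration=config)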
7. Scraping a 51CTO Blog
import requests
from bs4 import BeautifulSoup

url = 'http://blog.51cto.com/13885935/2296519'

def get_blog_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html5lib')
    # Grab the contents of the page's <head> tag
    head = soup.head
    # find_all returns a list of matching tags
    blog_title = soup.find_all(class_="artical-title")
    blog_content = soup.find_all(class_='artical-content')
    with open('blog_content.html', 'w') as f:
        f.write(str(head))
        f.write('<h1>%s</h1>\n\n' % blog_title)
        f.write(str(blog_content))

get_blog_content(url)
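Both of the last two scripts rely on third-party packages; assuming pip is available, something like the following installs them (package names as published on PyPI):

pip install requests beautifulsoup4 html5lib pdfkit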