
Python: URL Parsing

1. Crawlers

(1) What happens when you browse a web page
Browser (sends a request) -> you enter a URL (e.g. http://www.baidu.com/index.html, file:///mnt, ftp://172.25.254.250/pub)
-> the protocol (http) and the domain to visit (www.baidu.com) are determined -> a DNS server resolves the domain to an IP address
-> the requested page content is located -> the page content is returned to the browser (the response)
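Since every request starts from a URL, it helps to see how one breaks down into parts. A minimal sketch using the standard library's urllib.parse (the example URL is arbitrary):

from urllib.parse import urlparse

# Split a URL into its components: scheme, host, path, query string, fragment
parts = urlparse('http://www.baidu.com/index.html?wd=python#top')
print(parts.scheme)    # 'http'
print(parts.netloc)    # 'www.baidu.com'
print(parts.path)      # '/index.html'
print(parts.query)     # 'wd=python'
print(parts.fragment)  # 'top'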
(2) Fetching a web page
1). Basic method
from urllib import request
from urllib.error import URLError

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)
    content = response.read().decode('utf-8')
    print(content)
except URLError as e:
    print("Request timed out:", e.reason)
2). Using a Request object (allows adding custom header information)
from urllib import request
from urllib.error import URLError

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
try:
    # Instantiate a Request object; custom request headers can be set here
    req = request.Request(url, headers=headers)
    # urlopen accepts either a URL string or a Request object
    content = request.urlopen(req).read().decode('utf-8')
    print(content)
except URLError as e:
    print(e.reason)
else:
    print("success")
Result:

Adding header information after construction:
from urllib import request
from urllib.error import URLError

url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
try:
    # Instantiate a Request object, then attach the header afterwards
    req = request.Request(url)
    req.add_header('User-Agent', user_agent)
    # urlopen accepts either a URL string or a Request object
    content = request.urlopen(req).read().decode('utf-8')
    print(content)
except URLError as e:
    print(e.reason)
else:
    print("success")
Result: (partial screenshot)

(3) Coping with anti-crawler measures
1). Impersonate a browser (as above). Some common User-Agent strings:
1. Android
Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
2. Firefox
Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
3. Google Chrome
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
4. iOS
Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
2). IP proxies
A crawler fetches pages far faster than a human, so a single fixed IP ends up making requests at a very high frequency; a site with anti-crawler measures will then ban that IP. How to work around this?
- Add a delay: time.sleep(random.randint(1, 5))
- Use an IP proxy, so other IPs make the requests on your behalf.
Where to get proxy IPs? http://www.xicidaili.com/
Implementation steps (see the code below):
1). Call urllib.request.ProxyHandler(proxies=None) -- roughly analogous to a Request object
2). Build an opener -- roughly analogous to urlopen, but customized
3). Install the opener
4). Choose a proxy IP
from urllib import request
from urllib.error import URLError

url = 'https://httpbin.org/get'
proxy = {'https': '171.221.239.11:808', 'http': '218.14.115.211:3128'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
# Call urllib.request.ProxyHandler(proxies=None) -- roughly analogous to a Request object
proxy_support = request.ProxyHandler(proxy)
# Build an opener -- roughly analogous to urlopen, but customized
opener = request.build_opener(proxy_support)
# Impersonate a browser
opener.addheaders = [('User-Agent', user_agent)]
# Install the opener so request.urlopen uses it
request.install_opener(opener)
# Requests now go out through the chosen proxy IP
response = request.urlopen(url)
content = response.read().decode('utf-8')
print(content)

Result:

2. Saving cookie information

1> What is cookie information?
Some websites need to identify users: certain pages can only be accessed after logging in. Cookies make this session tracking possible by saving user information, such as the user name, on the local client. An example raw cookie string:
"uuid_tt_dd=10_19652618930-1532935869256-858290;
Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1538533772,1538552856,1539399036;
ADHOC_MEMBERSHIP_CLIENT_ID1.0=99bc6db6-14c6-e386-1bd0-89ff0447e258;
 smidV2=20180908162356b9a9b99821267a2d7b2fccd5f4d8129d00a43c058f4e41e20;
 UN=gf_lvah;
 BT=1539399071335;
 dc_session_id=10_1538533770767.652739;
 TY_SESSION_ID=85f3368a-4382-44cc-b49e-5c4e0d13080c;
 dc_tos=pginxg; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1539399076;
 UserName=gf_lvah;
 UserInfo=QAG2Pho%2B1xl0ZfBdwCCAFT9u9yaIgalLraXhrQ7UQ%2FKy9YH9nIraCT%2FwEBU%2BMDLiXwKXmBNQWmFDaRvwXkGFH%2FhbJ3q6ceS69ezJDfFxagisBJar1pLzXWsVJ4A1AqXX; UserNick=MyWestos; AU=516; UserToken=QAG2Pho%2B1xl0ZfBdwCCAFT9u9yaIgalLraXhrQ7UQ%2FKy9YH9nIraCT%2FwEBU%2BMDLiXwKXmBNQWmFDaRvwXkGFH%2FhbJ3q6ceS69ezJDfFxagj0%2BCh019zDdo5BtfMg44vykcNOlmP2fNOJZiwZg1%2B5egpvFxxqU4W42NpY7LWVgV4%3D"
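The string above is just a set of key=value pairs separated by semicolons. As a minimal illustration (an addition, not part of the original walkthrough), the standard library's http.cookies.SimpleCookie can parse such a string:

from http.cookies import SimpleCookie

# Parse a raw cookie string into individual key/value pairs
raw = "UserName=gf_lvah; UserNick=MyWestos; AU=516"
cookie = SimpleCookie()
cookie.load(raw)
for key, morsel in cookie.items():
    print(key, '=', morsel.value)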
2> Implementation steps:
from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# How to store cookies in a variable (or in a file):
# declare a CookieJar ---> FileCookieJar ---> MozillaCookieJar
cookie = cookiejar.CookieJar()
# Use urllib.request's HTTPCookieProcessor to create a cookie handler
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler
# (the default opener, as used by urlopen, has no cookie handling)
opener = request.build_opener(handler)
# Open the URL
response = opener.open('http://www.baidu.com')
# Print the page's cookie information
print(cookie)
for item in cookie:
    print(item)
Result:



3> Saving cookies to a file in a specified format
from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# Set the file name used to save cookies
cookieFilename = 'cookie.txt'
# Declare a MozillaCookieJar, which holds cookies and can write them to a file
cookie = cookiejar.MozillaCookieJar(filename=cookieFilename)
# Use urllib.request's HTTPCookieProcessor to create a cookie handler
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler
opener = request.build_opener(handler)
# Open the URL
response = opener.open('http://www.baidu.com')
# Print the cookies, then save them to the file
print(cookie)
print(type(cookie))
cookie.save(ignore_discard=True, ignore_expires=True)
Result:

from http import cookiejar
from urllib.request import HTTPCookieProcessor
from urllib import request

# How to load cookies from a file and use them for a request:
# point at the file where the cookies were saved
cookieFilename = 'cookie.txt'
# Declare a MozillaCookieJar; here it is used to read the saved cookies back in
cookie = cookiejar.MozillaCookieJar()
# Read the cookie contents from the file
cookie.load(filename=cookieFilename)
# Use urllib.request's HTTPCookieProcessor to create a cookie handler
handler = HTTPCookieProcessor(cookie)
# Build an opener from the cookie handler
opener = request.build_opener(handler)
# Open the URL
response = opener.open('http://www.baidu.com')
# Print the page content
print(response.read().decode('utf-8'))

Result: (partial screenshot)

3. Common urllib exception handling

Exceptions:
     exception urllib.error.URLError
     exception urllib.error.HTTPError
     exception urllib.error.ContentTooShortError(msg, content)
from urllib import request, error
try:
    url = 'https://www.baidu.com/hello.html'
    response = request.urlopen(url)
    print(response.read().decode('utf-8'))
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("success")
Result:


Timeout exception handling
from urllib import request, error
import socket
try:
    url = 'https://www.baidu.com'
    response = request.urlopen(url, timeout=0.01)
    print(response.read().decode('utf-8'))
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
    if isinstance(e.reason, socket.timeout):
        print("timed out")
else:
    print("success")

Result:

4. The requests module

Introductory example
import requests
url = 'http://www.baidu.com'
response = requests.get(url)
print(response)
print(response.status_code)
print(response.cookies)
print(response.text)
print(type(response.text))
Result:



Common request methods
import requests
response = requests.post('http://httpbin.org/post', data={'name' : 'fentiao', 'age':10})
print(response.text)
response = requests.delete('http://httpbin.org/delete', data={'name' : 'fentiao'})
print(response.text)
Result:



GET requests with parameters
import requests
data = {
    'start': 20,
    'limit': 40,
    'sort': 'new_score',
    'status': 'P',
}
url = 'https://movie.douban.com/subject/4864908/comment'
response = requests.get(url, params=data)
print(response.url)
Result:


Parsing a JSON response
import requests
ip = input("Enter the IP address to look up: ")
url = "http://ip.taobao.com/service/getIpInfo.php?ip=%s" %(ip)
response = requests.get(url)
content  = response.json()
print(content)
print(type(content))
Result:



Fetching binary data
import requests
url = 'https://gss0.bdstatic.com/-4o3dSag_xI4khGkpoWK1HF6hhy/baike/w%3D268%3Bg%3D0/sign=4f7bf38ac3fc1e17fdbf8b3772ab913e/d4628535e5dde7119c3d076aabefce1b9c1661ba.jpg'
response = requests.get(url)
# Binary data lives in response.content; response.text is the text-decoded form
with open('github.png', 'wb') as f:
    f.write(response.content)
Downloading a video
import requests
url = "http://gslb.miaopai.com/stream/sJvqGN6gdTP-sWKjALzuItr7mWMiva-zduKwuw__.mp4"
response = requests.get(url)
with open('/tmp/learn.mp4', 'wb') as f:
       f.write(response.content)
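Written this way, requests buffers the entire video in memory before the file is written. For large files a streamed download is safer; a minimal sketch of the same download (the 1 MB chunk size is an arbitrary choice):

import requests

url = "http://gslb.miaopai.com/stream/sJvqGN6gdTP-sWKjALzuItr7mWMiva-zduKwuw__.mp4"
# stream=True defers downloading the body until iter_content is called
response = requests.get(url, stream=True)
with open('/tmp/learn.mp4', 'wb') as f:
    # Write the response in 1 MB chunks instead of holding it all in memory
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)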
Adding header information
import requests
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
headers = {
    'User-Agent': user_agent
}
response = requests.get(url, headers=headers)
print(response.text)
print(response.status_code)
Result:


Working with the response
Commonly used attributes of the response object:
response = requests.get(url, headers=headers)
print(response.headers)
print(response.url)
Checking the status code:
response = requests.get(url, headers=headers)
exit() if response.status_code != 200 else print("Request succeeded")
Result:
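As an aside not in the original walkthrough, requests also has a built-in helper for this check; a minimal sketch:

import requests

response = requests.get('http://httpbin.org/get')
# raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx responses
response.raise_for_status()
print("Request succeeded")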

Advanced settings
Uploading a file
import requests

# The file to upload (stored in a dict)
data = {'file': open('github.png', 'rb')}
response = requests.post('http://httpbin.org/post', files=data)
print(response.text)
Result:

Getting cookie information
import requests
response = requests.get('http://www.csdn.net')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + "=" + value)
Result:

Reusing existing cookie information across requests (session persistence)
import requests

# Set a cookie on the session: name='westos'
s = requests.session()
response1 = s.get('http://httpbin.org/cookies/set/name/westos')
response2 = s.get('http://httpbin.org/cookies')
print(response2.text)
Result:

Ignoring certificate verification
import requests
url = 'https://www.12306.cn'
response = requests.get(url, verify=False)
print(response.status_code)
print(response.text)
Result: (partial screenshot)
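With verify=False, urllib3 emits an InsecureRequestWarning on every request. If that noise is unwanted it can be silenced; a minimal sketch (an addition, not from the original; assumes urllib3 is importable, which it is as a dependency of requests):

import urllib3
import requests

# Suppress the InsecureRequestWarning emitted when verify=False is used
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)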


Proxy settings / setting a timeout
import requests
proxy = {
    'https': '171.221.239.11:808',
    'http': '218.14.115.211:3128'
}
response = requests.get('http://httpbin.org/get', proxies=proxy,  timeout=10)
print(response.text)
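If the proxy is slow or dead, the request above fails once the timeout expires. A minimal sketch of catching that case (the deliberately tiny timeout forces the failure; error handling added here, not in the original):

import requests

try:
    response = requests.get('http://httpbin.org/get', timeout=0.01)
    print(response.text)
except requests.exceptions.Timeout:
    # Raised when no response arrives within the timeout window
    print("Request timed out")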
Modules for fetching page content: urllib, requests
Modules commonly used for parsing pages: re, bs4 (beautifulsoup4)
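Before the BeautifulSoup examples below, a minimal sketch of the re approach: pulling all absolute link targets out of a fetched page (the pattern is a rough illustration, not a robust HTML parser):

import re
import requests

response = requests.get('http://www.baidu.com')
# Find every href="http..." attribute value in the raw HTML
links = re.findall(r'href="(http[^"]+)"', response.text)
for link in links:
    print(link)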

6. Fetching blog content

from bs4 import BeautifulSoup
import requests
import pdfkit

url = "https://blog.csdn.net/zcx1203/article/details/83030349"

def get_blog_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html5lib')
    # Get the contents of the head tag
    head = soup.head
    # Get the blog title
    title = soup.find_all(class_="title-article")[0].get_text()
    # Get the blog body
    content = soup.find_all(class_="article_content")[0]
    # Write both to a local HTML file
    other = 'http://passport.csdn.net/account/login?from='
    with open('westos.html', 'w') as f:
        f.write(str(head))
        f.write('<h1>%s</h1>\n\n' % title)
        f.write(str(content))

get_blog_content(url)
# Convert the saved HTML file to PDF
pdfkit.from_file('westos.html', 'westos.pdf')

7. Crawling a 51CTO blog

import requests
from bs4 import BeautifulSoup

url = 'http://blog.51cto.com/13885935/2296519'

def get_blog_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html5lib')
    # Get the contents of the head tag
    head = soup.head
    # find_all returns a result list; take the first match
    blog_title = soup.find_all(class_="artical-title")[0].get_text()
    blog_content = soup.find_all(class_='artical-content')[0]
    with open('blog_content.html', 'w') as f:
        f.write(str(head))
        f.write('<h1>%s</h1>\n\n' % blog_title)
        f.write(str(blog_content))

get_blog_content(url)