Python Learning Journey (28)
Python Basics (27): Common Built-in Modules (III)
1. urllib
urllib provides a set of functions for working with URLs.
A URL (Uniform Resource Locator) is a compact notation for the location of a resource available on the Internet and the method for accessing it; it is the address of a standard Internet resource.
Every file on the Internet has a unique URL, which encodes both where the file is located and how a browser should handle it.
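As a quick illustration (not part of the original text), the standard library's urllib.parse can split a URL into its component parts:

```python
from urllib.parse import urlparse

# Split a URL into its standard components
u = urlparse('https://api.douban.com/v2/book/2129650?fields=title#top')
print(u.scheme)    # https
print(u.netloc)    # api.douban.com
print(u.path)      # /v2/book/2129650
print(u.query)     # fields=title
print(u.fragment)  # top
```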
(1) GET
urllib's request module makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.
```python
# Fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))
```

Result:

```
Status: 200 OK
Date: Sun, 09 Dec 2018 01:23:48 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2138
Connection: close
Vary: Accept-Encoding
X-Ratelimit-Remaining2: 99
X-Ratelimit-Limit2: 100
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=fdBz3SLSf0s; Expires=Mon, 09-Dec-19 01:23:48 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: fdBz3SLSf0s
X-DAE-Node: brand55
X-DAE-App: book
Server: dae
X-Frame-Options: SAMEORIGIN
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰"],...}
```
To simulate a browser sending a GET request, we need a Request object; by adding HTTP headers to the Request object, we can disguise the request as coming from a browser.
```python
# Pretend to be an iPhone 6 requesting the Douban home page
from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
```

Result (the server now returns the mobile version of the page):

```
<title>豆瓣(手機版)</title>
<meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
<meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
<meta name="format-detection" content="telephone=no">
<link rel="canonical" href="http://m.douban.com/">
<link href="https://img3.doubanio.com/f/talion/4b1de333c0e597678522bd3c3af276ba6c667b95/css/card/base.css" rel="stylesheet">
```
(2) POST
To send a POST request, simply pass the request body in the data parameter, encoded as bytes.
```python
# Simulate a Weibo login: first read the login email and password
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])
req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
```

Result:

```
Login to weibo.cn...
Email: email
Password: password
Status: 200 OK
Server: nginx/1.6.1
Date: Sun, 09 Dec 2018 02:01:40 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
DPOOL_HEADER: 85-144-160-aliyun-core.jpool.sinaimg.cn
Set-Cookie: login=9da7cd806ada2c22779667e8e1c039c2; Path=/
Data: {"retcode":50011002,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"email","errline":669}}
```
(3) Handler
For more sophisticated control, such as accessing a site through a proxy, we can use a ProxyHandler.
```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
```
2. XML
There are two ways to work with XML: DOM and SAX.
DOM reads the entire XML document into memory and parses it into a tree, so it uses a lot of memory and parses slowly; the advantage is that you can traverse the tree's nodes freely.
SAX is stream-based: it parses the document as it reads it, so it uses little memory and is fast; the drawback is that you must handle the parsing events yourself.
In general, prefer SAX, because DOM simply uses too much memory.
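For comparison, here is a minimal DOM sketch using the standard library's xml.dom.minidom: the whole document sits in memory, so its nodes can be visited in any order.

```python
from xml.dom.minidom import parseString

# DOM: parse the whole document into an in-memory tree
doc = parseString('<ol><li>Python</li><li>Ruby</li></ol>')
# The tree can now be traversed freely, in any order
for li in doc.getElementsByTagName('li'):
    print(li.firstChild.data)
```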
Parsing XML
Parsing XML with SAX in Python is very concise. The events we usually care about are start_element, end_element, and char_data; with these three handler functions in place, we can parse XML.
For example, when the parser reads <a href="/">python</a>, start_element receives <a href="/">, char_data receives the text python, and end_element receives </a>.
```python
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('sax:end_element: %s' % name)

    def char_data(self, text):
        print('sax:char_data: %s' % text)

xml = r'''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

# Wire the three handlers to an expat parser and run it
handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml)
```
Generating XML
The simplest and most effective way to generate XML is string concatenation.
```python
from xml.sax.saxutils import escape

def make_xml(text):
    # Build the document by joining string fragments;
    # escape() handles the special characters &, < and >
    L = []
    L.append(r'<?xml version="1.0"?>')
    L.append(r'<root>')
    L.append(escape(text))
    L.append(r'</root>')
    return ''.join(L)

print(make_xml('some & data'))  # <?xml version="1.0"?><root>some &amp; data</root>
```
If you need to generate complex XML, it is usually better to use JSON instead.
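That said, if XML output is still required, a hedged alternative sketch is to build it with the standard library's xml.etree.ElementTree, which takes care of escaping automatically:

```python
import xml.etree.ElementTree as ET

# Build the same document programmatically; ElementTree escapes text for us
root = ET.Element('root')
root.text = 'some & data'
print(ET.tostring(root, encoding='unicode'))  # <root>some &amp; data</root>
```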
3. HTMLParser
With HTMLParser, we can extract text, images, and other content from a web page.
HTML looks a lot like XML, but its syntax is far less strict, so HTML cannot be parsed reliably with a standard XML DOM or SAX parser.
Fortunately, Python provides HTMLParser, which makes parsing HTML very convenient.
```python
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
<p>Some <a href=\"#\">html</a> HTML tutorial...<br>END</p>
</body></html>''')
```

Result:

```
<html>
<head>
</head>
<body>
<!-- test html parser -->
<p>
Some 
<a>
html
</a>
 HTML tutorial...
<br>
END
</p>
</body>
</html>
```
The feed() method can be called multiple times; you don't have to pass the whole HTML string in one go, you can feed it piece by piece.
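A minimal sketch of chunked feeding: the same parser instance consumes the document in arbitrary pieces, buffering incomplete input between calls.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

parser = TextCollector()
# feed() can be called repeatedly with partial input,
# even when a chunk splits an element in the middle
for chunk in ['<p>Hel', 'lo</p>', '<p>World</p>']:
    parser.feed(chunk)
parser.close()  # flush any buffered data
print(''.join(parser.text))  # HelloWorld
```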
There are two kinds of special characters: named references such as &nbsp; and numeric references such as &#1234;. Both can be parsed out by HTMLParser.
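A sketch of handling both kinds. Note that since Python 3.5 HTMLParser defaults to convert_charrefs=True, which converts references to plain text automatically; pass convert_charrefs=False if you want the handle_entityref and handle_charref callbacks to fire, as below.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint

class EntityParser(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so the reference callbacks are invoked
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        # Named reference, e.g. &nbsp;: look up its code point
        self.out.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        # Numeric reference, decimal &#1234; or hexadecimal &#x4D2;
        if name.startswith('x'):
            self.out.append(chr(int(name[1:], 16)))
        else:
            self.out.append(chr(int(name)))

p = EntityParser()
p.feed('A&nbsp;B &#1234;')
p.close()
print(''.join(p.out))  # 'A\xa0B Ӓ'
```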