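Python 3 Web Crawler (1): Crawling the Web with urllib
Web crawlers
A web crawler, also known as a web spider or web robot (and, within the FOAF community, more often as a web scutter), is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
(Adapted from Baidu Baike; for details see https://baike.baidu.com/item/網絡爬蟲/5162711?fr=aladdin&fromid=22046949&fromtitle=%E7%88%AC%E8%99%AB)
The code and step-by-step explanations are adapted from http://cuijiahua.com and https://blog.csdn.net/c406495762/article/details/58716886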
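urllib
urllib is a package for working with URLs. It bundles several URL-handling modules:
- urllib.request: opens and reads URLs
- urllib.error: contains the exceptions raised by urllib.request, which can be caught with try/except
- urllib.parse: contains functions for parsing URLs
- urllib.robotparser: parses robots.txt files. It provides a single class, RobotFileParser, whose can_fetch() method tests whether a crawler is allowed to download a given page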
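To make the module list above more concrete, here is a minimal sketch of urllib.parse and urllib.robotparser that runs without any network access. The path "/posts?page=2", the example.com URLs, and the robots.txt rules are made up for illustration; they are not taken from any real site.

```python
from urllib import parse, robotparser

# Split a URL into its components with urllib.parse.
# The path and query string here are hypothetical.
parts = parse.urlparse("http://i.cnblogs.com/posts?page=2")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Percent-encode non-ASCII text so it can be placed in a URL, then reverse it.
encoded = parse.quote("網絡爬蟲")
decoded = parse.unquote(encoded)
print(encoded, decoded)

# Feed robots.txt rules to RobotFileParser directly instead of downloading them.
# These rules are invented for the example.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = robotparser.RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the download.
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```

In a real crawler you would more likely call set_url() with the site's robots.txt address and then read() to download the rules before calling can_fetch().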
urllib_test01.py
from urllib import request

if __name__ == "__main__":
    # Open the URL and read the raw response body (a bytes object).
    response = request.urlopen("http://i.cnblogs.com")
    html = response.read()
    print(html)
Output:
>>> RESTART: C:\Users\DELL\AppData\Local\Programs\Python\Python36\urllib_test01.py b‘\r\n<!DOCTYPE html>\r\n<html>\r\n<head>\r\n <meta charset="utf-8" />\r\n <meta name="viewport" content="width=device-width" />\r\n <title>\xe7\x94\xa8\xe6\x88\xb7\xe7\x99\xbb\xe5\xbd\x95 - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>\r\n <link rel="stylesheet" href="/scripts/bootstrap/css/bootstrap.min.css" />\r\n <link href="/scripts/ladda/ladda-themeless.min.css" rel="stylesheet" />\r\n <link href="/css/signin_bundle.css?v=L6jW_dned1XSxz8ohN2oMp1Q1fPUq1W5sWqqw6HNaH01" type="text/css" rel="stylesheet" /> \r\n <script src="/scripts/jquery.min.js"></script>\r\n <script src="/scripts/bootstrap/js/bootstrap.min.js"></script>\r\n <script src="/scripts/ladda/spin.min.js"></script>\r\n <script src="/scripts/ladda/ladda.min.js"></script>\r\n <script src="/scripts/jsencrypt.min.js"></script>\r\n <script>\r\n var return_url = \‘http://i.cnblogs.com/\‘;\r\n var ajax_url = \‘/user\‘ + \‘/signin\‘;\r\n var enable_captcha = false;\r\n var is_in_progress = false;\r\n </script>\r\n <script src="/scripts/signin_bundle.js?v=1spnpY8gb0K9MfNetxJoLoPjd7dN7PIKB8kMqcak-RQ1"></script>\r\n\r\n</head>\r\n<body onload="setFocus()">\r\n <div style="width: 100%;">\r\n <div align="center">\r\n <div id="Main">\r\n <noscript>\r\n <div style="font-size:15px;margin-bottom:20px;">\r\n \xe6\x82\xa8\xe7\x9a\x84\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xe6\x9c\xaa\xe5\x90\xaf\xe7\x94\xa8Javascript\xef\xbc\x8c\xe6\x97\xa0\xe6\xb3\x95\xe8\xbf\x9b\xe8\xa1\x8c\xe7\x99\xbb\xe5\xbd\x95\xe3\x80\x82\r\n </div>\r\n <style>\r\n form {\r\n display: none;\r\n }\r\n </style>\r\n </noscript>\r\n <form method="post" onsubmit="return false;">\r\n <div id="Heading">\xe7\x99\xbb\xe5\xbd\x95\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad - \xe4\xbb\xa3\xe7\xa0\x81\xe6\x94\xb9\xe5\x8f\x98\xe4\xb8\x96\xe7\x95\x8c</div>\r\n <div class="block">\r\n <label class="label-line">\xe7\x99\xbb\xe5\xbd\x95\xe7\x94\xa8\xe6\x88\xb7\xe5\x90\x8d(<a 
href="/GetUsername.aspx" tabindex="-1" class="tb_right">\xe6\x89\xbe\xe5\x9b\x9e</a>)</label>\r\n <input type="text" id="input1" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input1" class="tip"></span>\r\n </div>\r\n <div class="block">\r\n <label class="label-line">\xe5\xaf\x86\xe7\xa0\x81(<a href="/GetMyPassword.aspx" tabindex="-1" class="tb_right">\xe9\x87\x8d\xe7\xbd\xae</a>)</label>\r\n <input type="password" id="input2" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input2" class="tip"></span>\r\n </div>\r\n\r\n <div class="modal fade" id="checkWay" tabindex="-1" role="dialog" aria-hidden="true">\r\n <div class="modal-dialog">\r\n <div class="modal-content center-block">\r\n <div class="modal-header">\r\n <button type="button" class="close" data-dismiss="modal"><span aria-hidden="true">×</span><span class="sr-only">Close</span></button>\r\n <h4 class="modal-title">\r\n \xe8\xaf\xb7\xe5\xae\x8c\xe6\x88\x90\xe4\xba\xba\xe6\x9c\xba\xe8\xaf\x86\xe5\x88\xab\xe9\xaa\x8c\xe8\xaf\x81\r\n </h4>\r\n </div>\r\n <div class="modal-body">\r\n <div id="showLoading" class="ladda-button" data-style="zoom-in"></div>\r\n <div id="captchaBox" class="center-block">\r\n <span id="geetestLoading"> \xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81\xe7\xbb\x84\xe4\xbb\xb6\xe5\x8a\xa0\xe8\xbd\xbd\xe4\xb8\xad,\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90\x8e...</span>\r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n\r\n <div class="block">\r\n <input id="remember_me" type="checkbox" name="remember_me" onkeydown="check_enter(event)" /><label for="remember_me" onkeydown="check_enter(event)">\xe4\xb8\x8b\xe6\xac\xa1\xe8\x87\xaa\xe5\x8a\xa8\xe7\x99\xbb\xe5\xbd\x95</label>\r\n </div>\r\n <div class="block">\r\n <input type="submit" id="signin" class="button" value="\xe5\x8a\xa0\xe8\xbd\xbd\xe4\xb8\xad..." 
/> <span id="tip_btn" class="tip"></span>\r\n </div>\r\n <div class="block nav">\r\n » <a href="/register.aspx?ReturnUrl=http://i.cnblogs.com/" title="\xe6\xb3\xa8\xe5\x86\x8c\xe6\x88\x90\xe4\xb8\xba\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe7\x94\xa8\xe6\x88\xb7">\xe7\xab\x8b\xe5\x8d\xb3\xe6\xb3\xa8\xe5\x86\x8c</a><br />\r\n » <a href="http://www.cnblogs.com/ContactUs.aspx">\xe5\x8f\x8d\xe9\xa6\x88\xe9\x97\xae\xe9\xa2\x98</a>\r\n </div>\r\n </form>\r\n <div style="clear: both" />\r\n </div>\r\n </div>\r\n </div>\r\n</body>\r\n</html>\r\n‘ >>> |
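What we get back after fetching the page is a bytes object, that is, the raw undecoded response body. Normally a browser receives this data from the server, parses it, and displays the result. Here we can turn the page content into readable text with a simple decode() call and print it. The updated code is:
from urllib import request

if __name__ == "__main__":
    response = request.urlopen("http://i.cnblogs.com")
    html = response.read()
    # Decode the bytes into text; the page declares charset="utf-8".
    html = html.decode("utf-8")
    print(html)
Output: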
Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32
<html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width" /> <title>用戶登錄 - 博客園</title> <link rel="stylesheet" href="/scripts/bootstrap/css/bootstrap.min.css" /> <link href="/scripts/ladda/ladda-themeless.min.css" rel="stylesheet" /> <link href="/css/signin_bundle.css?v=L6jW_dned1XSxz8ohN2oMp1Q1fPUq1W5sWqqw6HNaH01" type="text/css" rel="stylesheet" /> <script src="/scripts/jquery.min.js"></script> <script src="/scripts/bootstrap/js/bootstrap.min.js"></script> <script src="/scripts/ladda/spin.min.js"></script> <script src="/scripts/ladda/ladda.min.js"></script> <script src="/scripts/jsencrypt.min.js"></script> <script> var return_url = ‘http://i.cnblogs.com/‘; var ajax_url = ‘/user‘ + ‘/signin‘; var enable_captcha = false; var is_in_progress = false; </script> <script src="/scripts/signin_bundle.js?v=1spnpY8gb0K9MfNetxJoLoPjd7dN7PIKB8kMqcak-RQ1"></script>
</head> <body onload="setFocus()"> <div style="width: 100%;"> <div align="center"> <div id="Main"> <noscript> <div style="font-size:15px;margin-bottom:20px;"> 您的瀏覽器未啟用Javascript,無法進行登錄。 </div> <style> form { display: none; } </style> </noscript> <form method="post" onsubmit="return false;"> <div id="Heading">登錄博客園 - 代碼改變世界</div> <div class="block"> <label class="label-line">登錄用戶名(<a href="/GetUsername.aspx" tabindex="-1" class="tb_right">找回</a>)</label> <input type="text" id="input1" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input1" class="tip"></span> </div> <div class="block"> <label class="label-line">密碼(<a href="/GetMyPassword.aspx" tabindex="-1" class="tb_right">重置</a>)</label> <input type="password" id="input2" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input2" class="tip"></span> </div>
<div class="modal fade" id="checkWay" tabindex="-1" role="dialog" aria-hidden="true"> <div class="modal-dialog"> <div class="modal-content center-block"> <div class="modal-header"> <button type="button" class="close" data-dismiss="modal"><span aria-hidden="true">×</span><span class="sr-only">Close</span></button> <h4 class="modal-title"> 請完成人機識別驗證 </h4> </div> <div class="modal-body"> <div id="showLoading" class="ladda-button" data-style="zoom-in"></div> <div id="captchaBox" class="center-block"> <span id="geetestLoading"> 驗證碼組件加載中,請稍後...</span> </div> </div> </div> </div> </div>
<div class="block"> <input id="remember_me" type="checkbox" name="remember_me" onkeydown="check_enter(event)" /><label for="remember_me" onkeydown="check_enter(event)">下次自動登錄</label> </div> <div class="block"> <input type="submit" id="signin" class="button" value="加載中..." /> <span id="tip_btn" class="tip"></span> </div> <div class="block nav"> » <a href="/register.aspx?ReturnUrl=http://i.cnblogs.com/" title="註冊成為博客園用戶">立即註冊</a><br /> » <a href="http://www.cnblogs.com/ContactUs.aspx">反饋問題</a> </div> </form> <div style="clear: both" /> </div> </div> </div> </body> </html>
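The fetch-and-decode step can also be wrapped in a small helper that catches the exceptions from urllib.error and prefers the charset declared in the response headers. This is only a sketch under stated assumptions: the function name fetch_text, its default_charset and timeout parameters, and the data: URL used for the demo are all made up for illustration and are not part of the original code.

```python
from urllib import error, request


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper name, not part of urllib.
    """
    try:
        with request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Prefer the charset from the Content-Type header, if any.
            charset = response.headers.get_content_charset() or default_charset
    except error.URLError as exc:
        # HTTPError is a subclass of URLError, so both are handled here.
        print("request failed:", exc.reason)
        return None
    # errors="replace" substitutes a marker for bytes that do not decode.
    return raw.decode(charset, errors="replace")


# A data: URL keeps the demo offline; urlopen handles this scheme itself.
print(fetch_text("data:text/plain;charset=utf-8,hello"))
```

Passing "http://i.cnblogs.com" instead of the data: URL would fetch the login page used in this post, provided the site is reachable.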
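Detecting a page's encoding automatically
Install the third-party library chardet, a module for detecting character encodings. Open a command prompt (cmd) and enter: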
pip install chardet
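This downloads and installs the library.
The new code:
# -*- coding: UTF-8 -*-
from urllib import request

import chardet

if __name__ == "__main__":
    response = request.urlopen("http://i.cnblogs.com/")
    html = response.read()
    # detect() guesses the encoding from the raw bytes.
    charset = chardet.detect(html)
    print(charset)
The result is a dictionary that tells us the page's encoding.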
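Once you have that dictionary, a natural next step is to use its "encoding" entry to decode the page. The sketch below assumes the result has the shape chardet.detect() returns (an "encoding" key that may be None, plus a "confidence" score). The helper name decode_with_guess and the sample values are hypothetical, and the dictionary is written by hand so the example runs without chardet installed.

```python
def decode_with_guess(raw, guess, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dictionary.

    decode_with_guess is a hypothetical helper, not part of chardet.
    """
    # detect() may report None when it cannot guess, so fall back.
    encoding = guess.get("encoding") or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong guess: decode leniently instead.
        return raw.decode(fallback, errors="replace")


# Hand-written sample standing in for chardet.detect(raw); values are made up.
raw = "<title>博客园</title>".encode("utf-8")
sample_guess = {"encoding": "utf-8", "confidence": 0.99}

text = decode_with_guess(raw, sample_guess)
print(text)
```

In the script above, this would amount to calling decode_with_guess(html, charset) after chardet.detect(html).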