Python 3 Scraping Practice (Scraping E-books)
By 阿新 · Published 2018-11-23
1. My Little Book House (我的小書屋)
This scraper collects e-book download links from the site http://mebook.cc/. (It is only a small practice project; it will be taken down on request.) The pages are parsed with BeautifulSoup.
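For illustration, the same kind of extraction can be sketched with only the standard library. This is not the article's approach (which uses BeautifulSoup's CSS selector `#primary .img a`); it is a simplified stand-in that grabs every anchor carrying both an `href` and a `title`, fed with a made-up HTML snippet:

```python
from html.parser import HTMLParser

class BookLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags in the listing markup.

    A stdlib stand-in for the BeautifulSoup selector '#primary .img a'
    used in the article; it simply keeps every anchor that has both
    attributes, which is good enough for this sketch.
    """
    def __init__(self):
        super().__init__()
        self.books = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs and "title" in attrs:
            self.books.append((attrs["href"], attrs["title"]))

parser = BookLinkParser()
parser.feed('<div id="primary"><div class="img">'
            '<a href="http://mebook.cc/25723.html" title="Some Book"></a>'
            '</div></div>')
print(parser.books)  # [('http://mebook.cc/25723.html', 'Some Book')]
```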
2. Scraper Source Code
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib.request
from bs4 import BeautifulSoup

# Programming-books category page
BASE_URL = "http://mebook.cc/category/gjs/bckf/"

def getbook(url):
    """Collect the link and title of every book on one listing page."""
    html_doc = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_doc, "html.parser", from_encoding="GB18030")
    links = soup.select('#primary .img a')
    for link in links:
        line = link['href'] + ' ' + link['title'] + '\n'
        print(line)
        bookfile(line)

def bookfile(line):
    """Append one book entry to a text file (post-processing pending)."""
    with open("file.txt", "a", encoding="utf-8") as fo:
        fo.write(line)

def test():
    """Walk all listing pages and collect every book link."""
    getbook(BASE_URL)
    for x in range(2, 18):
        page_url = BASE_URL + "page/" + str(x)
        try:
            getbook(page_url)
            bookfile("Page " + str(x) + "\n")
        except UnicodeEncodeError:
            continue

def getDownload(book_id):
    """Fetch the download links (and extraction passwords) for one book."""
    url = "http://mebook.cc/download.php?id=" + book_id
    html_doc = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_doc, "html.parser", from_encoding="GB18030")
    links = soup.select('.list a')
    for link in links:
        print(link)
    pwds = soup.select('.desc p')
    for pwd in pwds:
        print(pwd.encode(encoding='utf-8', errors='strict'))

# test
getDownload(str(25723))
```

(Two bugs in the original listing are fixed here: the book-entry string shadowed the built-in `str`, which then broke the `str(x)` call in `test()`; and `url` was reassigned inside `test()` without a `global` declaration, so the first `getbook(url)` would raise `UnboundLocalError`.)
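The script above calls `getDownload` with a hard-coded id (`25723`). Detail-page links on this site appear to embed that id in the URL; the exact pattern is an assumption here, but under the assumed shape `http://mebook.cc/<id>.html` a small regex could recover the id from each scraped `href`:

```python
import re

def extract_book_id(href):
    """Pull the numeric book id out of a detail-page URL.

    Assumes URLs shaped like http://mebook.cc/<id>.html; returns None
    when no id can be found.
    """
    match = re.search(r"/(\d+)\.html$", href)
    return match.group(1) if match else None

print(extract_book_id("http://mebook.cc/25723.html"))  # 25723
print(extract_book_id("http://mebook.cc/about/"))      # None
```

With a helper like this, the ids collected by `getbook` could be fed straight into `getDownload` instead of being typed in by hand.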
3. Results
4. Problems Encountered
4.1 GBK encoding problems when scraping with Python 3
The pages are served in a GBK-family encoding, and neither `decode('GBK')` nor `decode('GB18030')` worked directly. (Note that in Python 3 strings are already Unicode; the original claim that Python defaults to ASCII applies to Python 2. The errors here typically come from invalid bytes in the page, or from printing characters that the console's encoding cannot represent.)
One workaround is to fall back to plain string processing; see: https://www.yiibai.com/python/python_strings.html
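A minimal sketch of decoding such pages more defensively (assuming the server does not declare its charset reliably): decode the raw bytes as GB18030, which is a superset of GBK, and use `errors="replace"` so that stray invalid bytes are substituted with U+FFFD instead of raising `UnicodeDecodeError`. The sample bytes below are fabricated to demonstrate the failure mode:

```python
# GB18030 is a superset of GBK; errors="replace" makes decoding lossy
# but exception-free when the page contains invalid byte sequences.
raw = "我的小書屋".encode("gb18030") + b"\xff"  # valid GB18030 text plus one bad byte

strict_fails = False
try:
    raw.decode("gb18030")            # strict decoding raises on the bad byte
except UnicodeDecodeError:
    strict_fails = True

text = raw.decode("gb18030", errors="replace")  # never raises
print(strict_fails)  # True
print(text)          # 我的小書屋 followed by the U+FFFD replacement character
```

The companion `UnicodeEncodeError` on output can usually be avoided the same way: open output files with an explicit `encoding="utf-8"` rather than relying on the platform default.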