
Python 3 Crawler Practice: Scraping E-books

1. My Little Book House

  This crawler scrapes e-book download paths from the site http://mebook.cc/. (It is just a small practice project; it will be taken down if it infringes.)

 

  The pages are parsed with BeautifulSoup.
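Before the full script, here is a minimal, self-contained sketch of what BeautifulSoup's CSS selectors do with markup shaped like the site's listing pages. The HTML snippet and its href/title values are made up for illustration; the real crawler fetches them from mebook.cc:

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page fragment; the real page is fetched over HTTP.
html_doc = """
<div id="primary">
  <div class="img"><a href="http://mebook.cc/25723.html" title="Sample Book">cover</a></div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# Select every <a> inside an element with class "img" under #primary,
# exactly the selector the crawler below uses.
for link in soup.select('#primary .img a'):
    print(link['href'], link['title'])
```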

2. Crawler Source Code

```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib.request
from bs4 import BeautifulSoup

# Programming-books category
url = "http://mebook.cc/category/gjs/bckf/"

# Collect the link and title of every book on one listing page
def getbook(page_url):
    html_doc = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html_doc, "html.parser", from_encoding="GB18030")
    links = soup.select('#primary .img a')
    for link in links:
        line = link['href'] + link['title'] + '\n'
        print(line)
        bookfile(line)

# Append each book link to a txt file (further processing pending)
def bookfile(line):
    with open("file.txt", "a") as fo:
        fo.write(line)

# Walk all listing pages and collect every book link
def test():
    getbook(url)
    for x in range(2, 18):
        page_url = "http://mebook.cc/category/gjs/bckf/page/" + str(x)
        try:
            getbook(page_url)
            bookfile("page " + str(x) + "\n")
        except UnicodeEncodeError:
            continue

# Fetch the download links (and extraction passwords) for one book
def getDownload(id):
    download_url = "http://mebook.cc/download.php?id=" + id
    html_doc = urllib.request.urlopen(download_url).read()
    soup = BeautifulSoup(html_doc, "html.parser", from_encoding="GB18030")
    for link in soup.select('.list a'):
        print(link)
    for pwd in soup.select('.desc p'):
        print(pwd.encode(encoding='utf-8', errors='strict'))

# test
getDownload(str(25723))
```
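The script above passes a hard-coded id (25723) to getDownload. If the listing URLs have the shape `http://mebook.cc/<id>.html` (an assumption here, not confirmed by the source), a small regex can bridge the two steps; `extract_id` is a hypothetical helper, not part of the original script:

```python
import re

# Hypothetical helper: pull the numeric book id out of a listing URL
# such as "http://mebook.cc/25723.html", so it can be fed to getDownload.
def extract_id(book_url):
    m = re.search(r'/(\d+)\.html$', book_url)
    return m.group(1) if m else None

print(extract_id("http://mebook.cc/25723.html"))  # → 25723
```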

 

3. Crawl Results

  

 

4. Issues Found

  4.1 The GBK encoding problem when scraping site content with Python 3

    Decoding alone does not solve it: neither decode('GBK') nor decode('GB18030') works on its own, because the UnicodeEncodeError the script catches is raised later, when the decoded text is re-encoded for output.

    Consider handling it with string processing instead; see: https://www.yiibai.com/python/python_strings.html
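A minimal sketch of the encoding round-trip, assuming the site's bytes are GB18030-encoded as the crawler above does: the decode succeeds once the codec is named explicitly, and the failure only appears when printing to an output stream that cannot represent the characters, which a fallback error handler avoids:

```python
# Stand-in for bytes read from the site; the real crawler gets them
# from urllib.request.urlopen(...).read().
raw = "程式設計".encode("gb18030")

# Explicit decode with the right codec succeeds.
text = raw.decode("gb18030")
print(text)

# If the console's encoding is the real problem, re-encode with a
# fallback error handler instead of letting print() raise.
safe = text.encode("ascii", errors="backslashreplace").decode("ascii")
print(safe)
```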