Brother Cong Teaches You Python: Scraping the Jin Yong Novel Series

Without further ado, here is the code:

# -*- coding: utf-8 -*-
import urllib.request  
from bs4 import BeautifulSoup

# Fetch one chapter's title and text
def get_chapter(url):
    # Download the page's HTML source
    html = urllib.request.urlopen(url)
    content = html.read().decode('utf8')
    html.close()
    # Parse the source into an HTML tree
    soup = BeautifulSoup(content, "lxml")
    title = soup.find('h1').text    # chapter title
    text = soup.find('div', id='htmlContent')    # chapter body
    # Tidy up the chapter text so the layout is cleaner
    content = text.get_text('\n', strip=True).replace('\n', '\n    ')
    content = content.replace('  ', '\n  ')
    return title, '    ' + content

def main():
    # List of books to download
    books = ['射鵰英雄傳','天龍八部','鹿鼎記','神鵰俠侶','笑傲江湖','碧血劍','倚天屠龍記',\
             '飛狐外傳','書劍恩仇錄','連城訣','俠客行','越女劍','鴛鴦刀','白馬嘯西風',\
             '雪山飛狐']
    order = [1,2,3,4,5,6,7,8,10,11,12,14,15,13,9]  # each book's directory number in the site URL
    # page-number boundaries: book i spans pages page_range[i] to page_range[i+1]-1
    page_range = [1,43,94,145,185,225,248,289,309,329,341,362,363,364,375,385]

    for i,book in enumerate(books):
        for num in range(page_range[i],page_range[i+1]):
            url = "http://jinyong.zuopinj.com/%s/%s.html"%(order[i],num)
            # Error handling: skip any page that fails to download or parse
            try:
                title, chapter = get_chapter(url)
                # Append each chapter to the book's txt file (the D:/book folder must already exist)
                with open('D:/book/%s.txt' % book, 'a', encoding='gb18030') as f:
                    f.write(title + '\n\n\n')
                    f.write(chapter + '\n\n\n')
                print(book + ':' + title + ' --> written!')
            except Exception as e:
                print(e)
    print('All books written!')

main()
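
Before kicking off the full loop, you can sanity-check get_chapter on a single page. The snippet below is just an illustrative sketch: it builds a URL from the same http://jinyong.zuopinj.com/<book>/<page>.html pattern the script uses, with book 1, page 1 chosen arbitrarily.

# Quick sanity check: fetch one chapter and preview it
# (book 1 / page 1 is only an example built from the same URL pattern as above)
title, chapter = get_chapter("http://jinyong.zuopinj.com/1/1.html")
print(title)
print(chapter[:200])    # preview the first 200 characters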

Each novel is written to its own txt file. Opening a plain txt file to read is admittedly not a great experience, so Brother Cong has a tip for you:

You can convert a txt file to pdf through the following site:

http://www.pdfdo.com/txt-to-pdf.aspx
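
If you would rather do the conversion locally in Python instead of through the website, here is a minimal sketch using the third-party fpdf2 package. This is my own addition, not part of the original workflow: fpdf2 must be installed (pip install fpdf2), and font.ttf below is a placeholder for the path of any font file that contains Chinese glyphs.

# Minimal sketch: convert one of the generated txt files to pdf with fpdf2.
# Assumptions: fpdf2 is installed, and font.ttf stands in for a
# CJK-capable TrueType font available on your machine.
from fpdf import FPDF

def txt_to_pdf(txt_path, pdf_path, font_path='font.ttf'):
    pdf = FPDF()
    pdf.add_font('book', fname=font_path)   # register a Unicode font
    pdf.set_font('book', size=12)
    pdf.add_page()
    with open(txt_path, 'r', encoding='gb18030') as f:   # same encoding the scraper used
        for line in f:
            pdf.multi_cell(0, 8, line.rstrip())   # wraps long lines automatically
    pdf.output(pdf_path)

txt_to_pdf('D:/book/射鵰英雄傳.txt', 'D:/book/射鵰英雄傳.pdf')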

Either way, once the books are converted to pdf, the reading experience is much better. I hope this article helps everyone.