Python 3 Web Scraping: Crawling a Novel with Beautiful Soup

阿新 · Published: 2018-12-22

These are study notes on http://blog.csdn.net/c406495762/article/details/71158264
Author: Jack-Cui
Blog: http://blog.csdn.net/c406495762

Platform: OSX
Python version: Python 3.x
IDE: PyCharm

I. Introduction to Beautiful Soup

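Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.
Beautiful Soup works on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or raw speed.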
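
A minimal sketch of the encoding point above, assuming a GBK-encoded byte string that declares no charset. The sample markup and the `greeting` id are made up for illustration; `from_encoding` is the constructor argument used to state the original encoding, and the built-in `html.parser` is chosen here only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: GBK bytes with no <meta charset> declaration
raw_bytes = '<html><body><p id="greeting">你好</p></body></html>'.encode('gbk')

# State the original encoding explicitly; Beautiful Soup converts to Unicode
soup = BeautifulSoup(raw_bytes, 'html.parser', from_encoding='gbk')
greeting = soup.find(id='greeting').get_text()
print(greeting)
```

When the bytes are already decoded by hand, as the scripts later in these notes do with `decode('gbk', 'ignore')`, `from_encoding` is not needed.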

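II. Learning Beautiful Soup

I recommend the article "Python爬蟲利器二之Beautiful Soup的用法" (Python Crawler Tools, Part 2: Using Beautiful Soup). It is similar to the official documentation but more condensed.
The official Beautiful Soup documentation is a further reference.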

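III. Hands-on Practice

Novel site: 筆趣看 (Biqukan)
URL: http://www.biqukan.com/
Using this site as the example, crawl the novel 《神墓》.
Overall approach:
1. Pick one chapter of 《神墓》, inspect the page, and find the tag that holds the body text
2. Try printing that chapter's body text with BeautifulSoup
3. Crawl the URLs of all chapters from the index page and print them in a for loop
4. Combine steps 3 and 2, replacing the printing in step 2 with writing to a file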

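Detailed steps:

1. Pick the first chapter of 《神墓》, inspect the page, and find the HTML tag that holds the body text
Chapter URL: http://www.biqukan.com/3_3039/1351331.html

The body text of the novel sits in this tag:
<div id='content' class='showtxt'>body text</div>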
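
Before touching the network, the lookup can be tried on a small inline page. This is only a sketch: `SAMPLE_PAGE` is a hypothetical stand-in for the real chapter page, keeping just the `id` and `class` observed above, and `html.parser` is used instead of lxml to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the chapter page
SAMPLE_PAGE = """
<html><body>
  <div class="header">Site navigation</div>
  <div id="content" class="showtxt">Body text of the chapter</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_PAGE, 'html.parser')
# class is a Python keyword, so Beautiful Soup spells the filter class_
content = soup.find('div', id='content', class_='showtxt')
# find() returns None when nothing matches, so guard before reading text
body_text = content.get_text(strip=True) if content is not None else None
print(body_text)
```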

2. Try printing that chapter's body text with BeautifulSoup

from urllib import request
from bs4 import BeautifulSoup

def download_specified_chapter(chapter_url, header, coding, chapter_name=None):
    # Build a Request object from the URL and headers
    download_req = request.Request(chapter_url, headers=header)
    # Open the request with urlopen and get the response
    response = request.urlopen(download_req)
    # Read the page and decode it, ignoring bytes that cannot be decoded
    download_html = response.read().decode(coding, 'ignore')
    # Parse the HTML with Beautiful Soup
    origin_soup = BeautifulSoup(download_html, 'lxml')
    # Locate the body text of the chapter
    content = origin_soup.find(id='content', class_='showtxt')
    # repr() shows that the text is full of \xa0 (&nbsp; in the HTML)
    # and contains no line breaks
    print(repr(content.text))
    # Tidy the layout: replace every run of eight \xa0 with a newline
    # (&nbsp; in the HTML becomes \xa0 once the text is extracted)
    txt = content.text.replace('\xa0' * 8, '\n')
    print(txt)

if __name__ == "__main__":
    target_url = 'http://www.biqukan.com/3_3039/1351331.html'
    header = {
        'User-Agent':'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/'
                     '535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'
    }
    download_specified_chapter(target_url, header, 'gbk')

Result:
[screenshot]
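
The whitespace cleanup in the function above can be checked on its own with a plain string. A minimal sketch: the helper name `split_paragraphs` and the sample text are made up for illustration, and the width of eight follows the `'\xa0' * 8` used in the code above.

```python
def split_paragraphs(raw_text, indent_width=8):
    """Replace each run of `indent_width` non-breaking spaces with a newline."""
    return raw_text.replace('\xa0' * indent_width, '\n')

# Hypothetical sample: two paragraphs, each preceded by eight \xa0
sample = '\xa0' * 8 + 'First paragraph.' + '\xa0' * 8 + 'Second paragraph.'
cleaned = split_paragraphs(sample)
print(repr(cleaned))  # '\nFirst paragraph.\nSecond paragraph.'
```

Runs shorter than eight `\xa0` are left untouched, so a page that indents differently would need another width.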

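3. Crawl the URLs of all chapters from the index page and print them in a for loop
Index URL: http://www.biqukan.com/3_3039/

Inspecting the page shows that the chapter URLs we need are in the <a> tags inside <dd>, inside <dl>, under <div class="listmain">, and that they come after <dt>《神墓》正文卷</dt>.
Try printing them with a for loop: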

from urllib import request
from bs4 import BeautifulSoup

if __name__ == "__main__":
    index_url = "http://www.biqukan.com/3_3039/"
    header={
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/'
                      '535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'
    }
    # Build the request from the URL and headers
    url_req = request.Request(index_url, headers=header)
    # Open the URL and get the response
    response = request.urlopen(url_req)
    # Read the response and decode it as GBK to get the HTML
    html = response.read().decode('gbk', 'ignore')
    # Parse the HTML with Beautiful Soup
    html_soup = BeautifulSoup(html, 'lxml')
    # index = BeautifulSoup(str(html_soup.find_all('div', class_='listmain')),'lxml')
    # print(html_soup.find_all(['dd', 'dt']))
    # Flag: has the <dt> for 《神墓》正文卷 been seen yet?
    body_flag = False
    for element in html_soup.find_all(['dd', 'dt']):
        if element.string == '《神墓》正文卷':
            body_flag = True
        # Every <dd> after that <dt> is a chapter, in reading order
        if body_flag is True and element.name == 'dd':
            chapter_name = element.string
            chapter_url = "http://www.biqukan.com" + element.a.get('href')
            print(" {} link: {}".format(chapter_name, chapter_url))

Output:
[screenshot]
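
The flag-based scan can also be exercised offline. The sketch below assumes an index shaped like the one described above; `SAMPLE_INDEX`, the English chapter titles, and every href except the first chapter's are made up for illustration. It uses `urllib.parse.urljoin` instead of string concatenation, a swapped-in choice that also copes with relative hrefs, and `html.parser` instead of lxml.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical index page: a "latest chapters" block, then the main volume
SAMPLE_INDEX = """
<div class="listmain">
  <dl>
    <dt>Latest chapters</dt>
    <dd><a href="/3_3039/999.html">Chapter 99</a></dd>
    <dt>《神墓》正文卷</dt>
    <dd><a href="/3_3039/1351331.html">Chapter 1</a></dd>
    <dd><a href="/3_3039/1351332.html">Chapter 2</a></dd>
  </dl>
</div>
"""

def list_chapters(html, index_url, marker):
    """Return (title, absolute_url) for every <dd> after the marker <dt>."""
    soup = BeautifulSoup(html, 'html.parser')
    chapters = []
    in_body = False
    # find_all with a list of names yields matching tags in document order
    for element in soup.find_all(['dt', 'dd']):
        if element.name == 'dt' and element.get_text(strip=True) == marker:
            in_body = True
        elif in_body and element.name == 'dd' and element.a is not None:
            title = element.get_text(strip=True)
            chapters.append((title, urljoin(index_url, element.a['href'])))
    return chapters

chapters = list_chapters(SAMPLE_INDEX, 'http://www.biqukan.com/3_3039/', '《神墓》正文卷')
for title, url in chapters:
    print(title, url)
```

The "latest chapters" entry is skipped because it appears before the marker, which is the reason for the flag in the first place.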

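4. Combine steps 3 and 2, replacing the printing in step 2 with writing to a file
Step 3 yields each chapter's URL, and step 2 fetches the body text for a given URL. Putting the two together, each chapter is appended to the file in turn.
The code is as follows: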

from urllib import request
from bs4 import BeautifulSoup

def download_specified_chapter(chapter_url, header, coding, chapter_name=None):
    # Build a Request object from the URL and headers
    download_req = request.Request(chapter_url, headers=header)
    # Open the request with urlopen and get the response
    response = request.urlopen(download_req)
    # Read the page and decode it, ignoring bytes that cannot be decoded
    download_html = response.read().decode(coding, 'ignore')
    # Parse the HTML with Beautiful Soup
    origin_soup = BeautifulSoup(download_html, 'lxml')
    # Locate the body text of the chapter
    content = origin_soup.find(id='content', class_='showtxt')

    # Tidy the layout: replace every run of eight \xa0 with a newline
    # (&nbsp; in the HTML becomes \xa0 once the text is extracted)
    txt = content.text.replace('\xa0' * 8, '\n')
    # Append the body text to the txt file
    print("Downloading {} link: {}".format(chapter_name, chapter_url))
    # An explicit encoding keeps the output independent of the platform default
    with open('《神墓》.txt', 'a', encoding='utf-8') as f:
        if chapter_name is None:
            f.write('\n')
        else:
            f.write('\n' + chapter_name + '\n')
        f.write(txt)

if __name__ == "__main__":
    index_url = "http://www.biqukan.com/3_3039/"
    header={
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/'
                      '535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'
    }
    # Build the request from the URL and headers
    url_req = request.Request(index_url, headers=header)
    # Open the URL and get the response
    response = request.urlopen(url_req)
    # Read the response and decode it as GBK to get the HTML
    html = response.read().decode('gbk', 'ignore')
    # Parse the HTML with Beautiful Soup
    html_soup = BeautifulSoup(html, 'lxml')
    # index = BeautifulSoup(str(html_soup.find_all('div', class_='listmain')),'lxml')
    # print(html_soup.find_all(['dd', 'dt']))
    # Flag: has the <dt> for 《神墓》正文卷 been seen yet?
    body_flag = False
    for element in html_soup.find_all(['dd', 'dt']):
        if element.string == '《神墓》正文卷':
            body_flag = True
        # Every <dd> after that <dt> is a chapter, in reading order
        if body_flag is True and element.name == 'dd':
            chapter_name = element.string
            chapter_url = "http://www.biqukan.com" + element.a.get('href')
            download_specified_chapter(chapter_url, header, 'gbk', chapter_name)

Result:
[screenshot]

The txt file:
[screenshot]

And that's it ^_^
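
The file-writing part can be tested without downloading anything. A minimal sketch, assuming the same layout as the function above (blank line, chapter title, body): the helper name `append_chapter`, the file name `novel.txt`, and the sample text are made up for illustration, and a temporary directory keeps the experiment away from real files.

```python
import os
import tempfile

def append_chapter(path, text, chapter_name=None):
    """Append one chapter to `path` as UTF-8, with the title on its own line."""
    with open(path, 'a', encoding='utf-8') as f:
        if chapter_name is None:
            f.write('\n')
        else:
            f.write('\n' + chapter_name + '\n')
        f.write(text)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'novel.txt')
    append_chapter(out_path, 'First body.', 'Chapter 1')
    append_chapter(out_path, 'Second body.', 'Chapter 2')
    # Read the file back to confirm that append mode kept both chapters
    with open(out_path, encoding='utf-8') as f:
        saved = f.read()

print(repr(saved))
```

Because the file is opened in append mode, running the full crawler twice would write every chapter twice; deleting the old file first avoids that.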
