A Simple Crawler for Scraping Novels from Qidian (起點中文網) (For Learning Only)

阿新 • Published: 2021-01-30

Tags: python html crawler

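Preface

I taught myself VBA during my internship, and I am now picking up the Python I learned in class again. This post records my learning process.

This article is for learning purposes only; please do not use it commercially.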

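I. Crawler Approach

Pages that do not require a login need only a simple crawler: fetch the novel's catalog (table of contents), then use the catalog to fetch the text of each chapter.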

II. Steps

1. Import the libraries

The code is as follows (example):

import requests,sys
from bs4 import BeautifulSoup

2. Read the page

The code is as follows (example):

target = 'https://book.qidian.com/info/1024995653#Catalog' 

req = requests.get(url=target)

To catch failed requests and to avoid garbled text, add the following two lines, respectively:

req.raise_for_status()
req.encoding = req.apparent_encoding

The page's HTML is now available:

html = req.text
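
The three steps above (request, status check, encoding fix) can be bundled into one small helper. This is only a sketch: the name fetch_html and the timeout parameter are assumptions added for illustration and do not appear in the original code. The helper takes any session-like object, so it can be tried without network access.

```python
def fetch_html(session, url, timeout=10):
    """Fetch a page and return its decoded HTML.

    `session` is expected to behave like requests.Session: its get()
    must return an object with raise_for_status(), apparent_encoding,
    encoding and text. The timeout is an addition for illustration.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()                       # fail early on 4xx/5xx
    response.encoding = response.apparent_encoding    # avoid garbled text
    return response.text

# Usage with requests (requires network access):
#   import requests
#   html = fetch_html(requests.Session(),
#                     "https://book.qidian.com/info/1024995653#Catalog")
```

Passing the session in, rather than calling requests.get directly, lets the catalog page and the chapter pages share one connection, which is also what the complete script does.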

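3. Analyze the HTML

In the HTML we need to find the chapter titles and links that make up the catalog, along with the tags that hold them:
[Figure: inspecting the HTML tag information with Inspect Element]
Press F12 on the novel's catalog page and look at the HTML. The catalog sits inside a div tag with class='catalog-content-wrap' and id='j-catalogWrap'. Digging further, the child tags volume-wrap and volume serve as containers for the catalog:

Keep following the tree down to the a tags that carry the links. That is the target, and the analysis is done.
[Figure: the smallest unit tag]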

bf = BeautifulSoup(html,"html.parser")
catalogDiv = bf.find('div',class_='catalog-content-wrap',id='j-catalogWrap')
volumeWrapDiv = catalogDiv.find('div',class_='volume-wrap')
volumeDivs = volumeWrapDiv.find_all('div',class_='volume')
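
The lookup chain above can be tried offline against a small hand-written snippet. The sketch below is not the real Qidian page: SAMPLE_CATALOG, the example.com links, and the helper name find_volume_divs are all made up for illustration. It also adds None checks, since find returns None when a tag is missing and the chained calls would otherwise raise AttributeError.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the catalog markup described above (assumption).
SAMPLE_CATALOG = """
<div class="catalog-content-wrap" id="j-catalogWrap">
  <div class="volume-wrap">
    <div class="volume">
      <ul>
        <li><a href="//read.example.com/chapter/1">Chapter 1</a></li>
        <li><a href="//read.example.com/chapter/2">Chapter 2</a></li>
      </ul>
    </div>
  </div>
</div>
"""

def find_volume_divs(html):
    """Return the list of volume divs, or [] if the expected tags are absent."""
    soup = BeautifulSoup(html, "html.parser")
    catalog = soup.find("div", class_="catalog-content-wrap", id="j-catalogWrap")
    if catalog is None:
        return []
    wrap = catalog.find("div", class_="volume-wrap")
    if wrap is None:
        return []
    return wrap.find_all("div", class_="volume")

volumes = find_volume_divs(SAMPLE_CATALOG)
print(len(volumes), "volume(s) found")
```

Returning an empty list keeps the later loop over volumes valid even when the page layout is not what the crawler expects.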

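4. Extract the information from the tags

Again using BeautifulSoup, take every a tag inside each volume and store its text and the corresponding href.

for volumeDiv in volumeDivs:
    aList = volumeDiv.find_all('a')
    for a in aList:
        chapterName = a.string
        chapterHref = a.get('href')

With that, the whole catalog has been collected, and the hrefs can now be used to fetch the chapter text.

5. Fetch the chapter text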

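Open any one of the links and look at the HTML of the chapter text:
[Figure: case 1]
[Figure: case 2]
The markup comes in two forms: in one, the text sits directly inside p tags; in the other, each p contains a span with class=content-wrap that holds the text.
In both cases, however, the text is always inside the div with class='read-content j_readContent', so we can locate that div directly: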

req = requests.get(url='https:' + chapterHref)  # the complete script prepends 'https:' to each catalog href
req.raise_for_status()
req.encoding = req.apparent_encoding
html = req.text
bf = BeautifulSoup(html,"html.parser")
mainTextWrapDiv = bf.find('div',class_='main-text-wrap')
readContentDiv = mainTextWrapDiv.find('div',class_='read-content j_readContent')
readContent = readContentDiv.find_all('span',class_='content-wrap')

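At this point we have the chapter text together with its tags. Because the tag layout differs from link to link, a condition tells the two cases apart: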

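textContent = ''  # initialize so that += in the else branch has a starting value
if readContent == []:
    # .text already drops the tags, so join the text nodes with line breaks
    textContent = readContentDiv.get_text('\r\n', strip=True)
else:
    for content in readContent:
        if not content.string:  # .string is None if the span is empty or has several children
            print('error format')
        else:
            textContent += content.string + '\r\n'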

The chapter text has now been retrieved.
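
The two layouts can also be handled by one small function and checked against hand-written samples. This is a sketch under assumptions: extract_chapter_text, PLAIN and WRAPPED are hypothetical names and markup, not taken from the real site. It also matches the div by the single class read-content rather than the full class string, which is a deliberate variation on the code above.

```python
from bs4 import BeautifulSoup

def extract_chapter_text(html):
    """Return the chapter text, one paragraph per line, for either layout."""
    soup = BeautifulSoup(html, "html.parser")
    content_div = soup.find("div", class_="read-content")
    if content_div is None:
        return ""
    # Prefer the content-wrap spans; fall back to the bare p tags.
    spans = content_div.find_all("span", class_="content-wrap")
    nodes = spans if spans else content_div.find_all("p")
    lines = [node.get_text(strip=True) for node in nodes]
    return "\r\n".join(line for line in lines if line)

# Hand-written samples of the two layouts (assumptions, not real pages).
PLAIN = ('<div class="read-content j_readContent">'
         '<p>First line.</p><p>Second line.</p></div>')
WRAPPED = ('<div class="read-content j_readContent">'
           '<p><span class="content-wrap">First line.</span></p>'
           '<p><span class="content-wrap">Second line.</span></p></div>')

print(extract_chapter_text(PLAIN) == extract_chapter_text(WRAPPED))
```

Using get_text instead of .string avoids the None result that .string gives for a span containing nested tags.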

Now all that is left is to loop over the chapters to get the whole novel!
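
As a rough sketch of that loop, the function below walks a list of (name, url) pairs, skips chapters that fail, and pauses between requests. The names download_all, fetch_text, write_chapter and the delay parameter are assumptions for illustration. The fetching and writing steps are passed in as functions so the loop can be tried without touching the network.

```python
import time

def download_all(chapters, fetch_text, write_chapter, delay=0.0):
    """Download every chapter in order and return the list of failures.

    chapters      -- iterable of (name, url) pairs
    fetch_text    -- callable taking a url and returning the chapter text
    write_chapter -- callable taking (name, text) and storing them
    delay         -- seconds to wait after each successful chapter
    """
    failed = []
    for name, url in chapters:
        try:
            text = fetch_text(url)
        except Exception as exc:   # broad on purpose: skip and keep going
            failed.append((name, str(exc)))
            continue
        write_chapter(name, text)
        if delay:
            time.sleep(delay)      # be polite to the server
    return failed
```

Collecting the failures instead of only printing a message makes it possible to retry just the chapters that were skipped.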

Summary

The complete code follows:

#!/usr/bin/env python3
# coding=utf-8
# author:sakuyo
#----------------------------------
import requests,sys
from bs4 import BeautifulSoup

class downloader(object):
    def __init__(self, target):  # initialize state
        self.target = target
        self.chapterNames = []
        self.chapterHrefs = []
        self.chapterNum = 0
        self.session = requests.Session()
    def GetChapterInfo(self):  # get the chapter names and links
        req = self.session.get(url=self.target)
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        html = req.text
        bf = BeautifulSoup(html,"html.parser")
        catalogDiv = bf.find('div',class_='catalog-content-wrap',id='j-catalogWrap')
        volumeWrapDiv = catalogDiv.find('div',class_='volume-wrap')
        volumeDivs = volumeWrapDiv.find_all('div',class_='volume')

        for volumeDiv in volumeDivs:
            aList = volumeDiv.find_all('a')
            for a in aList:
                chapterName = a.string
                chapterHref = a.get('href')
                self.chapterNames.append(chapterName)
                self.chapterHrefs.append('https:'+chapterHref)
            self.chapterNum += len(aList)
    def GetChapterContent(self, chapterHref):  # get the text of one chapter
        req = self.session.get(url=chapterHref)
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        html = req.text
        bf = BeautifulSoup(html, "html.parser")
        mainTextWrapDiv = bf.find('div', class_='main-text-wrap')
        readContentDiv = mainTextWrapDiv.find('div', class_='read-content j_readContent')
        readContent = readContentDiv.find_all('span', class_='content-wrap')
        textContent = ''  # initialize so that += below and the return always have a value
        if readContent == []:
            # Case 1: the text sits directly in p tags; .text already drops the tags,
            # so join the text nodes with line breaks instead of replacing '<p>'
            textContent = readContentDiv.get_text('\r\n', strip=True)
        else:
            # Case 2: the text is wrapped in span.content-wrap elements
            for content in readContent:
                if not content.string:  # .string is None if the span is empty or has several children
                    print('error format')
                else:
                    textContent += content.string + '\r\n'
        return textContent
    def writer(self, path, name='', content=''):
        with open(path, 'a', encoding='utf-8') as f:  # 'a' mode appends text to the end of an existing file
            if name is None:
                name = ''
            if content is None:
                content = ''
            f.write(name + '\r\n')
            f.write(content)
            f.write('\r\n')

if __name__ == '__main__':  # entry point
    target = 'https://book.qidian.com/info/1024995653#Catalog'
    dlObj = downloader(target)
    dlObj.GetChapterInfo()
    print('Starting download:')
    for i in range(dlObj.chapterNum):
        try:
            dlObj.writer('test.txt', dlObj.chapterNames[i], dlObj.GetChapterContent(dlObj.chapterHrefs[i]))
        except Exception:
            print('Download failed, skipped')
        # Report progress as a percentage of the chapters processed so far
        sys.stdout.write("  Downloaded: %.3f%%" % (100 * (i + 1) / dlObj.chapterNum) + '\r')
        sys.stdout.flush()
    print('Download complete')
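
The script builds chapter URLs by prepending 'https:' to each href, which works as long as every href is protocol-relative (starts with '//'). A slightly more forgiving alternative, shown here only as a sketch, is urljoin from the standard library, which resolves protocol-relative, root-relative, and already absolute links against the catalog URL. The helper name absolute_url and the example.com links are made up for illustration.

```python
from urllib.parse import urljoin

CATALOG_URL = "https://book.qidian.com/info/1024995653"

def absolute_url(href, base=CATALOG_URL):
    """Resolve an href found on the catalog page into an absolute URL."""
    return urljoin(base, href)

# Protocol-relative links inherit the scheme of the base URL.
print(absolute_url("//read.example.com/chapter/abc"))
# Links that are already absolute are returned unchanged.
print(absolute_url("https://read.example.com/chapter/abc"))
```

In GetChapterInfo this would replace the 'https:'+chapterHref expression with urljoin(self.target, chapterHref).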
