Novel crawler: using https://www.tadu.com/book/catalogue/891710/ (a novel's table of contents) as an example

阿新 • Published: 2022-03-25

Requirements analysis

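Target site: https://www.tadu.com/book/catalogue/891710/
Target content: the body text of the novel 《沒有爸爸也能活》, from Chapter 1 through 第十三一章 (sic).
Task: write two crawlers. Crawler 1 reads https://www.tadu.com/book/catalogue/891710/, collects the URLs of those chapters, and adds them to a Redis list named url_queue. Crawler 2 reads the URLs back from url_queue, visits each one to scrape the chapter content, and stores that content in MongoDB.
Steps: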

1. Use a regular expression or XPath to get each chapter's URL and add it to Redis.

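def get_source(self, url, headers):  # fetch the page source of url
    # headers must be passed by keyword; the second positional argument of requests.get is params
    return requests.get(url, headers=headers).content.decode()

def url2redis(self):  # store the chapter URLs found in the catalogue page in Redis
    source = self.get_source(self.url, self.HEADERS)
    selector = lxml.html.fromstring(source)  # build the element tree
    url_lst = selector.xpath('//div[@class="chapter clearfix"]/a/@href')
    for url in url_lst:
        url = 'https://www.tadu.com' + url.strip()
        # rpush + lpop gives first-in, first-out order, so chapters are read in catalogue order
        self.client1.rpush('url_queue', url)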
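Step 1 names regular expressions as an alternative to XPath. The block below is a minimal sketch of that route using only the standard library. The sample HTML and the chapter paths in it are hypothetical, shaped to match the XPath used above; the real catalogue markup may differ, and a regex like this breaks if the chapter div contains nested divs.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.tadu.com"

# Hypothetical sample of the catalogue markup (assumption, not copied from the site).
SAMPLE_HTML = """
<div class="chapter clearfix">
  <a href=" /book/891710/1/ ">Chapter 1</a>
  <a href="/book/891710/2/">Chapter 2</a>
</div>
"""

# Capture everything inside the chapter-list div, then every href inside it.
CHAPTER_BLOCK = re.compile(r'<div class="chapter clearfix">(.*?)</div>', re.S)
HREF = re.compile(r'<a\s[^>]*?href="([^"]*)"')


def extract_chapter_urls(html, base_url=BASE_URL):
    """Return absolute chapter URLs found inside the chapter-list div."""
    urls = []
    for block in CHAPTER_BLOCK.findall(html):
        for href in HREF.findall(block):
            # strip() mirrors the url.strip() call in url2redis
            urls.append(urljoin(base_url, href.strip()))
    return urls


if __name__ == "__main__":
    for url in extract_chapter_urls(SAMPLE_HTML):
        print(url)
```

urljoin is used instead of string concatenation so that an href that is already absolute would be left alone.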

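2. Next, pop the chapter URLs from url_queue one at a time. Because the chapter content is loaded asynchronously, the first attempt used selenium: webdriver drives a Chrome browser, which runs the page's JS to receive the data, and the rendered page source is then read back.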

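from selenium.webdriver.chrome.service import Service  # wraps the chromedriver path (Selenium 4)

def wait(url):  # return the page source after the chapter content has rendered
    # let Chrome, driven by chromedriver.exe, execute the JS in the page
    driver = webdriver.Chrome(service=Service(r'.\chromedriver.exe'))
    try:
        driver.get(url)  # open the page
        try:
            # wait up to 30 seconds for the chapter container to appear
            WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME, "read_details")))
        except Exception:
            print('The page is loading too slowly.')
        return driver.page_source  # return the rendered page source
    finally:
        driver.quit()  # close this Chrome instance so browsers do not pile up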
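The WebDriverWait call above is an explicit wait: it keeps re-checking a condition until the condition holds or the timeout expires. The block below is a plain-Python sketch of that polling idea, not Selenium's actual implementation; wait_until and its parameters are names made up for this illustration.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


if __name__ == "__main__":
    started = time.monotonic()
    # Hypothetical condition: becomes true about 0.2 seconds after start.
    ready = wait_until(lambda: time.monotonic() - started > 0.2,
                       timeout=2.0, poll_interval=0.05)
    print("ready:", ready)
```

In the crawler the condition is "an element with class read_details is present", and the timeout case is what the except branch reports as a slow page.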

3. Finally, use XPath on each chapter's page source to get the chapter title and text, append them to a list of dicts, and insert that list into MongoDB.

def article2mongodb(self):  # store the content behind each URL in MongoDB
    while self.client1.llen('url_queue') > 0:
        url = self.client1.lpop('url_queue').decode()  # Redis returns bytes
        html = wait(url)
        selector = lxml.html.fromstring(html)
        chapter_name = selector.xpath('//div[@class="clearfix"]/h4/text()')[0]
        content = selector.xpath('//div[@id="partContent"]/p/text()')
        self.content_lst.append({'title': chapter_name, 'content': content})
    if self.content_lst:  # insert_many rejects an empty list
        self.handler.insert_many(self.content_lst)
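The order in which chapters come out of url_queue depends on which end of the Redis list they were pushed to: LPUSH followed by LPOP behaves like a stack (last in, first out), while RPUSH followed by LPOP behaves like a queue (first in, first out). The sketch below shows the difference with collections.deque standing in for a Redis list; FakeRedisList is a hypothetical helper for illustration, not part of the redis package.

```python
from collections import deque


class FakeRedisList:
    """In-memory stand-in for a single Redis list (illustration only)."""

    def __init__(self):
        self._items = deque()

    def lpush(self, value):
        self._items.appendleft(value)  # LPUSH inserts at the head

    def rpush(self, value):
        self._items.append(value)  # RPUSH appends at the tail

    def lpop(self):
        # LPOP removes from the head; Redis returns nil (None) when empty
        return self._items.popleft() if self._items else None

    def llen(self):
        return len(self._items)


def drain(queue):
    """Pop from the head until the list is empty, as article2mongodb does."""
    popped = []
    while queue.llen() > 0:
        popped.append(queue.lpop())
    return popped


if __name__ == "__main__":
    chapters = ["ch1", "ch2", "ch3"]

    stack_like = FakeRedisList()
    for chapter in chapters:
        stack_like.lpush(chapter)
    print(drain(stack_like))  # ['ch3', 'ch2', 'ch1']

    queue_like = FakeRedisList()
    for chapter in chapters:
        queue_like.rpush(chapter)
    print(drain(queue_like))  # ['ch1', 'ch2', 'ch3']
```

Since the documents are inserted into MongoDB in pop order, the push side decides whether the stored chapters follow the catalogue order or its reverse.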

5. Source code:

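# -*- coding: utf-8 -*-
# @Time :2022/3/24 17:21
# @Author:ck
# @File :get_article.py
# @Software: PyCharm
from selenium import webdriver  # browser driver
from selenium.webdriver.chrome.service import Service  # wraps the chromedriver path (Selenium 4)
from selenium.webdriver.support.ui import WebDriverWait  # explicit waits
from selenium.webdriver.common.by import By  # element locator strategies
from selenium.webdriver.support import expected_conditions as EC  # wait conditions
import requests  # HTTP requests
import lxml.html  # HTML parsing
import redis  # Redis client
from pymongo import MongoClient  # MongoDB client

def wait(url):  # return the page source after the chapter content has rendered
    # let Chrome, driven by chromedriver.exe, execute the JS in the page
    driver = webdriver.Chrome(service=Service(r'.\chromedriver.exe'))
    try:
        driver.get(url)  # open the page
        try:
            # wait up to 30 seconds for the chapter container to appear
            WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME, "read_details")))
        except Exception:
            print('The page is loading too slowly.')
        return driver.page_source  # return the rendered page source
    finally:
        driver.quit()  # close this Chrome instance so browsers do not pile up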

class get_article(object):  # fetches the chapter content
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    }  # request headers
    def __init__(self, url):
        self.url = url
        self.content_lst = []
        self.client1 = redis.StrictRedis()  # connect to Redis
        self.handler = db['article']  # MongoDB collection; db is the module-level name set in the __main__ block
        self.url2redis()
        self.article2mongodb()

    def get_source(self, url, headers):  # fetch the page source of url
        # headers must be passed by keyword; the second positional argument of requests.get is params
        return requests.get(url, headers=headers).content.decode()

    def url2redis(self):  # store the chapter URLs found in the catalogue page in Redis
        source = self.get_source(self.url, self.HEADERS)
        selector = lxml.html.fromstring(source)  # build the element tree
        url_lst = selector.xpath('//div[@class="chapter clearfix"]/a/@href')
        for url in url_lst:
            url = 'https://www.tadu.com' + url.strip()
            # rpush + lpop gives first-in, first-out order, so chapters are read in catalogue order
            self.client1.rpush('url_queue', url)

    def article2mongodb(self):  # store the content behind each URL in MongoDB
        while self.client1.llen('url_queue') > 0:
            url = self.client1.lpop('url_queue').decode()  # Redis returns bytes
            html = wait(url)
            selector = lxml.html.fromstring(html)
            chapter_name = selector.xpath('//div[@class="clearfix"]/h4/text()')[0]
            content = selector.xpath('//div[@id="partContent"]/p/text()')
            self.content_lst.append({'title': chapter_name, 'content': content})
        if self.content_lst:  # insert_many rejects an empty list
            self.handler.insert_many(self.content_lst)
if __name__ == '__main__':
    client0 = MongoClient()
    db = client0['spicer']  # select the database (MongoDB creates it on first write)
    article = get_article('https://www.tadu.com/book/catalogue/891710')

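6. In practice, though, every chapter fetched this way launches Chrome once, which uses far too much memory, so a different approach is needed. Because the content is loaded asynchronously, the data file for each chapter can be found in the browser's developer tools. By examining each data file's content, URL, and request method, the chapters can be scraped with a plain request-based crawler. This is only a suggested approach; the implementation is left to the reader.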
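The post does not show the data files, so the following is only a rough sketch of the parsing half of that idea under loud assumptions: the payload is taken to be JSON with hypothetical keys "title" and "content", where "content" is an HTML fragment made of p paragraphs. The real format, keys, and URL have to be read from the Network panel of the developer tools and may look quite different.

```python
import json
import re
from html import unescape


def parse_chapter_payload(payload):
    """Turn one chapter data payload into {'title': ..., 'content': [...]}.

    ASSUMPTION: payload is JSON with hypothetical keys "title" and "content",
    and "content" is an HTML fragment made of <p> paragraphs.
    """
    data = json.loads(payload)
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", data["content"], re.S)
    # Drop any inline tags, decode entities such as &amp;, and trim whitespace.
    cleaned = [unescape(re.sub(r"<[^>]+>", "", p)).strip() for p in paragraphs]
    return {"title": data["title"].strip(),
            "content": [p for p in cleaned if p]}


def crawl_chapter(data_url, fetch):
    """Fetch one data file with `fetch` (any callable: url -> text) and parse it."""
    return parse_chapter_payload(fetch(data_url))


if __name__ == "__main__":
    # Hypothetical payload and URL, used only to exercise the parser.
    sample = json.dumps({
        "title": " Chapter 1 ",
        "content": "<p>First line.</p><p>Tom &amp; Ann</p>",
    })
    print(crawl_chapter("https://example.invalid/chapter-1-data",
                        lambda url: sample))
```

With the real data URL in hand, fetch could be something like lambda url: requests.get(url, headers=HEADERS).text, which removes the need to start Chrome for each chapter. The result has the same shape as the dicts that article2mongodb builds.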
