A complete tutorial on scraping a novel site with Python (full source code included)
阿新 • Published: 2020-12-25
This tutorial scrapes data from a website with Python, using the Scrapy spider framework.
1. Install the Scrapy framework
Run on the command line:
pip install scrapy
If Scrapy's dependencies conflict with other Python packages you already have installed, installing inside a virtualenv is recommended.
Once installation finishes, go to any folder and create a crawler project:
scrapy startproject your_spider_name
Project layout
The crawling rules go in the spiders directory.
items.py - defines the fields to scrape
pipelines.py - handles data storage
settings.py - project configuration
middlewares.py - downloader and spider middleware
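As a sketch, running the startproject command (here with the assumed project name BookSpider) generates roughly the layout described above:

```shell
scrapy startproject BookSpider
# Generated layout (BookSpider is an assumed project name):
# BookSpider/
#     scrapy.cfg            # deploy configuration
#     BookSpider/
#         __init__.py
#         items.py          # item field definitions
#         middlewares.py    # downloader/spider middleware
#         pipelines.py      # item storage
#         settings.py       # project configuration
#         spiders/          # spider code goes here
```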
Below is the source code for scraping a novel site.
First, define the data to collect in items.py:
# author 小白 <QQ group: 810735403>
import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here:
    i = scrapy.Field()                  # chapter order index
    book_name = scrapy.Field()
    book_img = scrapy.Field()
    book_author = scrapy.Field()
    book_last_chapter = scrapy.Field()
    book_last_time = scrapy.Field()
    book_list_name = scrapy.Field()     # chapter title
    book_content = scrapy.Field()
Write the crawling rules:
# author 小白 <QQ group: 810735403>
import scrapy

from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = [
        'http://www.xbiquge.la/xiaoshuodaquan/'
    ]

    def parse(self, response):
        # each <li> in the first novel list links to one book's index page
        book_all_list = response.css('.novellist:first-child>ul>li')
        for book in book_all_list:
            book_url = book.css('a::attr(href)').extract_first()
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        book_name = response.css('#info>h1::text').extract_first()
        book_img = response.css('#fmimg>img::attr(src)').extract_first()
        book_author = response.css('#info p:nth-child(2)::text').extract_first()
        book_last_chapter = response.css('#info p:last-child::text').extract_first()
        book_last_time = response.css('#info p:nth-last-child(2)::text').extract_first()
        book_info = {
            'book_name': book_name,
            'book_img': book_img,
            'book_author': book_author,
            'book_last_chapter': book_last_chapter,
            'book_last_time': book_last_time,
        }
        chapter_links = response.css('#list>dl>dd>a::attr(href)').extract()
        for i, link in enumerate(chapter_links, start=1):
            # record the chapter's position so files can be saved in reading
            # order; pass a copy of the dict so each Request carries its own 'i'
            yield scrapy.Request('http://www.xbiquge.la' + link,
                                 meta={**book_info, 'i': i},
                                 callback=self.parse_chapter)

    def parse_chapter(self, response):
        self.log(response.meta['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        item['i'] = response.meta['i']
        item['book_name'] = response.meta['book_name']
        item['book_img'] = response.meta['book_img']
        item['book_author'] = response.meta['book_author']
        item['book_last_chapter'] = response.meta['book_last_chapter']
        item['book_last_time'] = response.meta['book_last_time']
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
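One pitfall worth noting in the book-page callback: Scrapy's Request stores the meta dict by reference, so mutating a single shared dict inside the loop would leave every pending request seeing only the last value of 'i'. Passing a fresh copy per request avoids this. A minimal stdlib sketch of the difference, with plain dicts standing in for Request.meta:

```python
# Plain dicts stand in for Request.meta to show the sharing pitfall.
book_info = {'book_name': 'demo'}

# Buggy pattern: every "request" keeps a reference to the same dict.
shared = []
for i, link in enumerate(['/ch1', '/ch2', '/ch3'], start=1):
    book_info['i'] = i              # mutates the one shared dict
    shared.append(book_info)

# Correct pattern: each "request" gets its own copy with its own 'i'.
copied = []
for i, link in enumerate(['/ch1', '/ch2', '/ch3'], start=1):
    copied.append({**book_info, 'i': i})

print([m['i'] for m in shared])     # [3, 3, 3] -- all see the final index
print([m['i'] for m in copied])     # [1, 2, 3] -- order preserved
```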
Save the data:
import os


class BookspiderPipeline(object):
    def process_item(self, item, spider):
        cur_path = 'E:/小說/'            # base directory for saved novels
        target_path = cur_path + str(item['book_name'])
        if not os.path.exists(target_path):
            os.makedirs(target_path)
        # prefix the chapter title with its index so files keep reading order
        book_list_name = str(item['i']) + item['book_list_name']
        filename_path = target_path + '/' + book_list_name + '.txt'
        print('------------')
        print(filename_path)
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
Run:
scrapy crawl BookSpider
and the novel scraper is complete.
While developing, it is recommended to run
scrapy shell <url to scrape>
and then call response.css('...') interactively
to test whether your selectors are correct.
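A sketch of such a shell session, reusing selectors from the spider above (the exact results depend on the live page):

```shell
scrapy shell 'http://www.xbiquge.la/xiaoshuodaquan/'
# then, inside the interactive shell:
#   response.css('.novellist:first-child>ul>li a::attr(href)').extract_first()
#   response.css('#info>h1::text').extract_first()
# an empty result means the selector does not match and needs adjusting
```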