A complete tutorial on scraping a novel site with Python (full source code included)
阿新 • Published: 2020-12-25
This tutorial scrapes data from a website with Python, using the Scrapy spider framework.
1. Install the Scrapy framework
Run on the command line:
pip install scrapy
If Scrapy's dependencies conflict with other Python packages you already have installed, installing inside a virtualenv is recommended.
Once installation finishes, go to any folder and create a crawler project:
scrapy startproject your_spider_name
Project layout
The crawling rules go in the spiders directory.
items.py - defines the fields to scrape
pipelines.py - handles data storage
settings.py - project configuration
middlewares.py - downloader and spider middleware
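As a sketch, running the startproject command (here with the assumed project name BookSpider) generates roughly the layout described above:

```shell
scrapy startproject BookSpider
# Generated layout (BookSpider is an assumed project name):
# BookSpider/
#     scrapy.cfg            # deploy configuration
#     BookSpider/
#         __init__.py
#         items.py          # item field definitions
#         middlewares.py    # downloader/spider middleware
#         pipelines.py      # item storage
#         settings.py       # project configuration
#         spiders/          # spider code goes here
```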
Below is the source code for scraping a novel site.
First, define the data to collect in items.py:
# author 小白 <QQ group: 810735403>
import scrapy


class BookspiderItem(scrapy.Item):
    # define the fields for your item here:
    i = scrapy.Field()                  # chapter order index
    book_name = scrapy.Field()
    book_img = scrapy.Field()
    book_author = scrapy.Field()
    book_last_chapter = scrapy.Field()
    book_last_time = scrapy.Field()
    book_list_name = scrapy.Field()     # chapter title
    book_content = scrapy.Field()
Write the crawling rules:
# author 小白 <QQ group: 810735403>
import scrapy

from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = [
        'http://www.xbiquge.la/xiaoshuodaquan/'
    ]

    def parse(self, response):
        # each <li> in the first novel list links to one book's index page
        book_all_list = response.css('.novellist:first-child>ul>li')
        for book in book_all_list:
            book_url = book.css('a::attr(href)').extract_first()
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        book_name = response.css('#info>h1::text').extract_first()
        book_img = response.css('#fmimg>img::attr(src)').extract_first()
        book_author = response.css('#info p:nth-child(2)::text').extract_first()
        book_last_chapter = response.css('#info p:last-child::text').extract_first()
        book_last_time = response.css('#info p:nth-last-child(2)::text').extract_first()
        book_info = {
            'book_name': book_name,
            'book_img': book_img,
            'book_author': book_author,
            'book_last_chapter': book_last_chapter,
            'book_last_time': book_last_time,
        }
        chapter_links = response.css('#list>dl>dd>a::attr(href)').extract()
        for i, link in enumerate(chapter_links, start=1):
            # record the chapter's position so files can be saved in reading
            # order; pass a copy of the dict so each Request carries its own 'i'
            yield scrapy.Request('http://www.xbiquge.la' + link,
                                 meta={**book_info, 'i': i},
                                 callback=self.parse_chapter)

    def parse_chapter(self, response):
        self.log(response.meta['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        item['i'] = response.meta['i']
        item['book_name'] = response.meta['book_name']
        item['book_img'] = response.meta['book_img']
        item['book_author'] = response.meta['book_author']
        item['book_last_chapter'] = response.meta['book_last_chapter']
        item['book_last_time'] = response.meta['book_last_time']
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
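One pitfall worth noting in the book-page callback: Scrapy's Request stores the meta dict by reference, so mutating a single shared dict inside the loop would leave every pending request seeing only the last value of 'i'. Passing a fresh copy per request avoids this. A minimal stdlib sketch of the difference, with plain dicts standing in for Request.meta:

```python
# Plain dicts stand in for Request.meta to show the sharing pitfall.
book_info = {'book_name': 'demo'}

# Buggy pattern: every "request" keeps a reference to the same dict.
shared = []
for i, link in enumerate(['/ch1', '/ch2', '/ch3'], start=1):
    book_info['i'] = i              # mutates the one shared dict
    shared.append(book_info)

# Correct pattern: each "request" gets its own copy with its own 'i'.
copied = []
for i, link in enumerate(['/ch1', '/ch2', '/ch3'], start=1):
    copied.append({**book_info, 'i': i})

print([m['i'] for m in shared])     # [3, 3, 3] -- all see the final index
print([m['i'] for m in copied])     # [1, 2, 3] -- order preserved
```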
Save the data:
import os


class BookspiderPipeline(object):
    def process_item(self, item, spider):
        cur_path = 'E:/小說/'            # base directory for saved novels
        target_path = cur_path + str(item['book_name'])
        if not os.path.exists(target_path):
            os.makedirs(target_path)
        # prefix the chapter title with its index so files keep reading order
        book_list_name = str(item['i']) + item['book_list_name']
        filename_path = target_path + '/' + book_list_name + '.txt'
        print('------------')
        print(filename_path)
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
Run:
scrapy crawl BookSpider
and the novel scraper is complete.
While developing, it is recommended to run
scrapy shell <url to scrape>
and then call response.css('...') interactively
to test whether your selectors are correct.
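A sketch of such a shell session, reusing selectors from the spider above (the exact results depend on the live page):

```shell
scrapy shell 'http://www.xbiquge.la/xiaoshuodaquan/'
# then, inside the interactive shell:
#   response.css('.novellist:first-child>ul>li a::attr(href)').extract_first()
#   response.css('#info>h1::text').extract_first()
# an empty result means the selector does not match and needs adjusting
```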