Web Scraping in Practice: Scraping a Free Novel

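1. A hands-on scraping project: scraping a novel. Only the free chapters can be scraped (VIP chapters require a paid account and a login; the method is different and will be covered in a later post).

  This tutorial is for learning purposes only. If it infringes on anything, please leave a comment to get in touch.

  Target site: Qidian (起點中文網), the free chapters of Daomu Biji (盜墓筆記)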

  https://book.qidian.com/info/68223#Catalog

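2. Page structure analysis

Inspecting the page shows that each volume heading sits inside a div element. Whether the volume is free is indicated by the class attribute of a span element that is a grandchild of that div (class='free' or class='vip').

So to extract the free chapters, we first need to check the span element.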
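The check described above can be tried on a small piece of markup before touching the live site. The sketch below is only an illustration: `SAMPLE_HTML`, the `example.com` links, and the helper name `free_chapter_links` are made up for this example, and the real page may nest its elements differently. It assumes each span is a grandchild of its `div.volume`, as described in the analysis.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the structure described above.
SAMPLE_HTML = """
<div class="volume">
  <h3><span class="free">Free</span> Volume 1</h3>
  <ul>
    <li><a href="//example.com/chapter/1">Chapter 1</a></li>
    <li><a href="//example.com/chapter/2">Chapter 2</a></li>
  </ul>
</div>
<div class="volume">
  <h3><span class="vip">VIP</span> Volume 2</h3>
  <ul>
    <li><a href="//example.com/chapter/3">Chapter 3</a></li>
  </ul>
</div>
"""


def free_chapter_links(html):
    """Return (title, href) pairs for chapters in volumes marked free."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # Only spans with class "free" are selected, so VIP volumes are skipped.
    for span in soup.select('div.volume span.free'):
        # The span is a grandchild of div.volume: span -> h3 -> div.volume.
        volume = span.parent.parent
        for a in volume.select('li a'):
            links.append((a.text.strip(), a['href']))
    return links


print(free_chapter_links(SAMPLE_HTML))
```

Only the two chapters of the free volume are returned; the chapter under the `vip` span never appears because its volume is not matched by the selector.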

3. Complete code

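#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''Scrape the free chapters of the novel Daomu Biji (盜墓筆記).'''
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}


class Story(object):
    def __init__(self, url):
        self.url = url

    def get_html(self, url):
        # Return the page text, or None if the request fails.
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.text
            return None
        except Exception as e:
            print('wrong', e)
            return None

    def get_soup(self, html):
        # Parse with the built-in HTML parser; fall back to the XML parser on error.
        try:
            soup = BeautifulSoup(html, 'html.parser')
        except Exception:
            soup = BeautifulSoup(html, 'xml')
        return soup

    def start(self):
        html = self.get_html(self.url)
        if html is None:
            print('Could not fetch the catalog page')
            return
        soup = self.get_soup(html)
        try:
            # Select only the volumes whose span is marked as free.
            free_result = soup.select('div.volume span.free')
            for free in free_result:
                # The span is a grandchild of div.volume, so go up two levels
                # to reach the div, then collect the chapter links inside it.
                chapters = free.parent.parent.select('li a')
                for chapter in chapters:
                    title = chapter.text.strip().replace(' ', '_')
                    href = 'https:' + chapter['href']
                    chapter_html = self.get_html(href)
                    if chapter_html is None:
                        continue  # skip chapters that failed to download
                    chapter_soup = self.get_soup(chapter_html)
                    content = chapter_soup.select('div.read-content')[0].text.strip().replace('\u3000', ' ')
                    print('\033[1;34mStart scraping: {title}\033[0m'.format(title=title))
                    # Write as UTF-8 so Chinese text is saved correctly on any platform.
                    with open(title + '.txt', 'w', encoding='utf-8') as fw:
                        fw.write(content)
        except Exception as e:
            # Report the error instead of silently ignoring it.
            print('wrong', e)


if __name__ == '__main__':
    url = 'https://book.qidian.com/info/68223#Catalog'
    gg = Story(url)
    gg.start()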
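The script builds each chapter URL by prepending 'https:' to the href, which suggests the links on the page are protocol-relative (they start with '//'). A slightly more general alternative, shown here only as a sketch, is `urljoin` from the standard library, which also handles absolute paths and full URLs. The helper name `absolute_url` and the `example.com` links are hypothetical.

```python
from urllib.parse import urljoin

# The catalog page from this tutorial, used as the base for relative links.
BASE_URL = 'https://book.qidian.com/info/68223'


def absolute_url(href, base=BASE_URL):
    """Resolve an href found on the page against the page's own URL."""
    return urljoin(base, href)


# Protocol-relative link: the scheme is taken from the base URL.
print(absolute_url('//example.com/chapter/1'))        # https://example.com/chapter/1
# Absolute path: the host is taken from the base URL.
print(absolute_url('/chapter/1'))                     # https://book.qidian.com/chapter/1
# Full URL: returned unchanged.
print(absolute_url('https://example.com/chapter/2'))  # https://example.com/chapter/2
```

With this helper, `href = 'https:' + chapter['href']` could be written as `href = absolute_url(chapter['href'])`, so the script would keep working if the site switched to another link style.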