Scrapy CrawlSpider: scraping novels from Biquge (筆趣閣)
Preface
This is my first post on the blog, so please forgive the rough formatting.
I've been watching some web-scraping tutorial videos lately and got inspired. Back in university I read novels on pirated sites and figured they must make good money, so I thought I'd build one myself. This seemed like a good excuse to scrape some novels and see whether I could turn them into a site (I don't actually know how to build a website...).
The site has no complete index of all its novels, so the only option is to crawl the entire site with a CrawlSpider.
That said, once it was done I found the Scrapy version isn't much faster than a multi-threaded requests crawler, and saving is awkward: because Scrapy crawls asynchronously, it's hard to save each book to a local txt file in order, so I just store everything in MongoDB (facepalm).
Here is the main spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from biquge5200.items import Biquge5200Item


class BqgSpider(CrawlSpider):
    name = 'bqg'
    allowed_domains = ['bqg5200.com']
    start_urls = ['https://www.bqg5200.com/']

    rules = (
        Rule(LinkExtractor(allow=r'https://www.bqg5200.com/book/\d+/'), follow=True),
        Rule(LinkExtractor(allow=r'https://www.bqg5200.com/xiaoshuo/\d+/\d+/'), follow=False),
        Rule(LinkExtractor(allow=r'https://www.bqg5200.com/xiaoshuo/\d+/\d+/\d+/'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        name = response.xpath('//div[@id="smallcons"][1]/h1/text()').get()
        zuozhe = response.xpath('//div[@id="smallcons"][1]/span[1]/text()').get()
        fenlei = response.xpath('//div[@id="smallcons"][1]/span[2]/a/text()').get()

        content_list = response.xpath('//div[@id="readerlist"]/ul/li')
        for li in content_list:
            book_list_url = li.xpath('./a/@href').get()
            book_list_url = response.urljoin(book_list_url)

            yield scrapy.Request(book_list_url,
                                 callback=self.book_content,
                                 meta={'info': (name, zuozhe, fenlei)})

    def book_content(self, response):
        name, zuozhe, fenlei = response.meta.get('info')
        item = Biquge5200Item(name=name, zuozhe=zuozhe, fenlei=fenlei)
        item['title'] = response.xpath('//div[@class="title"]/h1/text()').get()

        content = response.xpath('//div[@id="content"]//text()').getall()
        # try whether slicing with [2:] could simply drop the first two entries instead
        content = list(map(lambda x: x.replace('\r\n', ''), content))
        content = list(map(lambda x: x.replace('ads_yuedu_txt();', ''), content))
        item['content'] = list(map(lambda x: x.replace('\xa0', ''), content))
        item['url'] = response.url

        yield item
items.py
import scrapy


class Biquge5200Item(scrapy.Item):
    name = scrapy.Field()      # book title
    zuozhe = scrapy.Field()    # author
    fenlei = scrapy.Field()    # category
    title = scrapy.Field()     # chapter title
    content = scrapy.Field()   # chapter text
    url = scrapy.Field()       # chapter URL
middlewares.py
import user_agent


class Biquge5200DownloaderMiddleware(object):

    def process_request(self, request, spider):
        # attach a randomly generated User-Agent to every outgoing request
        request.headers['user-agent'] = user_agent.generate_user_agent()
This is the random user-agent library I learned about from a tutorial video, though I can no longer remember exactly how it's supposed to be imported...
Since the site has no anti-scraping measures, I only added a user-agent out of habit; if you need more, write your own UA and proxy IP middleware (a rough sketch of a UA middleware follows below).
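If you'd rather not depend on the user_agent package, a minimal sketch of a random User-Agent downloader middleware could look like this. RandomUserAgentMiddleware and the USER_AGENTS list are my own placeholders, not part of the original project; if you use it, register it in DOWNLOADER_MIDDLEWARES in place of Biquge5200DownloaderMiddleware.

import random

# hypothetical hard-coded pool; extend it with whatever UA strings you like
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]


class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        # pick a different User-Agent for each outgoing request
        request.headers['user-agent'] = random.choice(USER_AGENTS)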
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class Biquge5200Pipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient()
        self.db = self.client.bqg

    def process_item(self, item, spider):
        name = item['name']
        zuozhe = item['zuozhe']
        fenlei = item['fenlei']
        # one collection per book, named "<title> <author> <category>"
        coll = ' '.join([name, zuozhe, fenlei])
        # insert_one replaces the old insert(), which was removed in newer pymongo
        self.db[coll].insert_one({"_id": item['url'],
                                  "title": item['title'],
                                  "content": item['content']})
        return item

    def close_spider(self, spider):
        self.client.close()
The book title, author, and category from the item are joined to form the collection name, and _id is replaced with item['url'], so the chapters can later be sorted with find().sort("_id", 1). By default everything is stored in the local MongoDB; a small read-back example follows below.
On Windows, start MongoDB with: net start mongodb
I'm not sure about the Linux side; please look it up yourself.
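As a quick sanity check, here is a minimal sketch of reading a book back out of MongoDB sorted on _id; the collection name below is a hypothetical example of the "<title> <author> <category>" naming built in the pipeline.

import pymongo

client = pymongo.MongoClient()
db = client.bqg

# hypothetical collection name following the pipeline's "<title> <author> <category>" pattern
coll = db['Some Novel Some Author Fantasy']

# _id holds the chapter URL, so this lists the chapters in URL order
for doc in coll.find().sort('_id', 1):
    print(doc['title'])

client.close()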
settings.py
BOT_NAME = 'biquge5200'

SPIDER_MODULES = ['biquge5200.spiders']
NEWSPIDER_MODULE = 'biquge5200.spiders'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

DOWNLOADER_MIDDLEWARES = {
    'biquge5200.middlewares.Biquge5200DownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'biquge5200.pipelines.Biquge5200Pipeline': 300,
}
Done...
If the crawl feels too slow, use scrapy_redis to make it distributed; you can run several workers on a single machine, which suits the case where you only have one computer. I'll assume you already have scrapy_redis installed.
Add a few parameters to settings.py:
# use the scrapy_redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# deduplicate requests with a Redis set
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# keep the queue so an interrupted crawl can resume
SCHEDULER_PERSIST = True
# scheduling priority
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

REDIS_HOST = 'localhost'   # ---------> this machine's IP
REDIS_PORT = 6379
In the main spider, change the following code:
class BqgSpider(CrawlSpider):
    name = 'bqg'
    allowed_domains = ['bqg5200.com']
    start_urls = ['https://www.bqg5200.com/']
to:
from scrapy_redis.spiders import RedisCrawlSpider   # -----> new import


class BqgSpider(RedisCrawlSpider):                   # ------> change the spider's parent class
    name = 'bqg'
    allowed_domains = ['bqg5200.com']
    # start_urls = ['https://www.bqg5200.com/']
    redis_key = 'bqg:start_urls'   # ------> remember this key, it is used in the Redis CLI; the convention is <spider name>:start_urls
Start MongoDB.
Start the Redis service: go to the Redis installation directory and run redis-server.exe redis.windows.conf
Open several cmd windows, change into the project's spiders directory in each, and run scrapy runspider <spider file> (e.g. scrapy runspider bqg.py); the spiders start up and wait in a listening state.
Open the Redis CLI: redis-cli.exe
Enter the start key (the redis_key defined in the spider) together with the URL of the page you want the crawl to begin from.
Once that all checks out, just wait for the spiders to crawl. The single-machine command sequence is summarized below.
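Putting the steps together, the Windows command sequence looks roughly like this; the bqg.py filename is an assumption based on the spider name, and redis-server.exe must be run from the Redis installation directory.

:: 1. start MongoDB and the Redis service
net start mongodb
redis-server.exe redis.windows.conf

:: 2. in each extra cmd window, from the project's spiders directory, start a waiting crawler
scrapy runspider bqg.py

:: 3. open the Redis CLI and push the start URL (the lpush is typed at the redis-cli prompt)
redis-cli.exe
lpush bqg:start_urls https://www.bqg5200.com/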
To crawl with multiple machines, decide which one acts as the master and change REDIS_HOST in settings.py to the master's IP.
You also need to think about where the scraped data ends up. If each crawler node keeps its own copy, nothing needs to change; if you want everything collected on one machine, modify pipelines.py:
self.client = pymongo.MongoClient(host="<IP of the machine collecting the data>", port=27017)  # 27017 is MongoDB's default port
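In context, open_spider in the pipeline would then look something like this; the host value is a placeholder for your aggregation machine's IP, not a real address.

import pymongo


class Biquge5200Pipeline(object):

    def open_spider(self, spider):
        # point every crawler node at the machine that collects the data;
        # "192.168.1.100" is a placeholder, 27017 is MongoDB's default port
        self.client = pymongo.MongoClient(host="192.168.1.100", port=27017)
        self.db = self.client.bqg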
Copy the modified project to every crawler node and start them, start MongoDB on the machine that aggregates the data, start the Redis service on the master, then open the Redis CLI and enter lpush <spider name>:start_urls <url>.