Crawler Series 6: Manual Request Sending, the Five Core Components, Request Parameter Passing, and a First Look at Middleware
- Persistent storage via pipelines:
    - Parse the data (spider class)
    - Wrap the parsed data in an item object (spider class)
    - Submit the item to the pipeline: yield item (spider class)
    - Receive the item object in the pipeline class's process_item and perform persistent storage in any form (pipeline class)
    - Enable the pipeline in the settings file
- Details:
    - How to back up the scraped data?
        - One pipeline class corresponds to one storage backend
    - With multiple pipeline classes, do they all receive the item submitted by the spider file?
        - Only the highest-priority pipeline receives the item directly from the spider; the remaining pipeline classes receive the item handed down from the higher-priority ones, so every process_item should end with return item (a sketch follows below)
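A minimal sketch of pipelines.py with two pipeline classes; the class names and storage targets here are illustrative, not from the original project. The key point is that each process_item returns the item so the next, lower-priority pipeline also receives it:

    # pipelines.py — a minimal sketch; class names and backends are hypothetical
    class LocalFilePipeline:
        fp = None

        def open_spider(self, spider):
            # runs once when the spider starts
            self.fp = open('./data.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(str(dict(item)) + '\n')
            return item  # hand the item on to the next, lower-priority pipeline

        def close_spider(self, spider):
            # runs once when the spider finishes
            self.fp.close()

    class BackupPipeline:
        def process_item(self, item, spider):
            # the backup target: receives the item returned by LocalFilePipeline
            print('backup:', dict(item))
            return item

Both classes are registered in settings.py; the smaller number means higher priority, so LocalFilePipeline gets the item from the spider first (huyaAll as the project name is taken from the example below):

    ITEM_PIPELINES = {
        'huyaAll.pipelines.LocalFilePipeline': 300,
        'huyaAll.pipelines.BackupPipeline': 301,
    }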
- Full-site crawling based on the Spider parent class
    - Full-site crawling: scrape the page data for every page number
    - Manual request sending (GET):
        yield scrapy.Request(url, callback)
    - Summary of yield:
        - Submitting an item to the pipeline: yield item
        - Sending a request manually: yield scrapy.Request(url, callback)
    - Sending a POST request manually (see the sketch below):
        yield scrapy.FormRequest(url, formdata, callback), where formdata is a dict of request parameters
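A minimal sketch of a manual POST request; the spider name, URL, and form fields are placeholders. Overriding start_requests makes the initial requests go out as POST instead of the default GET:

    import scrapy

    class PostDemoSpider(scrapy.Spider):
        # hypothetical spider: URL and form fields are placeholders
        name = 'postDemo'
        start_urls = ['https://www.xxx.com/post']

        def start_requests(self):
            for url in self.start_urls:
                data = {'kw': 'scrapy'}  # formdata: a dict of request parameters
                yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

        def parse(self, response):
            print(response.text)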
Huya full-site crawl (multiple pages)
Define a spider class:

    import scrapy
    from huyaAll.items import HuyaallItem

    class HuyaSpider(scrapy.Spider):
        name = 'huya'
        # allowed_domains = ['www.ccc.com']
        start_urls = ['https://www.huya.com/g/xingxiu']
        url = 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&page=%s'

        def parse(self, response):
            li_list = response.xpath('//*[@id="js-live-list"]/li')
            for li in li_list:
                title = li.xpath('./a[2]/text()').extract_first()  # may need to strip 【】
                author = li.xpath('./span/span[1]/i/text()').extract_first()
                hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()
                # wrap the parsed fields in an item object
                item = HuyaallItem()
                item['title'] = title
                item['author'] = author
                item['hot'] = hot
                yield item  # submit to the pipeline
            # manually send GET requests for the remaining pages
            for page in range(2, 5):
                new_url = self.url % page
                yield scrapy.Request(url=new_url, callback=self.parse_other)

        def parse_other(self, response):
            print(response.text)  # parsing logic for the paged data not written yet
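The HuyaallItem imported above would be declared in items.py roughly like this; a sketch matching the three fields the spider assigns:

    # items.py — every field the spider assigns must be declared here
    import scrapy

    class HuyaallItem(scrapy.Item):
        title = scrapy.Field()
        author = scrapy.Field()
        hot = scrapy.Field()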
- Scrapy's five core components
    Engine (Scrapy Engine)
        Handles the data flow of the entire system and triggers events (the core of the framework).
    Scheduler
        Accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses, i.e. links, of the pages to crawl): it decides which URL to crawl next, and it also removes duplicate URLs.
    Downloader
        Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
    Spiders
        The spiders do the main work: they extract the needed information from specific pages, i.e. the so-called items. Users can also extract links from them so that Scrapy continues crawling the next page.
    Item Pipeline
        Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and clearing out unneeded information. After a page is parsed by a spider, it is sent to the item pipeline, where the data passes through several processing steps in a specific order.
- Passing parameters between Scrapy requests
    - Purpose: enables deep (multi-level) crawling.
    - Use case: the data being scraped is not all on the same page
    - Pass the item: yield scrapy.Request(url, callback, meta)
    - Receive the item: response.meta

Crawl every movie's name and description across the whole site:

    import scrapy
    from moivePro.items import MoiveproItem

    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.4567kan.com/index.php/vod/show/class/喜劇/id/1.html']
        url = 'http://www.4567kan.com/index.php/vod/show/class/喜劇/id/1/page/%s.html'
        page = 1

        def parse(self, response):
            print('Crawling movie page {}...'.format(self.page))
            li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
            url_h = 'http://www.4567kan.com'
            for li in li_list:
                item = MoiveproItem()
                name = li.xpath('./div/div/h4/a/text()').extract_first()
                item['name'] = name
                # request parameter passing: Request forwards the meta dict to the callback
                detail_url = url_h + li.xpath('./div/div/h4/a/@href').extract_first()
                yield scrapy.Request(detail_url, callback=self.parse_other, meta={'item': item})
            if self.page < 5:
                self.page += 1
                new_url = self.url % self.page
                yield scrapy.Request(new_url, callback=self.parse)

        def parse_other(self, response):
            # receive the passed data (a dict)
            item = response.meta['item']
            desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[3]/text()').extract_first()
            item['desc'] = desc
            yield item
- Improving Scrapy's crawling efficiency
    - It only takes a few options in the settings file:
    Increase concurrency:
        By default Scrapy allows 16 concurrent requests; this can be raised as appropriate. In settings, set CONCURRENT_REQUESTS = 100 for a concurrency of 100.
    Lower the log level:
        Running Scrapy produces a lot of log output; to reduce CPU usage, set the log level to INFO or ERROR. In settings: LOG_LEVEL = 'INFO'
    Disable cookies:
        If cookies are not genuinely needed, disable them while scraping to reduce CPU usage and speed up crawling. In settings: COOKIES_ENABLED = False
    Disable retries:
        Re-requesting (retrying) failed HTTP requests slows crawling down, so retries can be disabled. In settings: RETRY_ENABLED = False
    Reduce the download timeout:
        When crawling very slow links, lowering the download timeout lets stuck connections be abandoned quickly, improving efficiency. In settings: DOWNLOAD_TIMEOUT = 10 (a 10 s timeout)
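Taken together, the relevant lines in settings.py would look roughly like this (the values are the ones suggested above):

    # settings.py — efficiency-related options
    CONCURRENT_REQUESTS = 100   # raise concurrency (default is 16)
    LOG_LEVEL = 'ERROR'         # only log errors, cutting logging overhead
    COOKIES_ENABLED = False     # skip cookie handling when cookies aren't needed
    RETRY_ENABLED = False       # don't retry failed requests
    DOWNLOAD_TIMEOUT = 10       # give up on a slow connection after 10 s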
- Scrapy middleware
    - Spider middleware
    - Downloader middleware (***): sits between the engine and the downloader
        - Role: intercept all requests and responses in batch
        - Why intercept requests?
            - To tamper with request headers (UA spoofing)
                1. Enable the downloader middleware in settings (snippet below)
                2. Do not write a UA in settings
                3. Rewrite class MiddleproDownloaderMiddleware in middlewares.py;
                   a UA pool can be built there
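Step 1 means uncommenting the downloader middleware entry in settings.py; the module path below assumes the project is named middlePro (inferred from the class name), and 543 is the template's default priority:

    # settings.py — enable the downloader middleware
    DOWNLOADER_MIDDLEWARES = {
        'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
    }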
    # middlewares.py
    import random

    # UA pool for User-Agent spoofing
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    class MiddleproDownloaderMiddleware:
        # intercept every request
        def process_request(self, request, spider):
            # UA spoofing: pick a random User-Agent from the pool
            request.headers['User-Agent'] = random.choice(user_agent_list)
            print(request.headers['User-Agent'])
            return None

        # intercept every response
        def process_response(self, request, response, spider):
            return response

        # intercept request objects that raised an exception
        def process_exception(self, request, exception, spider):
            pass
        - Modify the IP a request goes out from (proxy)
The middlewares.py file is the same as above; the only change is that process_request also sets a proxy:

    class MiddleproDownloaderMiddleware:
        # intercept every request
        def process_request(self, request, spider):
            # UA spoofing
            request.headers['User-Agent'] = random.choice(user_agent_list)
            print(request.headers['User-Agent'])
            # proxy: route this request through a different IP
            request.meta['proxy'] = 'http://163.204.94.131:9999'
            print(request.meta['proxy'])
            return None

        # process_response and process_exception are unchanged
        - Why intercept responses?
            - To tamper with response data, or swap out the response object entirely
            - Example: scrape NetEase News article titles and bodies
    - Workflow for using Selenium inside Scrapy
        - Define a bro attribute on the spider class: an instantiated browser object
        - Override the parent class's closed(self, spider) in the spider class and close bro there
        - Do the browser automation inside the middleware (a sketch follows below)
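A minimal sketch of that middleware step, assuming the spider defines self.bro (a Selenium webdriver) and closes it in closed() as described above; the middleware class name and the fixed sleep are illustrative. The idea is to let the browser render the page, then wrap the rendered source in a new HtmlResponse that replaces the original response:

    # middlewares.py — swap in a Selenium-rendered response (sketch)
    import time
    from scrapy.http import HtmlResponse

    class SeleniumDownloaderMiddleware:
        def process_response(self, request, response, spider):
            bro = spider.bro                # the browser instantiated in the spider class
            bro.get(request.url)            # let the browser load and render the page
            time.sleep(2)                   # crude wait for dynamic content to load
            page_text = bro.page_source     # the fully rendered HTML
            # replace the original response with one built from the rendered source
            return HtmlResponse(url=request.url, body=page_text,
                                encoding='utf-8', request=request)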
- Homework:
    - NetEase News
    - Scrape the image data from http://sc.chinaz.com/tupian/xingganmeinvtupian.html