Understanding the Scrapy Framework
阿新 · Published: 2019-01-06
The foundation of Scrapy: Twisted
Internally, Scrapy implements concurrent crawling through an event-loop mechanism.
Before: multiple request tasks executed one at a time, each blocking until the previous one finishes.

```python
import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']
for item in url_list:
    response = requests.get(item)  # blocks until this request completes
    print(response.text)
```
Now: with Twisted, a single thread dispatches every request through one event loop.

```python
from twisted.web.client import getPage  # deprecated in modern Twisted, but standard in this era
from twisted.internet import defer, reactor

# Part 1: the agent starts accepting tasks
def callback(contents):
    print(contents)

deferred_list = []
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))  # fire the request, don't wait
    deferred.addCallback(callback)                   # notify me when the result is in
    deferred_list.append(deferred)

# Part 2: once every task has finished, stop the loop
dlist = defer.DeferredList(deferred_list)
def all_done(arg):
    reactor.stop()
dlist.addBoth(all_done)

# Part 3: let the agent get to work
reactor.run()
```
What is Twisted?
- Officially: an asynchronous, non-blocking module based on an event loop.
- In plain terms: a single thread can issue HTTP requests to multiple targets at the same time.
Non-blocking: no waiting; all requests are sent out back to back. When I initiate connections for request A, request B, and request C, I don't wait for one connection to return before opening the next; I send one and immediately send the next.
```python
import socket

# Non-blocking sockets: connect() returns immediately instead of waiting for
# the handshake, so all three connections can be kicked off back to back.
for ip in ['1.1.1.1', '1.1.1.2', '1.1.1.3']:
    sk = socket.socket()
    sk.setblocking(False)
    try:
        sk.connect((ip, 80))
    except BlockingIOError:
        pass  # the connection is still in progress; that's the point
```
Asynchronous: callbacks. As soon as I get hold of the A, B, or C that callback_A, callback_B, and callback_C are waiting for, I notify them proactively.
```python
def callback(contents):
    print(contents)
```
Event loop: I keep looping over the three socket tasks (request A, request B, request C), checking each one's state: has the connection succeeded yet? has the result come back yet?
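The sketch below ties the three ideas together with plain `socket` and `select`. It is only an illustration of the mechanism, an assumption about the shape of the idea rather than Twisted's actual internals:

```python
import select
import socket

def callback(contents):
    print(contents[:80])

# Non-blocking: fire off every connection without waiting
hosts = ['www.bing.com', 'segmentfault.com', 'stackoverflow.com']
to_send, to_read = {}, {}
for host in hosts:
    sk = socket.socket()
    sk.setblocking(False)
    try:
        sk.connect((host, 80))
    except BlockingIOError:
        pass  # handshake in progress
    to_send[sk] = host

# Event loop: keep polling every socket's state
while to_send or to_read:
    readable, writable, _ = select.select(list(to_read), list(to_send), [], 0.05)
    for sk in writable:   # connection established -> send the HTTP request
        sk.send(('GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % to_send.pop(sk)).encode())
        to_read[sk] = None
    for sk in readable:   # result arrived -> hand it to the callback (asynchronous)
        callback(sk.recv(8192))  # first chunk is enough for the demo
        sk.close()
        del to_read[sk]
```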
Scrapy
I. Commands
```bash
scrapy startproject xx               # create a project
cd xx                                # enter the project directory
scrapy genspider chouti chouti.com   # generate a spider
# ... write the spider ...
scrapy crawl chouti --nolog          # run the spider (--nolog suppresses log output)
```
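Running `scrapy genspider chouti chouti.com` produces a skeleton roughly like the following (as generated by Scrapy's basic template):

```python
# xx/spiders/chouti.py
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'                      # the name used by `scrapy crawl chouti`
    allowed_domains = ['chouti.com']     # off-site requests are filtered out
    start_urls = ['http://chouti.com/']  # fetched first; responses go to parse()

    def parse(self, response):
        pass  # the parsing logic described below goes here
```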
II. Writing the spider
Everything below happens inside a spider callback such as `def parse(self, response):`.

1. The response
The response object encapsulates all data related to the HTTP response:

- response.text: the body decoded to text
- response.encoding: the encoding used to decode it
- response.body: the raw bytes
- response.meta['depth']: the crawl depth of this response
- response.request: the request that produced this response; a request encapsulates the URL to fetch and the callback to run once the download completes
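A small illustrative callback (a hypothetical example, not from the original post) touching each of these attributes:

```python
def parse(self, response):
    print(response.encoding)           # e.g. 'utf-8'
    print(response.text[:100])         # body decoded with that encoding
    print(len(response.body))          # raw bytes as received
    print(response.meta.get('depth'))  # set by the default-enabled DepthMiddleware
    print(response.request.url)        # URL of the request that led here
```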
2. Parsing
response.css('...') returns a SelectorList of Selector objects (the same type response.xpath('...') returns)
response.css('...').extract() returns a list of strings
response.css('...').extract_first() returns the first element of that list, or None if it is empty
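These three calls can be tried outside a spider with a standalone Selector (the HTML here is made up for the demo):

```python
from scrapy.selector import Selector

html = '<div><a href="/a">first</a><a href="/b">second</a></div>'
sel = Selector(text=html)

print(sel.css('a::text'))                  # SelectorList of Selector objects
print(sel.css('a::text').extract())        # ['first', 'second']
print(sel.css('a::text').extract_first())  # 'first'
```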
XPath version, parsing a Jobbole article detail page:

```python
import re
import datetime

def parse_detail(self, response):
    items = JobboleArticleItem()
    img_url = response.meta.get('img_url', '')  # handed over by the listing page

    title = response.xpath('//div[@class="entry-header"]/h1/text()')[0].extract()
    create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace('·', '').strip()
    praise_nums = int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first())
    fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first()
    try:
        fav_nums = int(re.match(r'.*?(\d+).*', fav_nums).group(1))
    except Exception:
        fav_nums = 0
    comment_nums = response.xpath('//a[contains(@href,"#article-comment")]/span/text()').extract()[0]
    try:
        comment_nums = int(re.match(r'.*?(\d+).*', comment_nums).group(1))
    except Exception:
        comment_nums = 0
    content = response.xpath('//div[@class="entry"]').extract()[0]
    tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
    tag_list = [tag for tag in tag_list if not tag.strip().endswith('評論')]  # drop the "N comments" pseudo-tag
    tags = ",".join(tag_list)

    items['title'] = title
    try:
        create_date = datetime.datetime.strptime(create_date, '%Y/%m/%d').date()
    except Exception:
        create_date = datetime.datetime.now()
    items['date'] = create_date
    items['url'] = response.url
    items['url_object_id'] = get_md5(response.url)  # get_md5: the project's URL-hashing helper
    items['img_url'] = [img_url]
    items['praise_nums'] = praise_nums
    items['fav_nums'] = fav_nums
    items['comment_nums'] = comment_nums
    items['content'] = content
    items['tags'] = tags
```
The same extraction with CSS selectors (the XPath equivalents for attribute and text selection would be /@href and /text()):

```python
title = response.css('.entry-header h1::text')[0].extract()
create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·', '').strip()
praise_nums = int(response.css('.vote-post-up h10::text').extract_first())
fav_nums = response.css('.bookmark-btn::text').extract_first()
try:
    fav_nums = int(re.match(r'.*?(\d+).*', fav_nums).group(1))
except Exception:
    fav_nums = 0
comment_nums = response.css('a[href="#article-comment"] span::text').extract()[0]
try:
    comment_nums = int(re.match(r'.*?(\d+).*', comment_nums).group(1))
except Exception:
    comment_nums = 0
content = response.css('.entry').extract()[0]
tag_list = response.css('p.entry-meta-hide-on-mobile a::text').extract()
tag_list = [tag for tag in tag_list if not tag.strip().endswith('評論')]
tags = ",".join(tag_list)
```
ItemLoader version:

```python
def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
    item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
    item_loader.add_css('title', '.entry-header h1::text')
    item_loader.add_value('url', response.url)
    item_loader.add_value('url_object_id', get_md5(response.url))
    item_loader.add_css('date', 'p.entry-meta-hide-on-mobile::text')
    item_loader.add_value('img_url', [img_url])
    item_loader.add_css('praise_nums', '.vote-post-up h10::text')
    item_loader.add_css('fav_nums', '.bookmark-btn::text')
    item_loader.add_css('comment_nums', "a[href='#article-comment'] span::text")
    item_loader.add_css('tags', 'p.entry-meta-hide-on-mobile a::text')
    item_loader.add_css('content', 'div.entry')
    items = item_loader.load_item()
    yield items
```
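ArticleItemLoader is not defined in the post; a plausible minimal definition (an assumption) sets TakeFirst as the default output processor, which is why the loader version needs no [0] or extract_first() calls: each field automatically receives the first matched value.

```python
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst  # itemloaders.processors in newer Scrapy

class ArticleItemLoader(ItemLoader):
    # every add_css()/add_value() result is reduced to its first element
    default_output_processor = TakeFirst()
```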
3. Issuing further requests
```python
# requires: from scrapy import Request; from urllib import parse

# fetch another page and route its response back to self.parse
yield Request(url='xxxx', callback=self.parse)

# resolve a relative link against the current URL and pass data along in meta;
# parse_detail can read it back with response.meta.get('img_url')
yield Request(url=parse.urljoin(response.url, post_url),
              meta={'img_url': img_url},
              callback=self.parse_detail)
```
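Put together, a listing-page callback commonly looks like the sketch below. The CSS selectors and field names are assumptions patterned on the Jobbole pages that the parse_detail above targets:

```python
from urllib import parse
from scrapy import Request

def parse(self, response):
    # one node per article card on the listing page (hypothetical selector)
    post_nodes = response.css('#archive .floated-thumb .post-thumb a')
    for post_node in post_nodes:
        img_url = post_node.css('img::attr(src)').extract_first('')
        post_url = post_node.css('::attr(href)').extract_first('')
        # schedule the detail page; parse_detail reads img_url from response.meta
        yield Request(url=parse.urljoin(response.url, post_url),
                      meta={'img_url': parse.urljoin(response.url, img_url)},
                      callback=self.parse_detail)
```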