(3). Recursively fetching all page numbers
# -*- coding: utf-8 -*-
import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Find the div with id="dig_lcpage" anywhere in the document,
        # find every a tag inside that div,
        # and take the href attribute of each a tag.
        # extract() converts the selector results to plain strings.
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res:
            print(url)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/2
        '''
        # Notice the duplicate: we start on page 1 and the pager shows ten pages,
        # so the "next page" link also points to page 2 -- its href appears twice.
        # A set can be used to de-duplicate:
        urls = set()
        for url in res:
            if url in urls:
                print(f"{url}--this url already exists")
            else:
                urls.add(url)
                print(url)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/2--this url already exists
        '''
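Assuming this file sits in a normal Scrapy project (created with scrapy startproject), the spider is run by its name attribute; --nolog suppresses Scrapy's own logging so only the printed hrefs show up:

scrapy crawl get_chouti --nolog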
# -*- coding: utf-8 -*-
import hashlib

import scrapy


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Above we compared the raw urls directly, but normally we don't:
        # the urls may be kept in a cache or in a database, and long urls
        # waste space, so we hash them and compare the digests instead.
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # The desired a tags can also be selected directly:
        '''
        Match a tags whose href starts with "/all/hot/recent/":
        res = response.xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract()
        Or match them with a regular expression; re:test is the fixed syntax:
        res = response.xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        '''
        md5_urls = set()
        for url in res:
            md5_url = self.md5(url)
            if md5_url in md5_urls:
                print(f"{url}--this url already exists")
            else:
                md5_urls.add(md5_url)
                print(url)

    def md5(self, url):
        m = hashlib.md5()
        m.update(bytes(url, encoding="utf-8"))
        return m.hexdigest()
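The space argument can be checked in isolation: hexdigest() always returns 32 hex characters no matter how long the input url is. A minimal standalone sketch (the sample url is just an illustration):

import hashlib

url = "/all/hot/recent/2"
digest = hashlib.md5(bytes(url, encoding="utf-8")).hexdigest()
# len(digest) is always 32, regardless of len(url)
print(len(digest), digest)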
# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # parse runs over and over during the recursive crawl, so md5_urls must be
    # a class attribute, not a local variable inside parse.
    md5_urls = set()

    def parse(self, response):
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res:
            md5_url = self.md5(url)
            if md5_url not in self.md5_urls:
                print(url)
                self.md5_urls.add(md5_url)
                # Hand the new url to the scheduler.
                url = "https://dig.chouti.com%s" % url
                yield Request(url=url, callback=self.parse)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/1
        /all/hot/recent/11
        /all/hot/recent/12
        ........
        ........
        ........
        /all/hot/recent/115
        /all/hot/recent/116
        /all/hot/recent/117
        /all/hot/recent/118
        /all/hot/recent/119
        /all/hot/recent/120
        '''

    def md5(self, url):
        m = hashlib.md5()
        m.update(bytes(url, encoding="utf-8"))
        return m.hexdigest()
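As a side note, Scrapy 1.4+ also provides response.follow, which resolves a relative href against the current response url, so the manual string formatting above can be avoided. A sketch of just the loop body under that assumption (not how the original post does it):

# inside parse(), assuming Scrapy >= 1.4
for url in res:
    md5_url = self.md5(url)
    if md5_url not in self.md5_urls:
        self.md5_urls.add(md5_url)
        # response.follow joins the relative href with response.url for us
        yield response.follow(url, callback=self.parse)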
As you can see, the spider dug up every page number. If you don't want it to find them all, you can limit the crawl depth.
Add DEPTH_LIMIT = 2 to the settings: only two depths are crawled, i.e. after the current ten pages are finished, the crawl follows links for two more depths.
If DEPTH_LIMIT < 0, only one depth (the start pages) is crawled; if it equals 0 (the default), there is no limit and everything is crawled; if it is greater than 0, the crawl goes down to the specified depth.
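For example, the limit can go in settings.py, or on the spider itself through the standard custom_settings class attribute (shown here as a sketch, not taken from the original post):

# settings.py
DEPTH_LIMIT = 2

# or per spider:
class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    custom_settings = {'DEPTH_LIMIT': 2}

With this in place, the same spider from above stops much earlier: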
# -*- coding: utf-8 -*-
import hashlib

import scrapy
from scrapy.http import Request


class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # parse runs over and over during the recursive crawl, so md5_urls must be
    # a class attribute, not a local variable inside parse.
    md5_urls = set()

    def parse(self, response):
        res = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res:
            md5_url = self.md5(url)
            if md5_url not in self.md5_urls:
                print(url)
                self.md5_urls.add(md5_url)
                # Hand the new url to the scheduler.
                url = "https://dig.chouti.com%s" % url
                yield Request(url=url, callback=self.parse)
        '''
        /all/hot/recent/2
        /all/hot/recent/3
        /all/hot/recent/4
        /all/hot/recent/5
        /all/hot/recent/6
        /all/hot/recent/7
        /all/hot/recent/8
        /all/hot/recent/9
        /all/hot/recent/10
        /all/hot/recent/1
        /all/hot/recent/11
        /all/hot/recent/12
        /all/hot/recent/13
        /all/hot/recent/14
        /all/hot/recent/15
        /all/hot/recent/16
        /all/hot/recent/17
        /all/hot/recent/18
        '''

    def md5(self, url):
        m = hashlib.md5()
        m.update(bytes(url, encoding="utf-8"))
        return m.hexdigest()
So after the current ten pages are crawled, going one depth deeper reaches page 14, and one depth after that reaches page 18.