Crawling two-level web pages
阿新 • Published: 2017-10-06
1. Create a new sun0769 project with Scrapy
scrapy startproject sun0769
2. Define the fields to scrape in items.py
import scrapy

class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()
3. Quickly generate a CrawlSpider template
scrapy genspider -t crawl dongguan wz.sun0769.com
Note: the spider name here must not be the same as the project name.
4. Open dongguan.py and write the code
# -*- coding: utf-8 -*-
# import the scrapy module
import scrapy
# import the link-extractor class, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the item class defined in items.py
from sun0769.items import Sun0769Item

class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # extractors for pagination links and for question detail pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        Rule(pagelink, follow=True),
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print response.url
        item = Sun0769Item()
        # xpath() returns a list
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(" ")[-1].split(":")[-1]
        item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(":")[1].split(" ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # hand the item over to the pipeline
        yield item
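The title/number extraction above depends entirely on the layout of the header string that the XPath pulls out. A quick illustration with a made-up header (the exact wording on the real page is an assumption here; it appears to be "編號:<number> 提問:<title>" with fullwidth colons) shows how the chained splits pull the two fields apart:

```python
# Hypothetical header text in the layout the XPath seems to return:
# "編號:<number> 提問:<title>" (fullwidth colons, ASCII space between fields).
header = "編號:191166 提問:關於道路維修的諮詢"  # made-up example values

# last space-separated chunk, then everything after the fullwidth colon
title = header.split(" ")[-1].split(":")[-1]
# second colon-separated chunk, then the part before the space
number = header.split(":")[1].split(" ")[0]

print(title)   # 關於道路維修的諮詢
print(number)  # 191166
```

Note how fragile this is: if the title itself ever contains an ASCII space or a fullwidth colon, both splits break, which is probably why the "content matching is still tricky" problem is listed at the end.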
5. Write the code in pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.filename = open("dongguan.json", "w")

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese readable
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
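The pipeline writes one JSON object per line, and `ensure_ascii=False` is what keeps the Chinese text readable instead of `\uXXXX` escapes. A minimal demonstration (with a hypothetical item; note the pipeline above is Python 2 style, and under Python 3 you would open the file with `encoding="utf-8"` and drop the `.encode` call):

```python
import json

item = {"title": "測試標題", "number": "191166"}  # hypothetical scraped item

# without ensure_ascii=False, non-ASCII characters are escaped
escaped = json.dumps(item)
# with ensure_ascii=False, the Chinese text is written as-is
line = json.dumps(item, ensure_ascii=False) + "\n"

print(escaped)  # {"title": "\u6e2c\u8a66\u6a19\u984c", "number": "191166"}
print(line)     # {"title": "測試標題", "number": "191166"}
```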
6. Configure the relevant settings in settings.py
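The post does not show the settings themselves. Presumably the essential part is registering the pipeline written above; a sketch of what settings.py would need (the priority value 300 is the conventional default, and `DOWNLOAD_DELAY` is just a polite-crawling assumption, not something the original requires):

```python
# settings.py (sketch): register the pipeline defined in pipelines.py.
BOT_NAME = "sun0769"
SPIDER_MODULES = ["sun0769.spiders"]
NEWSPIDER_MODULE = "sun0769.spiders"

# enable the JSON-writing pipeline; lower numbers run first (0-1000)
ITEM_PIPELINES = {
    "sun0769.pipelines.TencentPipeline": 300,
}

# optional: slow the crawler down to be gentle with the site
DOWNLOAD_DELAY = 1
```

Without the `ITEM_PIPELINES` entry, `process_item` is never called and dongguan.json stays empty.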
Open problems:
1. How to merge content from different pages into a single item
2. Content matching is still somewhat tricky (XPath, re)