
Crawling Two-Level Web Pages


1. Create a new sun0769 project with Scrapy

scrapy startproject sun0769
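
This produces the standard Scrapy project skeleton. The exact file list varies slightly across Scrapy versions, so treat the layout below as approximate:

sun0769/
    scrapy.cfg            # deploy configuration
    sun0769/
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 5)
        settings.py       # project settings (step 6)
        spiders/
            __init__.py   # spiders live here (steps 3-4)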

2. Define the fields to crawl in items.py

import scrapy


class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()

3. Quickly generate a CrawlSpider template

scrapy genspider -t crawl dongguan wz.sun0769.com

Note: the spider name given here ("dongguan") must not be the same as the project name ("sun0769").
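
For orientation, the file that genspider creates looks roughly like the sketch below; the exact placeholder comments and rule depend on the Scrapy version, and all of it gets replaced in the next step:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/']

    rules = (
        # placeholder rule; replaced with the real patterns in step 4
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item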

4. Open dongguan.py and write the spider code

# -*- coding: utf-8 -*-
# Import the scrapy module
import scrapy
# Import the link-extractor class, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# Import the item class defined in items.py
from sun0769.items import Sun0769Item


class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # Matches pagination links on the list pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    # Matches links to the detail pages
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        # Follow list pages without parsing them
        Rule(pagelink, follow=True),
        # Parse every detail page with parse_item
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print response.url
        item = Sun0769Item()
        # xpath() returns a list
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(" ")[-1].split(":")[-1]
        # item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0]
        # The header text looks like "提问:... 编号:NNNNN"; the separator may be a
        # full-width colon ("：") depending on the page, so adjust if needed
        item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(":")[1].split(" ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # Hand the item off to the pipeline
        yield item
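
Since the XPath expressions are the fiddly part (see question 2 at the end), a convenient way to develop them is the Scrapy shell. A minimal example, where the URL is a placeholder to be replaced with a real detail-page URL matching the /question/\d+/\d+.shtml pattern:

scrapy shell "http://wz.sun0769.com/question/<date>/<id>.shtml"

# Then try the expressions interactively inside the shell:
response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()
response.xpath('//div/span[@class="qgrn"]/text()').extract()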

5. Write the pipeline code in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.filename = open("dongguan.json", "w")

    def process_item(self, item, spider):
        # Serialize each item as one JSON line (Python 2: write encoded bytes)
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.filename.close()

6. Configure the relevant settings in settings.py
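
At a minimum the pipeline from step 5 has to be registered. A minimal sketch; the delay and robots values are assumed extras, not part of the original project:

# settings.py
ITEM_PIPELINES = {
    # Register the pipeline class from step 5; 300 is its priority
    'sun0769.pipelines.TencentPipeline': 300,
}
# Optional, assumed values: slow down requests and skip robots.txt
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = False

Then run the spider from the project root:

scrapy crawl dongguan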


Questions:

1. How to merge content from different pages into a single item (one approach is sketched below)

2. Matching the content is still somewhat difficult (XPath, re)
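
For question 1, a common pattern (not used in the post above) is to fill part of an item on one page, pass it to the next request via meta, and finish it in the second callback. A minimal sketch, to be placed inside the spider class; detail_url and all XPaths here are hypothetical placeholders:

# Inside DongguanSpider (scrapy and Sun0769Item are already imported above)
def parse_first_page(self, response):
    item = Sun0769Item()
    item['title'] = response.xpath('//h1/text()').extract_first()
    detail_url = response.xpath('//a[@class="detail"]/@href').extract_first()
    # Carry the half-filled item to the next callback via meta
    yield scrapy.Request(detail_url, meta={'item': item},
                         callback=self.parse_second_page)

def parse_second_page(self, response):
    # Pick up the item started on the first page and finish filling it
    item = response.meta['item']
    item['content'] = response.xpath('//div[@class="content"]/text()').extract_first()
    yield item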
