
Crawling Stock Data with Scrapy


Over the weekend I looked into the Scrapy framework and used it to rewrite the stock crawler I had previously built with requests + bs4 + re (http://www.cnblogs.com/wyfighting/p/7497985.html).
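For context, here is a minimal sketch of how a project like this is scaffolded with Scrapy's command-line tool. The project name BaiduStocks is taken from the ITEM_PIPELINES entry further down; the generated spider stub is then replaced by the stocks.py shown below:

scrapy startproject BaiduStocks
cd BaiduStocks
scrapy genspider stocks quote.eastmoney.com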

Directory structure:

[screenshot: project directory tree]
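For reference, the layout that scrapy startproject generates looks roughly like this (exact files vary slightly between Scrapy versions; stocks.py is the spider added under spiders/):

BaiduStocks/
    scrapy.cfg
    BaiduStocks/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            stocks.py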

Code for stocks.py:

# -*- coding: utf-8 -*-
import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        # Walk every link on the listing page and keep only hrefs that
        # contain a stock code such as sh600000 or sz000001.
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        # The detail page lists fields as <dt> labels paired with <dd>
        # values inside the .stock-bets block.
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val

        infoDict.update(
            {'股票名稱': re.findall(r'\s.*\(', name)[0].split()[0] +
                          re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
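The link filter in parse() hinges on the regular expression r"[s][hz]\d{6}": a literal s, then h or z (Shanghai or Shenzhen), then exactly six digits. A quick standalone check with made-up hrefs:

import re

# Hypothetical hrefs, shaped like the links on the listing page.
hrefs = ["http://quote.eastmoney.com/sh600000.html",
         "http://quote.eastmoney.com/about.html"]
for href in hrefs:
    print(re.findall(r"[s][hz]\d{6}", href))  # ['sh600000'], then []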

Code for pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline(object):
    # Writes each scraped item to a text file, one dict per line.
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
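Because BaidustocksInfoPipeline relies only on the open_spider/process_item/close_spider hooks, it can be smoke-tested without running a crawl. A minimal sketch with a made-up item (the spider argument is unused by these methods, so None is passed):

pipeline = BaidustocksInfoPipeline()
pipeline.open_spider(None)
pipeline.process_item({'股票名稱': '浦發銀行sh600000', '最高': '12.80'}, None)
pipeline.close_spider(None)
# BaiduStockInfo.txt now contains one str(dict) line per item.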

The modified section of settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}
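The integer 300 is the pipeline's order: Scrapy runs every enabled pipeline in ascending order of these values, which conventionally range from 0 to 1000. With the pipeline registered, the crawl is started from the project root:

scrapy crawl stocks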
