Crawling with Scrapy and persisting the results through item pipelines
阿新 · Published: 2019-05-10
We do the crawling in PyCharm.
First, open a command prompt in the target folder (or use PyCharm's Terminal) and create the project and spider:
scrapy startproject xiaohuaPro        # create the project
scrapy genspider xiaohua www.xxx.com  # create the spider file
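For reference, these two commands generate the standard Scrapy skeleton; the files edited in this post are marked:

xiaohuaPro/
├── scrapy.cfg
└── xiaohuaPro/
    ├── items.py        <- item fields (section 1)
    ├── middlewares.py
    ├── pipelines.py    <- storage classes (section 2)
    ├── settings.py     <- configuration (section 3)
    └── spiders/
        └── xiaohua.py  <- the spider created by genspider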
1. First, crawl the data
import scrapy
from xiaohuaPro.items import XiaohuaproItem

class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    start_urls = ['http://www.521609.com/daxuemeinv/']
    # generic URL template for the remaining pages
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'
    pageNum = 1

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            name = li.xpath('./a[2]/text() | ./a[2]/b/text()').extract_first()
            img_url = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            # instantiate an item object and set its fields
            item = XiaohuaproItem()
            item['name'] = name
            item['img_url'] = img_url
            # submit the item to the pipelines
            yield item
        # manually send requests for the other pages (24 pages crawled)
        if self.pageNum <= 24:
            self.pageNum += 1
            new_url = self.url % self.pageNum
            yield scrapy.Request(url=new_url, callback=self.parse)
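A note on the pagination: parse yields the next-page Request with itself as the callback, so the same method parses every page. pageNum is a class attribute rather than a local variable, which lets the counter survive between callbacks; the crawl stops once it passes page 24.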
Next, define the item's fields in the items.py file.
All of the scraped information is stored as fields of the item.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class XiaohuaproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    img_url = scrapy.Field()
2. Writing pipelines.py
First, write the data to a custom file:
# Purpose: persist the parsed data to some storage back end.
import pymysql
from redis import Redis

class XiaohuaproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started!')
        self.fp = open('./xiaohua.txt', 'w', encoding='utf-8')

    # Performs the persistence step.
    # The item parameter receives the item objects submitted by the spider file.
    # The method is called once per received item (i.e. many times).
    def process_item(self, item, spider):
        name = item['name']
        img_url = item['img_url']
        self.fp.write(name + ':' + img_url + '\n')
        # The return value hands the item to the next pipeline class in line.
        return item

    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()
To write into the database, we first have to create a table there (start both MySQL and Redis). A minimal table-creation sketch follows.
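This one-off setup script is a sketch, not part of the project; the connection parameters mirror the MysqlPipeline below, and the column widths are assumptions, so size them to fit your data.

import pymysql

# One-off setup: create the `xiaohua` table used by MysqlPipeline.
# The varchar widths are assumptions -- adjust them to your data.
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       password='123', db='test', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists xiaohua ('
               'name varchar(100), img_url varchar(255)) charset utf8')
conn.commit()
cursor.close()
conn.close()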
class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # If a column refuses to store Chinese text, fix its charset:
        # alter table tableName convert to charset utf8;
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123', db='test', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into xiaohua values ("%s","%s")'
                                % (item['name'], item['img_url']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
In the same file, create a Redis class that writes the data:
class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'img_url': item['img_url']
        }
        print(str(dic))
        self.conn.lpush('xiaohua', str(dic))
        return item

    def close_spider(self, spider):
        pass
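To confirm the Redis pipeline stored the items, you can read the list back afterwards; a minimal sketch:

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
# lrange(key, 0, -1) returns the whole list; redis-py returns bytes.
for raw in conn.lrange('xiaohua', 0, -1):
    print(raw.decode('utf-8'))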
3. Changing the configuration in settings.py
Add the following lines (and switch ROBOTSTXT_OBEY to False):

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # changed to False
ITEM_PIPELINES = {
    'xiaohuaPro.pipelines.XiaohuaproPipeline': 300,  # file storage
    # 'xiaohuaPro.pipelines.MysqlPipeline': 301,     # database storage
    # 'xiaohuaPro.pipelines.RedisPipeline': 302,     # Redis storage
}
# The numbers are priorities: the lower the value, the earlier that
# pipeline's process_item runs, which is why each one must return item.
LOG_LEVEL = 'ERROR'
# CRITICAL -- severe errors
# ERROR    -- ordinary errors
# WARNING  -- warnings
# INFO     -- general information
# DEBUG    -- debugging information
Then we launch the spider from the terminal:
scrapy crawl <name>   (the value of the spider's name attribute, so here: scrapy crawl xiaohua)
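With XiaohuaproPipeline enabled, the finished crawl leaves a ./xiaohua.txt file holding one name:img_url line per scraped entry; uncomment the MySQL or Redis entries in ITEM_PIPELINES to route the same items into those stores as well.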