Ways to store data scraped with a Python crawler

阿新 • Published: 2018-12-21

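MySQL:

First, register the pipeline in the settings file:

ITEM_PIPELINES = {'firstbloodpro.pipelines.MysqlproPipeline': 300}

Second, add the connection parameters (host, user name, and so on) to the settings file: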

HOST = 'localhost'
PORT = 3306
USER = 'root'
PWD = '123456'
DB = 'lala'
CHARSET = 'utf8'

In the pipeline file:

import pymysql
from scrapy.utils.project import get_project_settings


class MysqlproPipeline(object):

    def open_spider(self, spider):
        # settings behaves like a dict whose keys are the configuration options
        settings = get_project_settings()
        self.db = pymysql.connect(
            host=settings['HOST'],
            port=settings['PORT'],
            user=settings['USER'],
            password=settings['PWD'],
            database=settings['DB'],
            charset=settings['CHARSET'],
        )

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        self.save_to_mysql(item)
        return item

    def save_to_mysql(self, item):
        # Get a cursor
        cursor = self.db.cursor()
        # Build the SQL statement; the driver fills in and escapes the %s placeholders
        sql = ('insert into haha(face, name, age, content, haha_count, ping_count) '
               'values (%s, %s, %s, %s, %s, %s)')
        params = (item['face'], item['name'], item['age'], item['content'],
                  item['haha_count'], item['ping_count'])
        # Execute the SQL statement
        try:
            cursor.execute(sql, params)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()
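The MySQL pipeline lists its six columns by hand. As a rough sketch (not part of the original project), a small helper could build the same kind of statement from a column list, leaving the values to the driver's %s placeholders. The name build_insert and the sample item are hypothetical, made up for illustration only.

```python
def build_insert(table, item, columns):
    """Return (sql, params) for a parameterized INSERT.

    table and columns must come from trusted code, never from scraped data,
    because identifiers cannot be passed as query parameters.
    """
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "insert into {}({}) values ({})".format(table, ", ".join(columns), placeholders)
    params = tuple(item[col] for col in columns)
    return sql, params


# Hypothetical item with the six fields used by the haha table.
item = {
    "face": "a.jpg", "name": 'Bob "the" Builder', "age": 30,
    "content": "hello", "haha_count": 5, "ping_count": 2,
}
columns = ["face", "name", "age", "content", "haha_count", "ping_count"]
sql, params = build_insert("haha", item, columns)
print(sql)
# Inside the pipeline this would be passed on as: cursor.execute(sql, params)
```

Because the values travel separately from the SQL text, a double quote inside a scraped field (as in the sample name) does not break the statement.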

MongoDB:

1. Settings file:

ITEM_PIPELINES = {'firstbloodpro.pipelines.MongodbproPipeline': 300}

2. Pipeline file:

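import pymongo


class MongodbproPipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host='localhost', port=27017)

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Select the database (xxx is a placeholder name)
        db = self.client.xxx
        # Select the collection (xxxx is a placeholder name)
        col = db.xxxx
        # Convert the item to a dict
        dic = dict(item)
        # insert_one replaces Collection.insert, which newer PyMongo versions removed
        col.insert_one(dic)
        return item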
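The MongoDB pipeline hard-codes the host and port, while the MySQL one reads them from the settings file. One possible variation, sketched below, reads the MongoDB parameters from settings through Scrapy's from_crawler hook. The setting names MONGO_HOST, MONGO_PORT and MONGO_DB, the class name, and the choice of spider.name as the collection name are all assumptions, not something the original project defines.

```python
from types import SimpleNamespace


class MongoSettingsPipeline(object):
    """Sketch: same pipeline shape, but connection details come from settings."""

    def __init__(self, host, port, db_name):
        self.host = host
        self.port = port
        self.db_name = db_name

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler (when defined) to build the pipeline.
        # MONGO_HOST, MONGO_PORT and MONGO_DB are hypothetical setting names.
        s = crawler.settings
        return cls(
            host=s.get("MONGO_HOST", "localhost"),
            port=s.get("MONGO_PORT", 27017),
            db_name=s.get("MONGO_DB", "xxx"),
        )

    def open_spider(self, spider):
        import pymongo  # imported here so the sketch loads without PyMongo installed
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        # Assumption: one collection per spider, named after the spider.
        self.col = self.client[self.db_name][spider.name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.col.insert_one(dict(item))
        return item


# Stand-in for a Scrapy crawler, only to show how the settings are read.
fake_crawler = SimpleNamespace(settings={"MONGO_HOST": "127.0.0.1", "MONGO_DB": "lala"})
pipeline = MongoSettingsPipeline.from_crawler(fake_crawler)
print(pipeline.host, pipeline.port, pipeline.db_name)
```

Settings that are missing fall back to the defaults passed to get, so the port here stays at 27017.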

SQLite:

In the pipeline file:

import sqlite3


class Sqlite3proPipeline(object):

    def open_spider(self, spider):
        # The database file name must be a string
        self.db = sqlite3.connect('home.db')
        self.cur = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        self.save_to_sqlite(item)
        return item

    def save_to_sqlite(self, item):
        # sqlite3 uses ? placeholders and escapes the values itself
        sql = ('insert into dameo(city, title, rentway, price, housetype, area, address, traffic) '
               'values (?, ?, ?, ?, ?, ?, ?, ?)')
        params = (item['city'], item['title'], item['rentway'], item['price'],
                  item['housetype'], item['area'], item['address'], item['traffic'])
        try:
            self.cur.execute(sql, params)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

In the settings file:

ITEM_PIPELINES = {'firstbloodpro.pipelines.Sqlite3proPipeline': 300}
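The SQLite pipeline assumes that the dameo table already exists. The following is a minimal, self-contained sketch of creating it and inserting one row with ? placeholders. The schema (every column as text), the in-memory database, and the sample item are assumptions made for illustration.

```python
import sqlite3

COLUMNS = ("city", "title", "rentway", "price", "housetype", "area", "address", "traffic")

# ":memory:" keeps the demo self-contained; the pipeline uses the file home.db.
db = sqlite3.connect(":memory:")
cur = db.cursor()

# Assumed schema: every column stored as text.
cur.execute("create table if not exists dameo ({})".format(", ".join(c + " text" for c in COLUMNS)))

# Hypothetical item; the quotes in the title would break a string-formatted statement.
item = {
    "city": "Beijing", "title": 'Bright "loft" near subway', "rentway": "whole",
    "price": "5000", "housetype": "2 rooms", "area": "60", "address": "Chaoyang", "traffic": "Line 10",
}

sql = "insert into dameo({}) values ({})".format(", ".join(COLUMNS), ", ".join("?" * len(COLUMNS)))
cur.execute(sql, tuple(item[c] for c in COLUMNS))
db.commit()

rows = cur.execute("select city, title from dameo").fetchall()
print(rows)
db.close()
```

In a real pipeline the create table statement could run once in open_spider, right after the connection is opened.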

Redis:

In the settings file:

Replace everything below DOWNLOAD_DELAY = 3 with the following:

# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Queue class used to order the URLs to crawl.
# Default: ordered by priority (the Scrapy default), implemented with a sorted set; neither FIFO nor LIFO.
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

# Optional: first in, first out (FIFO)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'

# Optional: last in, first out (LIFO)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'

# Keep the queues used by scrapy-redis in Redis so a crawl can be paused and resumed later, i.e. do not clear the Redis queues
SCHEDULER_PERSIST = True

# Only takes effect with SpiderQueue or SpiderStack; the maximum idle time before the spider closes
# SCHEDULER_IDLE_BEFORE_CLOSE = 10

# With RedisPipeline configured, items are written to the Redis list whose key is spider.name:items, for later distributed processing
# scrapy-redis already implements this, so we do not need to write any code
ITEM_PIPELINES = {
    'posted.pipelines.PostedPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Redis connection parameters
# REDIS_PASS is a Redis connection password that I added myself (not set by default)
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
#REDIS_PASS = '[email protected]'

# Log level
LOG_LEVEL = 'DEBUG'

# By default RFPDupeFilter logs only the first duplicate request. Setting DUPEFILTER_DEBUG to True logs every duplicate request.
DUPEFILTER_DEBUG = True

# Override the default request headers; you can also write your own Downloader Middlewares to set a proxy and User-Agent
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, sdch'
}
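With RedisPipeline enabled, the items end up in a Redis list and still have to be read back by something. Below is a rough sketch of such a consumer, assuming the default scrapy-redis behaviour of serializing each item as JSON under the key spider.name:items. The function names decode_item and consume, and the sample entry, are hypothetical; the redis-py package is only needed once consume is actually called.

```python
import json


def decode_item(raw):
    """Turn one entry of the <spider>:items list back into a dict."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


def consume(spider_name, handle, host="127.0.0.1", port=6379):
    """Block on the items list and pass each decoded item to handle()."""
    import redis  # redis-py; imported here so decode_item works without it
    client = redis.Redis(host=host, port=port)
    key = "{}:items".format(spider_name)
    while True:
        # blpop returns a (key, value) pair once an entry is available.
        _, raw = client.blpop(key)
        handle(decode_item(raw))


# What one list entry might look like (hypothetical fields):
sample = b'{"title": "hello", "url": "http://example.com"}'
decoded = decode_item(sample)
print(decoded["title"])
```

A worker on any machine could then run consume with the spider's name and a handler such as one of the save functions from the database pipelines.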
