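Download Cache
Caching and Persistence in Python
Caching can be seen as a subset of persistence, but a cache has its own expiry policies and cache levels, whereas persistence generally has no notion of expiry. Neither caching nor persistence is specific to Python crawlers; other languages deal with them too. We discuss the two together here because this article is about persistent caches. The in-memory tier of a multi-level caching strategy is outside its scope.
Common local-disk file persistence
In Python, common local-disk file persistence includes plain files, DBM files, Pickle-serialized objects, and shelve key-value stores of serialized objects. When writing a crawler, we can weigh these against our own needs and pick a suitable cache design or persistence method. Examples of the common local-disk file approaches follow.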
'''
Python3 demo of common local-disk file persistence
'''
import dbm
import pickle
import shelve


class NormalFilePersistence(object):
    '''
    Persistence (or persistent caching) in a plain file
    '''
    def save(self, data):
        with open('NormalFilePersistence.txt', 'w') as open_file:
            open_file.write(data)

    def load(self):
        with open('NormalFilePersistence.txt', 'r') as open_file:
            return open_file.read()


class DBMPersistence(object):
    '''
    Persistence (or persistent caching) of string key-value pairs in DBM
    '''
    def save(self, key, value):
        # 'c' opens for reading and writing, creating the file if needed.
        with dbm.open('DBMPersistence', 'c') as dbm_file:
            dbm_file[key] = str(value)

    def load(self, key):
        with dbm.open('DBMPersistence', 'r') as dbm_file:
            value = dbm_file.get(key)
        # DBM returns bytes; decode to str, or return None for a missing key.
        return value.decode() if value is not None else None


class PicklePersistence(object):
    '''
    Persistence (or persistent caching) of a complex object serialized
    to a file with Pickle
    '''
    def save(self, obj):
        with open('PicklePersistence', 'wb') as pickle_file:
            pickle.dump(obj, pickle_file)

    def load(self):
        with open('PicklePersistence', 'rb') as pickle_file:
            return pickle.load(pickle_file)


class ShelvePersistence(object):
    '''
    Shelve combines DBM and Pickle: complex objects are serialized
    to a file as key-value pairs
    '''
    def save(self, key, obj):
        with shelve.open('ShelvePersistence') as shelve_file:
            shelve_file[key] = obj

    def load(self, key):
        with shelve.open('ShelvePersistence') as shelve_file:
            # Returns None when the key is absent.
            return shelve_file.get(key)


if __name__ == '__main__':
    t_normal = NormalFilePersistence()
    t_normal.save('Test NormalFilePersistence')
    print('NormalFilePersistence load: ' + t_normal.load())

    t_dbm = DBMPersistence()
    t_dbm.save('user', 'GJRS')
    t_dbm.save('age', 27)
    print('DBMPersistence load: ' + str(t_dbm.load('user')))
    print('DBMPersistence load: ' + str(t_dbm.load('address')))

    t_pickle = PicklePersistence()
    obj = {'name': 'GJRS', 'age': 27, 'skills': ['Android', 'C', 'Python', 'Web']}
    t_pickle.save(obj)
    print('PicklePersistence load: ' + str(t_pickle.load()))

    t_shelve = ShelvePersistence()
    obj1 = {'name': 'WL', 'age': 27, 'skills': ['Test', 'AutoTest']}
    obj2 = {'name': 'GJRS', 'age': 27, 'skills': ['Android', 'C', 'Python', 'Web']}
    t_shelve.save('obj1', obj1)
    t_shelve.save('obj2', obj2)
    print('ShelvePersistence load: ' + str(t_shelve.load('obj1')))
    print('ShelvePersistence load: ' + str(t_shelve.load('objn')))
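One detail of DBM files is easy to miss: keys and values are stored as bytes, so a string written to the file comes back as bytes and has to be decoded. The following is a minimal sketch of that round trip. The temporary directory and the file name demo_dbm are assumptions chosen only so the example leaves no files behind.

```python
import dbm
import os
import tempfile

# Hypothetical temporary location so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_dbm")
    with dbm.open(path, "c") as db:
        db["user"] = "GJRS"          # str keys and values are encoded on write
        raw = db["user"]             # values come back as bytes
        user = raw.decode("utf-8")   # decode to get the original string
        missing = db.get("address")  # get() returns None for an absent key

print(type(raw).__name__, user, missing)
```

Because of this, DBM suits simple string data. For anything structured, Pickle or shelve is usually the more convenient choice.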
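Checking the disk cache
Goal: make caching more efficient, cut unnecessary disk reads and writes, and reduce load on the server.
Method: before caching, check whether a local copy already exists and whether the data has been updated, including whether any items were added or removed. If the data has not changed, do not download and cache it again; otherwise update the cached data.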
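The check described above can be sketched as a small shelve-backed cache. This is an illustration under stated assumptions, not a definitive design: the names DiskCache, fetch_with_cache and fake_download, the one-hour default expiry, and the use of a SHA-256 digest to detect changed content are all hypothetical choices. The downloader is passed in as a function so the example runs without network access.

```python
import hashlib
import os
import shelve
import tempfile
import time


class DiskCache:
    """Shelve-backed cache; each entry keeps a timestamp and a content hash."""

    def __init__(self, path, max_age_seconds=3600):
        self.path = path
        self.max_age_seconds = max_age_seconds

    def get(self, url):
        """Return the cached bytes, or None if the entry is missing or expired."""
        with shelve.open(self.path) as db:
            entry = db.get(url)
        if entry is None:
            return None
        if time.time() - entry["saved_at"] > self.max_age_seconds:
            return None
        return entry["body"]

    def put(self, url, body):
        """Store body and report whether it differs from the previous entry."""
        digest = hashlib.sha256(body).hexdigest()
        with shelve.open(self.path) as db:
            old = db.get(url)
            changed = old is None or old["digest"] != digest
            db[url] = {"body": body, "digest": digest, "saved_at": time.time()}
        return changed


def fetch_with_cache(url, download, cache):
    """Use the cache when possible; call download(url) only on a miss."""
    body = cache.get(url)
    if body is None:
        body = download(url)
        cache.put(url, body)
    return body


# Usage with a fake downloader, so no network access is needed.
calls = []


def fake_download(url):
    calls.append(url)
    return b"<html>demo page</html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = DiskCache(os.path.join(tmp_dir, "page_cache"))
    first = fetch_with_cache("http://example.com/", fake_download, cache)
    second = fetch_with_cache("http://example.com/", fake_download, cache)
    # Storing identical content again is reported as "not changed".
    unchanged = not cache.put("http://example.com/", first)

print(len(calls), first == second, unchanged)
```

The second fetch is served from disk, so the downloader runs only once. The return value of put lets a crawler skip follow-up work, such as re-parsing, when a page's content is identical to the stored copy.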
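Saving disk space
Compressing the data reduces disk usage. The drawback is that compression costs some time.
It is enough to compress the data just before it is written to disk.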
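import pickle
import zlib

# Compress with zlib before writing (fp must be opened in binary mode)
fp.write(zlib.compress(pickle.dumps(result)))
# Decompress when loading
pickle.loads(zlib.decompress(fp.read()))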
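To see the effect of compression end to end, here is a small self-contained round trip in memory. The result dictionary is made-up sample data; repetitive text such as HTML tends to compress well, so real savings depend on the content.

```python
import pickle
import zlib

# Hypothetical crawl result with highly repetitive content.
result = {"url": "http://example.com/", "html": "<div>row</div>" * 1000}

raw = pickle.dumps(result)                         # serialize to bytes
packed = zlib.compress(raw)                        # compress before saving
restored = pickle.loads(zlib.decompress(packed))   # reverse both steps on load

print(len(raw), len(packed), restored == result)
```

Note that pickle.loads should only be used on data you wrote yourself, since unpickling untrusted bytes can execute arbitrary code.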
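Common database approaches
The previous section covered common local-disk file persistence. A natural question follows: what if the data is huge and complex? Sticking with local files would be painful. So we now turn to the other kind of persistent caching for Python crawlers: database persistence.
SQLite persistence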
'''
Python3 demo of sqlite3 database persistence
'''
import sqlite3


class Sqlite3Persistence(object):
    def __init__(self):
        self.db = None

    def connect(self):
        try:
            self.db = sqlite3.connect("Sqlite3Persistence.db")
            sql_create_table = """CREATE TABLE IF NOT EXISTS `DemoTable` (
                                    `id` INTEGER PRIMARY KEY AUTOINCREMENT,
                                    `name` CHAR(512) NOT NULL,
                                    `content` TEXT NOT NULL)"""
            self.db.execute(sql_create_table)
        except Exception as e:
            print("sqlite3 connect failed." + str(e))

    def close(self):
        try:
            if self.db is not None:
                self.db.close()
        except BaseException as e:
            print("sqlite3 close failed." + str(e))

    def insert_table_dict(self, dict_data=None):
        if dict_data is None:
            return False
        try:
            # Column names come from the dict keys; values are bound as
            # parameters instead of being formatted into the SQL string.
            cols = ', '.join(dict_data.keys())
            placeholders = ', '.join('?' * len(dict_data))
            sql_insert = "INSERT INTO `DemoTable`(%s) VALUES (%s)" % (cols, placeholders)
            self.db.execute(sql_insert, tuple(dict_data.values()))
            self.db.commit()
        except BaseException as e:
            self.db.rollback()
            print("sqlite3 insert error." + str(e))
            return False
        return True

    def get_dict_by_name(self, name=None):
        if name is None:
            cursor = self.db.execute("SELECT * FROM `DemoTable`")
        else:
            cursor = self.db.execute("SELECT * FROM `DemoTable` WHERE name = ?", (name,))
        ret_list = list()
        for row in cursor:
            ret_list.append({'id': row[0], 'name': row[1], 'content': row[2]})
        return ret_list


if __name__ == '__main__':
    t_sqlite3 = Sqlite3Persistence()
    t_sqlite3.connect()
    t_sqlite3.insert_table_dict({'name': 'Test1', 'content': 'XXXXXXXXXXXXX'})
    t_sqlite3.insert_table_dict({'name': 'Test2', 'content': 'vvvvvvvvvvvv'})
    t_sqlite3.insert_table_dict({'name': 'Test3', 'content': 'qqqqqqqqqqqq'})
    t_sqlite3.insert_table_dict({'name': 'Test4', 'content': 'wwwwwwwwwwwww'})
    print('Sqlite3Persistence get Test2: ' + str(t_sqlite3.get_dict_by_name('Test2')))
    print('Sqlite3Persistence get All: ' + str(t_sqlite3.get_dict_by_name()))
    t_sqlite3.close()
MySQL persistence
'''
Python3 demo of MySQL database persistence
'''
import pymysql


class MySQLPersistence(object):
    def __init__(self):
        self.db = None
        self.cursor = None

    def connect(self):
        try:
            # Keyword arguments are used because recent PyMySQL versions
            # no longer accept positional connection arguments.
            self.db = pymysql.connect(host="localhost", user="yanbober",
                                      password="TQJJtaJWNbGAMU44",
                                      database="database_yan_php",
                                      charset="utf8")
            self.cursor = self.db.cursor()
            sql_create_table = """CREATE TABLE IF NOT EXISTS `StudentTable` (
                                    `id` int(11) NOT NULL AUTO_INCREMENT,
                                    `name` varchar(512) COLLATE utf8_bin NOT NULL,
                                    `content` TEXT COLLATE utf8_bin NOT NULL,
                                    PRIMARY KEY (`id`))
                                    ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1"""
            self.cursor.execute(sql_create_table)
        except Exception as e:
            print("mysql connect failed." + str(e))

    def close(self):
        try:
            # Close the cursor before the connection it belongs to.
            if self.cursor is not None:
                self.cursor.close()
            if self.db is not None:
                self.db.close()
        except BaseException as e:
            print("mysql close failed." + str(e))

    def insert_table_dict(self, dict_data=None):
        if self.db is None or self.cursor is None:
            print('Please ensure you have connected to mysql server!')
            return False
        if dict_data is None:
            return False
        try:
            # Values are bound as parameters rather than formatted into the SQL.
            cols = ', '.join(dict_data.keys())
            placeholders = ', '.join(['%s'] * len(dict_data))
            sql_insert = "INSERT INTO `StudentTable`(%s) VALUES (%s)" % (cols, placeholders)
            self.cursor.execute(sql_insert, tuple(dict_data.values()))
            self.db.commit()
        except BaseException as e:
            self.db.rollback()
            print("mysql insert error." + str(e))
            return False
        return True

    def get_dict_by_name(self, name=None):
        if self.db is None or self.cursor is None:
            print('Please ensure you have connected to mysql server!')
            return None
        if name is None:
            self.cursor.execute("SELECT * FROM `StudentTable`")
        else:
            self.cursor.execute("SELECT * FROM `StudentTable` WHERE name=%s", (name,))
        ret_list = list()
        for item in self.cursor.fetchall():
            ret_list.append({'id': item[0], 'name': item[1], 'content': item[2]})
        return ret_list


if __name__ == '__main__':
    t_mysql = MySQLPersistence()
    t_mysql.connect()
    t_mysql.insert_table_dict({'name': 'Test1', 'content': 'XXXXXXXXXXXXX'})
    t_mysql.insert_table_dict({'name': 'Test2', 'content': 'vvvvvvvvvvvv'})
    t_mysql.insert_table_dict({'name': 'Test3', 'content': 'qqqqqqqqqqqq'})
    t_mysql.insert_table_dict({'name': 'Test4', 'content': 'wwwwwwwwwwwww'})
    print('MySQLPersistence get Test2: ' + str(t_mysql.get_dict_by_name('Test2')))
    print('MySQLPersistence get All: ' + str(t_mysql.get_dict_by_name()))
    t_mysql.close()
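MongoDB persistence
The sections above covered the relational databases MySQL and SQLite in Python 3.x. Next come the non-relational databases commonly used in Python 3.x crawlers, starting with MongoDB. It is a database built on distributed file storage, created to give web applications scalable, high-performance data storage. It sits somewhere between relational and non-relational databases, and it is often described as the most feature-rich of the non-relational databases and the one most like a relational database.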
'''
Python3 demo of MongoDB database persistence
'''
import pymongo


class MongoDBPersistence(object):
    def __init__(self):
        self.conn = None
        self.database = None

    def connect(self, database):
        try:
            self.conn = pymongo.MongoClient('mongodb://localhost:27017/')
            self.database = self.conn[database]
        except Exception as e:
            print("MongoDB connect failed." + str(e))

    def close(self):
        try:
            if self.conn is not None:
                self.conn.close()
        except BaseException as e:
            print("MongoDB close failed." + str(e))

    def insert_table_dict(self, dict_data=None):
        if self.conn is None or self.database is None:
            print('Please ensure you have connected to MongoDB server!')
            return False
        if dict_data is None:
            return False
        try:
            collection = self.database['DemoTable']
            # insert_one replaces the deprecated Collection.save,
            # which was removed in PyMongo 4.
            collection.insert_one(dict_data)
        except BaseException as e:
            print("MongoDB insert error." + str(e))
            return False
        return True

    def get_dict_by_name(self, name=None):
        if self.conn is None or self.database is None:
            print('Please ensure you have connected to MongoDB server!')
            return None
        collection = self.database['DemoTable']
        if name is None:
            documents = collection.find()
        else:
            documents = collection.find({"name": name})
        document_list = list()
        for document in documents:
            document_list.append(document)
        return document_list


if __name__ == '__main__':
    t_mongo = MongoDBPersistence()
    t_mongo.connect("DemoDatabase")
    t_mongo.insert_table_dict({'name': 'Test1', 'content': 'XXXXXXXXXXXXX'})
    t_mongo.insert_table_dict({'name': 'Test2', 'content': 'vvvvvvvvvvvv'})
    t_mongo.insert_table_dict({'name': 'Test3', 'content': 'qqqqqqqqqqqq'})
    t_mongo.insert_table_dict({'name': 'Test4', 'content': 'wwwwwwwwwwwww'})
    print('MongoDBPersistence get Test2: ' + str(t_mongo.get_dict_by_name('Test2')))
    print('MongoDBPersistence get All: ' + str(t_mongo.get_dict_by_name()))
    t_mongo.close()
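Choosing a persistence method for crawlers, and summary
Tip: before persisting cached data, large text payloads can be compressed first to save storage.
After this tour of common Python 3.x persistence methods, we should at least know that a crawler has many options when it needs a persistent cache. Which one to pick depends on what the crawler needs, and different requirements may call for different kinds of persistence. Still, there is a guiding strategy: be clear about the strengths and weaknesses of each option above.
For local-file persistence the trade-offs are obvious. Some of the methods above support serialized storage, and some support multiple key-value pairs in a single file. But once the data grows large, local file storage becomes inefficient, is prone to data corruption, and is very awkward to back up. It is suitable only for lightweight, local storage of a single data format, which makes it a good fit for the small crawlers we write for ourselves.
SQLite can essentially be regarded as a relational upgrade of local-file storage. It effectively addresses the weaknesses of storing relational data in local disk files, but as a single-machine, miniature database it still hits limits in data volume and fault tolerance. When choosing between local files and SQLite, I think the key question is how the useful crawled data relates to itself: if the records will be closely linked and need to be looked up through one another, SQLite seems the better choice.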
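To make the point about related data concrete, here is a minimal sketch of the kind of lookup that is awkward with flat files but a single query in SQLite. The pages and links tables, their columns, and the sample URLs are hypothetical, and the database lives in memory purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL, title TEXT);
    CREATE TABLE links (
        from_page INTEGER NOT NULL REFERENCES pages(id),
        to_page   INTEGER NOT NULL REFERENCES pages(id)
    );
""")
conn.executemany(
    "INSERT INTO pages (id, url, title) VALUES (?, ?, ?)",
    [(1, "http://example.com/", "Home"),
     (2, "http://example.com/a", "Page A"),
     (3, "http://example.com/b", "Page B")],
)
conn.executemany(
    "INSERT INTO links (from_page, to_page) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 3)],
)

# Which crawled pages link to Page B? A join over the two tables answers it.
rows = conn.execute(
    """
    SELECT src.title
    FROM links
    JOIN pages AS src ON src.id = links.from_page
    JOIN pages AS dst ON dst.id = links.to_page
    WHERE dst.url = ?
    ORDER BY src.id
    """,
    ("http://example.com/b",),
).fetchall()
conn.close()

titles = [title for (title,) in rows]
print(titles)
```

With pickle or shelve files, the same question would mean loading every record and following the references by hand.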
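The pros and cons of relational databases such as MySQL versus non-relational databases such as MongoDB have been discussed online for years, but for a crawler the choice still comes down to our own requirements. Relational storage offers highly structured data and a structured query language, with data and relationships stored in separate tables. Non-relational storage typically offers high availability, high performance, and high scalability; it has no declarative query language; it stores data as key-value pairs, columns, documents, or graphs; and the stored data is unpredictable in shape and has no fixed structure. Crawlers are often used to collect massive amounts of data in one vertical domain for modeling and analysis, and in that case MongoDB is the better fit for storing the crawled data. At other times the crawled data is highly structured and interrelated, and we want to expose it to other platforms through an API; there MySQL seems a good choice.
In short, choosing caching and persistence for a Python 3.x crawler depends on our requirements, and in some cases several persistence methods may even be combined. What matters is knowing which options exist; only then can we handle whatever requirement comes along.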