Hash-based deduplication in Python

阿新 • Published: 2018-11-11

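There are 3,000 records to insert into a database. To deduplicate them, each record's link is hashed with MD5 (a digest, not encryption):

hashcode = hashlib.md5(str(item_url).encode('utf-8')).hexdigest()
The hashcode column is then given a UNIQUE index in the database.

Inserting all 3,000 records took 32 seconds.

Without the MD5 step, the insert took 30 seconds. (https://www.cnblogs.com/xuchunlin/p/8616604.html)

Conclusion: MD5-based deduplication has little effect on insert time.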
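
The approach described above can be sketched roughly as follows. The post does not say which database was used, so this sketch assumes SQLite through the standard-library sqlite3 module; the table name `items`, the `url` column, and the helpers `url_hash` and `insert_unique` are hypothetical names chosen for illustration.

```python
import hashlib
import sqlite3


def url_hash(item_url):
    """Return the hex MD5 digest of a URL, used only as a dedup key."""
    return hashlib.md5(str(item_url).encode("utf-8")).hexdigest()


def insert_unique(conn, urls):
    """Insert URLs and let the UNIQUE index silently reject duplicates.

    Returns the number of rows stored in the table afterwards.
    """
    for url in urls:
        # INSERT OR IGNORE skips rows that would violate the UNIQUE index.
        conn.execute(
            "INSERT OR IGNORE INTO items (hashcode, url) VALUES (?, ?)",
            (url_hash(url), url),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


# Hypothetical schema: hashcode carries the UNIQUE index from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (hashcode TEXT UNIQUE, url TEXT)")

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
print(insert_unique(conn, urls))  # 2 rows stored; the repeated URL is ignored
```

Other databases spell the "skip on conflict" clause differently, so the SQL statement here should be read as SQLite-specific.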

https://blog.csdn.net/Mao_code/article/details/53976511

https://blog.csdn.net/sangky/article/details/80931040

https://www.aliyun.com/jiaocheng/445004.html

https://www.cnblogs.com/renyuanjun/p/5562084.html

https://blog.csdn.net/katrina1rani/article/details/80907910

https://blog.csdn.net/yangczcsdn/article/details/81327091

https://blog.csdn.net/idkevin/article/details/47444237 (a neat use of set for deduplication in Python)

http://outofmemory.cn/code-snippet/1191/Python-usage-hashlib-module-do-string-jiami

Usage examples for the hashlib and base64 modules in Python (https://www.jb51.net/article/54631.htm)

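Learning Python, part 11: hashlib
hashlib is Python's standard-library module for computing message digests. These are one-way hashes, not reversible encryption, so there is no matching "decrypt" step. It provides md5, sha1, sha224, sha256, sha384, and sha512, among others.
What is a digest algorithm? Also called a hash algorithm, it uses a function to convert data of any length into a fixed-length value (usually written as a hexadecimal string).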

(https://blog.csdn.net/lyffly2011/article/details/50733830)
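
To make the "fixed length" property concrete, here is a minimal sketch that hashes inputs of very different sizes and prints the digest length each time. The helper names `md5_hex` and `sha1_hex` are made up for this illustration.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a str, encoded as UTF-8 first."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha1_hex(text):
    """Hex SHA-1 digest of a str, encoded as UTF-8 first."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Inputs of 1, 11, and 10,000 characters all give digests of the same length:
# 32 hex characters for MD5 and 40 for SHA-1.
for text in ["a", "hello world", "x" * 10000]:
    print(len(text), len(md5_hex(text)), len(sha1_hex(text)))
```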

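Function
Computes the MD5 digest of a username and password concatenated together.

import hashlib

def calc_md5(username, password):
    md5 = hashlib.md5()
    str_dd = username + password
    # hashlib works on bytes, so encode the string first
    md5.update(str_dd.encode('utf-8'))
    return md5.hexdigest()

Test source code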

import hashlib

test_string = '123456'

md5 = hashlib.md5()
md5.update(test_string.encode('utf-8'))
md5_encode = md5.hexdigest()
print(md5_encode)

sha1 = hashlib.sha1()
sha1.update(test_string.encode('utf-8'))
sha1_encode = sha1.hexdigest()
print(sha1_encode)

Output:
e10adc3949ba59abbe56e057f20f883e
7c4a8d09ca3762af61e59520943dc26494f8941b

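To compare files, first reduce each file to a hash value with a hash algorithm, then compare the hash values. Identical files always give identical hashes, and different files give different hashes except in the rare case of a collision. This works for images, txt files, and other file types.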

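import hashlib
password = 'password'
# Hash with MD5, seeding the hasher with a fixed salt
hasher = hashlib.md5(b'j#$%^&;FD')
# hasher = hashlib.md5(b'password')
# update() appends to the salt, so the digest covers salt + password
hasher.update(password.encode('utf-8'))
haword = hasher.hexdigest()
print(haword)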

import hashlib

def md5sum(filename):
    # Read the whole file in binary mode and hash its contents
    with open(filename, 'rb') as file_object:
        file_content = file_object.read()
    file_md5 = hashlib.md5(file_content)
    return file_md5

if __name__ == "__main__":
    hash_text = md5sum('tt.txt')
    print(hash_text.hexdigest())
    print(len(hash_text.hexdigest()))

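Note that the file must be read in binary mode. Writing hashlib.md5(filename) would hash the file name string rather than the file's contents (and in Python 3 it raises TypeError if the name is an unencoded str).

Also, when checking a large file, reading everything in at once uses a lot of memory and hurts performance, so the file is usually read and hashed one chunk at a time.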

# MD5 of a large file, read in chunks
import hashlib
import os

def getFileMd5(filename):
    if not os.path.isfile(filename):
        return None
    myhash = hashlib.md5()
    with open(filename, 'rb') as f:
        while True:
            b = f.read(8096)
            if not b:
                break
            myhash.update(b)
    return myhash.hexdigest()
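
Putting chunked hashing together with the file comparison idea described earlier, a rough sketch of file deduplication might look like this. The function names `file_md5` and `find_duplicates` are hypothetical, and the 8096-byte chunk size simply mirrors the value used above.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=8096):
    """Hex MD5 digest of a file, read chunk by chunk to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group existing files by digest; return groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        if os.path.isfile(path):
            groups[file_md5(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Calling `find_duplicates` with a list of paths returns a list of lists, each holding paths whose contents hash to the same MD5 value. Since distinct files can in principle collide, a byte-for-byte comparison within each group is a reasonable final check when correctness matters.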


