Python: manually adding a urlretrieve download method to the requests module!
阿新 • Published: 2019-01-05
The requests module is the successor to the urllib module: for passing headers, cookies, data and so on, requests is definitely the easier one to use, but it has no counterpart to urllib.request.urlretrieve:

urlretrieve(url, filename=None, reporthook=None, params=None)

Pass in a URL and a file path and the file is downloaded. With requests you have to write the download code by hand every single time, which I find too cumbersome, and urlretrieve even has a progress callback, so I wanted to see whether I could port this urlretrieve method over to the requests module.
Key points:
1. How do you find the Python module you want? Type path at the cmd prompt, locate the Python entry, then use Ctrl+F in that folder.
2. Downloading a file boils down to: open the URL with contextlib.closing ---> open the file with with open ---> write the chunks.
3. The reporthook callback works like this: each time a chunk is written to the file, three arguments (the write count, the bytes per write, and the total size taken from the headers) are passed out for the callback to handle.
4. It turns out that packages nowadays define their functions in separate .py files and then import those functions into __init__.py.
5. How to use r.iter_content().
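On point 1: rather than hunting through the PATH output in cmd, Python itself can report where a module lives. A small sketch, using the standard library's contextlib as the example (any pure-Python module works the same way):

```python
import contextlib
import os.path

# A pure-Python module reports its own source location via __file__.
module_path = contextlib.__file__
print(module_path)

# Its parent directory is the folder to open when editing sibling
# files such as urllib/request.py or requests/api.py.
package_dir = os.path.dirname(module_path)
print(os.path.isdir(package_dir))  # True
```

The same trick locates third-party packages: import requests and print requests.__file__ to jump straight to its folder in site-packages.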
Go into the urllib folder and find the urlretrieve method in request.py; it looks like this:
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)  # parses the URL; ignore

    with contextlib.closing(urlopen(url, data)) as fp:  # open the URL
        headers = fp.info()  # response headers

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers  # ignore

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')  # open the target file
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)  # ignore
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8  # bytes per write
            size = -1
            read = 0
            blocknum = 0  # number of writes; blocknum * bs is roughly the bytes written so far
            if "content-length" in headers:
                size = int(headers["Content-Length"])  # size is the total file size

            if reporthook:
                reporthook(blocknum, bs, size)  # call the hook once before writing

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)  # write
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)  # call the hook after every write

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
By contrast, the conventional way to download a file with the requests module looks like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
with closing(requests.get(url=target, stream=True, headers=headers)) as r:
    with open('%d.jpg' % filename, 'ab+') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
Now let's wrap it up:
def urlretrieve(url, filename=None, reporthook=None, params=None):
    '''Download a file using closing and iter_content.'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
    with contextlib.closing(requests.get(url, stream=True, headers=headers, params=params)) as fp:  # open the URL
        header = fp.headers  # response headers
        with open(filename, 'wb+') as tfp:  # open the file ('w' truncates, 'a' appends)
            bs = 1024
            size = -1
            blocknum = 0
            if "content-length" in header:
                size = int(header["Content-Length"])  # nominal total file size
            if reporthook:
                reporthook(blocknum, bs, size)  # call the hook once before writing
            for chunk in fp.iter_content(chunk_size=1024):
                if chunk:
                    tfp.write(chunk)  # write
                    tfp.flush()
                    blocknum += 1
                    if reporthook:
                        reporthook(blocknum, bs, size)  # call the hook after every write
Test:
def Schedule(a, b, c):
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    sys.stdout.write("  %.2f%%  downloaded: %d  file size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url = 'https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
urlretrieve(url=url, filename='111.jpg', reporthook=Schedule)
OK, it works.
Now let's put this method into the requests module itself.
First, inside the requests folder, append the function we just wrote to the end of api.py.
Also add import contextlib to that file.
Then, in __init__.py, add urlretrieve to the names imported from api.
OK, it can now be called directly:
import requests, os, time, sys

def Schedule(a, b, c):
    per = 100.0 * a * b / c  # a: write count, b: bytes per write, c: total file size
    if per > 100:
        per = 100
    sys.stdout.write("  %.2f%%  downloaded: %d  file size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url = 'https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
requests.urlretrieve(url=url, filename='111.jpg', reporthook=Schedule)
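One caveat: editing files under site-packages is fragile, since the next pip upgrade of requests overwrites api.py and __init__.py. The same effect can be had at runtime by simply assigning the function onto the imported module (requests.urlretrieve = urlretrieve). A minimal sketch of the idea, using a stand-in module object and a placeholder function body so it runs without the real package:

```python
import types

def urlretrieve(url, filename=None, reporthook=None, params=None):
    # Placeholder body; in practice this is the requests-based
    # download function defined above.
    return filename, {}

# A module object is just a namespace, so a function can be attached
# to it at runtime; "requests_stub" here stands in for the real
# requests module.
mod = types.ModuleType("requests_stub")
mod.urlretrieve = urlretrieve

print(mod.urlretrieve("http://example.com/a.jpg", filename="111.jpg"))
# -> ('111.jpg', {})
```

With the real package it is just: import requests; requests.urlretrieve = urlretrieve, done once at program startup.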