
Python: manually adding a urlretrieve-style file download method to the requests module!

The requests module is the successor to urllib. For passing in headers, cookies, data and the like, requests is clearly the more convenient of the two, but it has no counterpart to urllib.request.urlretrieve:

urlretrieve(url, filename=None, reporthook=None, params=None)

Pass in a URL and a file path and the file is downloaded. With requests you have to hand-write that download loop every single time, which I find tedious, and urlretrieve also comes with a progress callback. So I tried to see whether this urlretrieve method could be ported over to the requests module.
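For reference, this is roughly how the stock urllib.request.urlretrieve is called (a minimal sketch; the URL, file name and show_progress helper are only placeholders):

import urllib.request

def show_progress(blocknum, bs, size):
    # blocknum chunks of bs bytes have been written so far; size is the total size
    # taken from the headers (-1 if the server did not send Content-Length)
    print("%d chunks of %d bytes written, total %d bytes" % (blocknum, bs, size))

# One call downloads the file and drives the callback.
urllib.request.urlretrieve('https://example.com/file.jpg', 'file.jpg', show_progress)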

Key points:

1. How do you find the Python module you want to edit? Type path at the cmd prompt, pick out the Python install directory, then browse to it and use Ctrl+F (a quicker alternative is sketched right after this list).

2. Downloading a file boils down to: open the page with contextlib.closing ---> open the target file with with open ---> write.

3. The reporthook callback boils down to: each time a chunk is written to the file, three arguments (the number of bytes per write, the number of writes, and the total size obtained from the headers) are passed out for the callback to handle.

4. It turns out that today's packages write their methods in separate .py files and then import them into __init__.py.

5. How r.iter_content() is used.
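For point 1, instead of hunting through the install path by hand, Python can also report a module's location directly (a small convenience sketch, not part of the original workflow):

import urllib.request
import requests

# Every imported module records the path of its source file.
print(urllib.request.__file__)   # e.g. ...\Lib\urllib\request.py
print(requests.__file__)         # e.g. ...\site-packages\requests\__init__.py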

Open the urllib package folder and find the urlretrieve method in request.py; it looks like this:


def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)                             # parses the URL; ignore

    with contextlib.closing(urlopen(url, data)) as fp:          # open the page
        headers = fp.info()                                     # the headers

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers              # ignore

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')                          # open the output file
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)     # ignore
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8                                         # bytes written per block
            size = -1
            read = 0
            blocknum = 0                                        # number of blocks written; blocknum * bs = bytes written so far
            if "content-length" in headers:
                size = int(headers["Content-Length"])           # size is the total file size

            if reporthook:
                reporthook(blocknum, bs, size)                  # run the callback once before writing

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)                                # write
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)              # run the callback after every write

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result

By comparison, the usual way to download a file with plain requests looks like this:

import requests
from contextlib import closing

# target (the URL) and filename are assumed to be defined elsewhere in the script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
with closing(requests.get(url=target, stream=True, headers=headers)) as r:
    with open('%d.jpg' % filename, 'ab+') as f:            # 'ab+' appends in binary mode
        for chunk in r.iter_content(chunk_size=1024):       # stream the body in 1 KB chunks
            if chunk:
                f.write(chunk)
                f.flush()

Now let's wrap that up as a function of our own:

import contextlib
import requests

def urlretrieve(url, filename=None, reporthook=None, params=None):
    '''Download a file via contextlib.closing and iter_content'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
    with contextlib.closing(requests.get(url, stream=True, headers=headers, params=params)) as fp:  # open the page
        header = fp.headers                                      # get the response headers
        with open(filename, 'wb+') as tfp:                       # open the output file ('w' overwrites, 'a' would append)
            bs = 1024                                            # bytes per chunk
            size = -1
            blocknum = 0                                         # number of chunks written so far
            if "content-length" in header:
                size = int(header["Content-Length"])             # nominal total size of the file
            if reporthook:
                reporthook(blocknum, bs, size)                   # run the callback once before writing
            for chunk in fp.iter_content(chunk_size=1024):
                if chunk:
                    tfp.write(chunk)                             # write the chunk
                    tfp.flush()
                    blocknum += 1
                    if reporthook:
                        reporthook(blocknum, bs, size)           # run the callback after every write

Test it:

import sys

def Schedule(a, b, c):
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    sys.stdout.write("  " + "%.2f%%  downloaded: %d  total size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url='https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
urlretrieve(url=url,filename='111.jpg',reporthook=Schedule)

OK, it works.


Now let's put this method into the requests module itself.


First, inside the requests package folder, append the function we just wrote to the end of api.py,

and also add import contextlib at the top of that file.


Then, in __init__.py, add urlretrieve to the names imported from api.
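The exact set of names imported in __init__.py differs between requests versions, but the change amounts to something like this (a sketch of the two edits, assuming the function above was pasted at the end of api.py):

# requests/api.py -- at the very top add:
import contextlib
# ...and at the very end paste the urlretrieve function written above.

# requests/__init__.py -- extend the existing import from .api:
from .api import (request, get, head, post, patch, put,
                  delete, options, urlretrieve)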


OK, now it can be called directly:

import requests, os, time, sys

def Schedule(a, b, c):
    per = 100.0 * a * b / c   # a = number of writes, b = bytes per write, c = total file size
    if per > 100:
        per = 100
    sys.stdout.write("  " + "%.2f%%  downloaded: %d  total size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url='https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
requests.urlretrieve(url=url,filename='111.jpg',reporthook=Schedule)