A Roundup of Ways to Download Files in Python

阿新 • Published: 2020-11-18

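This article covers several ways to download files in Python, from fetching a small file in a single call to downloading a large file with resumable transfers.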

Requests

The get method of the Requests module downloads a file from a URL. It is widely used in Python crawlers to fetch simple web content.

import requests

# Image from bing.com
url = 'https://cn.bing.com/th?id=OHR.DerwentIsle_EN-CN8738104578_400x240.jpg'

def requests_download():
  # The whole response body is read into memory before it is written out
  content = requests.get(url).content

  with open('pic_requests.jpg', 'wb') as file:
    file.write(content)

urllib

The urlretrieve function in Python's built-in urllib module saves the response to a URL request directly to a file.

from urllib import request

# Image from bing.com
url = 'https://cn.bing.com/th?id=OHR.DerwentIsle_EN-CN8738104578_400x240.jpg'

def urllib_download():
  request.urlretrieve(url,'pic_urllib.jpg')
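
urlretrieve also accepts an optional reporthook callback. It is called once before the first block and again after each block is read, and receives the block number, the block size, and the total size (-1 when the server does not report one). The sketch below uses it to print a rough percentage. The names make_progress_hook and urllib_download_with_progress are made up for this illustration, and the figure is approximate because the last block is usually shorter than block_size.

```python
from urllib import request


def make_progress_hook(label):
    """Build a reporthook callback that prints approximate progress."""
    def hook(block_num, block_size, total_size):
        # Bytes read so far, rounded up to a whole number of blocks
        downloaded = block_num * block_size
        if total_size > 0:
            percent = min(100.0, downloaded * 100 / total_size)
            print(f"\r{label}: {percent:5.1f}%", end="")
        else:
            # total_size is -1 when no Content-Length header was sent
            print(f"\r{label}: {downloaded} bytes", end="")
    return hook


def urllib_download_with_progress(url, filename):
    """Save url to filename, reporting progress through reporthook."""
    path, _headers = request.urlretrieve(
        url, filename, reporthook=make_progress_hook(filename))
    print()  # finish the progress line
    return path


# Possible usage with the url defined above:
# urllib_download_with_progress(url, 'pic_urllib.jpg')
```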

urllib3

urllib3 is a third-party HTTP client module for Python that sends its requests through a connection pool.

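import urllib3

# Image from bing.com
url = 'https://cn.bing.com/th?id=OHR.DerwentIsle_EN-CN8738104578_400x240.jpg'

def urllib3_download():
  # Create a connection pool
  poolManager = urllib3.PoolManager()

  resp = poolManager.request('GET', url)

  with open("pic_urllib3.jpg", "wb") as file:
    file.write(resp.data)

  # Return the connection to the pool
  resp.release_conn()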

wget

Linux has the wget command for conveniently downloading resources from the web, and Python has a corresponding wget module. Install it with pip install wget.

import wget

# Image from bing.com
url = 'https://cn.bing.com/th?id=OHR.DerwentIsle_EN-CN8738104578_400x240.jpg'

def wget_download():
  wget.download(url,out='pic_wget.jpg')

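The wget module can also be run directly from the command line. Quote the URL so the shell does not interpret the ? character: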

python -m wget "https://cn.bing.com/th?id=OHR.DerwentIsle_EN-CN8738104578_400x240.jpg"

Downloading large files in chunks

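When the file is very large and the machine does not have enough memory to hold it, use the streaming mode of the requests module. The stream argument defaults to False, which loads the whole response body into memory at once, so an oversized file can exhaust memory. With stream=True, requests fetches only the response headers up front and downloads the body as it is iterated with iter_content or iter_lines.

iter_content: iterates over the content chunk by chunk
iter_lines: iterates over the content line by line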

import requests

def stream_download():
  # VS Code installer
  url = 'https://vscode.cdn.azure.cn/stable/e5a624b788d92b8d34d1392e4c4d9789406efe8f/VSCodeUserSetup-x64-1.51.1.exe'

  with requests.get(url, stream=True) as r:
    with open('vscode.exe', 'wb') as file:
      # chunk_size sets how much is read and written per iteration: 1024 * 1024 bytes
      for chunk in r.iter_content(chunk_size=1024*1024):
        if chunk:
          file.write(chunk)
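
For comparison, the same chunk-by-chunk idea can be sketched with the standard library alone: urlopen returns a file-like response, and shutil.copyfileobj copies it to disk in pieces of at most length bytes, so memory use stays bounded by the chunk size. The function name urlopen_chunked_download and the 30-second timeout are assumptions made for this illustration.

```python
import shutil
from urllib import request


def urlopen_chunked_download(url, filename, chunk_size=1024 * 1024):
    """Copy the response body to filename, chunk_size bytes at a time."""
    # timeout=30 is an arbitrary value chosen for this sketch
    with request.urlopen(url, timeout=30) as resp, open(filename, "wb") as fp:
        # copyfileobj reads and writes in a loop, never holding
        # more than one chunk in memory
        shutil.copyfileobj(resp, fp, length=chunk_size)
    return filename
```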

Progress bar

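A progress bar makes a large download easier to follow: it shows the current transfer speed and how much of the file has been downloaded in real time. The tqdm module is used here to display it and can be installed with pip install tqdm.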

import requests
from tqdm import tqdm

def tqdm_download():

  url = 'https://vscode.cdn.azure.cn/stable/e5a624b788d92b8d34d1392e4c4d9789406efe8f/VSCodeUserSetup-x64-1.51.1.exe'

  # A single streamed request supplies both the headers and the body
  with requests.get(url, stream=True) as r:
    # Get the file size
    file_size = int(r.headers['content-length'])

    with tqdm(total=file_size, unit='B', unit_scale=True, unit_divisor=1024, ascii=True, desc='vscode.exe') as bar:
      with open('vscode.exe', 'wb') as fp:
        for chunk in r.iter_content(chunk_size=512):
          if chunk:
            fp.write(chunk)
            bar.update(len(chunk))

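tqdm parameters:

total: the total size of the file, in bytes
unit='B': count progress in bytes
unit_scale=True: scale the displayed numbers automatically with a unit prefix (k, M, G, ...) instead of showing raw byte counts
unit_divisor=1024: divide sizes and speeds by 1024 when scaling; the default is 1000
ascii=True: draw the bar with ASCII characters, for compatibility with Windows consoles
desc='vscode.exe': the label shown in front of the progress bar, here the file name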
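
To make the difference between the two divisors concrete, here is a small standalone sketch. The helper scale_bytes is hypothetical and is not tqdm's own formatting code; it only shows how the same byte count reads when each step divides by 1000 versus 1024.

```python
def scale_bytes(num_bytes, divisor=1000):
    """Format a byte count, stepping up one unit prefix per division."""
    units = ["B", "kB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop once the value fits under the divisor, or at the largest unit
        if value < divisor or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= divisor


print(scale_bytes(1024 * 1024, divisor=1000))  # 1.05 MB
print(scale_bytes(1024 * 1024, divisor=1024))  # 1.00 MB
```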


Resumable downloads

HTTP/1.1 added a request header field named Range. The Range field lets a download continue from the content that has already been downloaded.

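If the server supports Range requests, the response status code is 206 (Partial Content). A server that does not support them normally ignores the header and returns 200 with the whole file, while 416 (Range Not Satisfiable) is returned when the requested range lies outside the file.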
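
One way to act on these status codes is to let them decide how the local file is opened. The helper below is a minimal sketch; the name file_mode_for_response is hypothetical, and treating every other status as an error is an assumption of this example.

```python
def file_mode_for_response(status_code):
    """Pick the open() mode for the response to a Range request."""
    if status_code == 206:
        # Partial Content: the server sent only the requested range,
        # so append it to what is already on disk
        return "ab"
    if status_code == 200:
        # The server ignored Range and is sending the whole file,
        # so overwrite the partial file instead of appending
        return "wb"
    if status_code == 416:
        raise ValueError("requested range is not satisfiable")
    raise ValueError(f"unexpected status code: {status_code}")
```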

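Range format

Range: <unit>=<first byte pos>-<last byte pos>, that is, the starting byte position and the ending byte position, with bytes as the unit, for example Range: bytes=0-499. Both positions are zero-based and inclusive, and the last position may be omitted to request everything up to the end of the file.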
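
The format can be sketched as a pair of small helpers: one builds the Range request header, the other reads the Content-Range header that a 206 response carries (for example bytes 1000-1999/5000). Both function names are invented for this illustration, and the parser handles only the satisfied-range form, not the bytes */5000 form sent with a 416 response.

```python
import re


def build_range_header(first_byte, last_byte=None):
    """Return a headers dict asking for bytes first_byte..last_byte."""
    if first_byte < 0:
        raise ValueError("first_byte must not be negative")
    if last_byte is None:
        # Open-ended range: everything from first_byte to the end
        return {"Range": f"bytes={first_byte}-"}
    if last_byte < first_byte:
        raise ValueError("last_byte must not be before first_byte")
    return {"Range": f"bytes={first_byte}-{last_byte}"}


_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value):
    """Parse 'bytes first-last/total' into (first, last, total)."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last, total = match.groups()
    # The total is '*' when the server does not know the full size
    return int(first), int(last), None if total == "*" else int(total)


print(build_range_header(0, 499))                   # {'Range': 'bytes=0-499'}
print(parse_content_range("bytes 1000-1999/5000"))  # (1000, 1999, 5000)
```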

Add Range to the request headers:

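import os

import requests
from tqdm import tqdm

def duan_download():
  url = 'https://vscode.cdn.azure.cn/stable/e5a624b788d92b8d34d1392e4c4d9789406efe8f/VSCodeUserSetup-x64-1.51.1.exe'

  # Get the file size from the headers, then close without reading the body
  with requests.get(url, stream=True) as r:
    file_size = int(r.headers['content-length'])

  file_name = 'vscode.exe'
  # If the file already exists, resume from its current size; otherwise start from 0
  first_byte = 0
  if os.path.exists(file_name):
    first_byte = os.path.getsize(file_name)

  # Check whether the download has already finished
  if first_byte >= file_size:
    return

  # Add Range to the request headers; positions are inclusive, so the last byte is file_size - 1
  header = {"Range": f"bytes={first_byte}-{file_size - 1}"}

  # Pass the headers argument
  with requests.get(url, headers=header, stream=True) as r:
    # 206 means the server honored Range; otherwise it is sending the whole file again
    resumed = r.status_code == 206
    # initial starts the bar at the bytes already on disk
    with tqdm(total=file_size, initial=first_byte if resumed else 0, desc=file_name) as bar:
      # Append when resuming, overwrite when starting over
      with open(file_name, 'ab' if resumed else 'wb') as fp:
        for chunk in r.iter_content(chunk_size=512):
          if chunk:
            fp.write(chunk)
            bar.update(len(chunk))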

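Sample result

After the download has run for a while, stop the script and run it again; the file continues downloading from the point where it was interrupted.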
