Experiencing Concurrency: 8 Ways to Download Images with Python

阿新 • Published: 2018-06-16


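This article uses a web-scraping example to compare the execution efficiency of three approaches to concurrency: multithreading, multiprocessing and coroutines.

[Figure: overview diagram]

Suppose we want to download images from the web. A simple way to do it is requests + BeautifulSoup. (Note: all examples in this article use Python 3.5.)

Single-threaded

Example 1: get_photos.py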

import os
import time
import uuid

import requests
from bs4 import BeautifulSoup

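def out_wrapper(func):  # simple decorator that records how long the function takes
    def inner_wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        stop_time = time.time()
        print('Used time {}'.format(stop_time - start_time))
        return result  # pass the wrapped function's return value through
    return inner_wrapper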

def save_flag(img, filename):  # save one image
    os.makedirs('down_photos', exist_ok=True)  # create the target directory if it is missing
    path = os.path.join('down_photos', filename)
    with open(path, 'wb') as fp:
        fp.write(img)

def download_one(url):  # download one image
    image = requests.get(url)
    save_flag(image.content, str(uuid.uuid4()))

def user_conf():  # return the URLs of up to 30 images
    url = 'https://unsplash.com/'
    ret = requests.get(url)
    soup = BeautifulSoup(ret.text, "lxml")
    zzr = soup.find_all('img')
    ret = []
    num = 0
    for item in zzr:
        src = item.get("src")  # None when an <img> tag has no src attribute
        if src and src.endswith('80') and num < 30:
            num += 1
            ret.append(src)
    return ret

@out_wrapper
def download_many():
    zzr = user_conf()
    for item in zzr:
        download_one(item)

if __name__ == '__main__':
    download_many()

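Example 1 downloads sequentially; fetching 30 images takes about 60 s on average (results vary with the test environment).

The code works but is not efficient. How can we speed it up?

As the diagram at the top of the article shows, there are three options: multiprocessing, multithreading and coroutines. Let's go through them one by one.

We all know that Python (mainly CPython) has a GIL, but the GIL has little effect on I/O-bound tasks, because a thread releases it while waiting on blocking I/O. Multithreading is therefore the better fit for I/O-bound work: you can start 100 or 1,000 threads, whereas the number of processes that can actually run at the same time is limited by the number of CPU cores, so starting more of them does not help.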
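To make the point about I/O-bound work concrete, here is a minimal sketch that needs no network access. It assumes time.sleep is a fair stand-in for a blocking request (like blocking I/O, it releases the GIL while waiting); fake_download, run_sequential and run_threaded are hypothetical names that do not appear in the article's examples.

```python
import threading
import time


def fake_download(delay):
    """Hypothetical stand-in for a network request: just blocks for `delay` seconds."""
    time.sleep(delay)  # the GIL is released while the thread sleeps


def run_sequential(n, delay):
    """Run n fake downloads one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        fake_download(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    """Run n fake downloads in n threads and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential: {:.2f}s'.format(run_sequential(10, 0.1)))
    print('threaded:   {:.2f}s'.format(run_threaded(10, 0.1)))
```

The sequential run should take roughly n times the delay, while the threaded run should take roughly one delay, because all the waits overlap.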

That said, nothing stops us from trying multiprocessing as an experiment.

Multiprocessing

Example 2

from multiprocessing import Process
from get_photos import out_wrapper, download_one, user_conf

@out_wrapper
def download_many():
    zzr = user_conf()
    task_list = []
    for item in zzr:
        t = Process(target=download_one, args=(item,))
        t.start()
        task_list.append(t)
    for t in task_list:  # wait for every process to finish (so the timing is accurate)
        t.join()

if __name__ == '__main__':
    download_many()

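This example reuses part of the code from Example 1; we only need to focus on the multiprocessing part.

I ran it three times (on a dual-core machine with hyper-threading, so only 4 download tasks can be in progress at the same time) and got 19.5 s, 17.4 s and 18.6 s. The speedup is modest, which also suggests that multiprocessing is not a good fit for I/O-bound tasks.

Another way to use multiple processes is ProcessPoolExecutor from the standard-library module concurrent.futures.

Example 3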

from concurrent import futures
from get_photos import out_wrapper, download_one, user_conf

@out_wrapper
def download_many():
    zzr = user_conf()
    with futures.ProcessPoolExecutor(len(zzr)) as executor:
        res = executor.map(download_one, zzr)
    return len(list(res))

if __name__ == '__main__':
    download_many()

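With ProcessPoolExecutor the code is much more concise; executor.map is used much like the built-in map. The elapsed time is about the same as in Example 2. That is all for multiprocessing; next, let's try multithreading.

Multithreading

Example 4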

import threading
from get_photos import out_wrapper, download_one, user_conf

@out_wrapper
def download_many():
    zzr = user_conf()
    task_list = []
    for item in zzr:
        t = threading.Thread(target=download_one, args=(item,))
        t.start()
        task_list.append(t)
    for t in task_list:
        t.join()

if __name__ == '__main__':
    download_many()

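The threading and multiprocessing APIs are almost identical, but this version takes about 9 s, roughly twice as fast as the multiprocessing one.

Examples 5 and 6 below use concurrent.futures.ThreadPoolExecutor, first with map and then with submit plus as_completed.

Example 5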

from concurrent import futures
from get_photos import out_wrapper, download_one, user_conf

@out_wrapper
def download_many():
    zzr = user_conf()
    with futures.ThreadPoolExecutor(len(zzr)) as executor:
        res = executor.map(download_one, zzr)
    return len(list(res))

if __name__ == '__main__':
    download_many()

Example 6

from concurrent import futures
from get_photos import out_wrapper, download_one, user_conf

@out_wrapper
def download_many():
    zzr = user_conf()
    with futures.ThreadPoolExecutor(len(zzr)) as executor:
        to_do = [executor.submit(download_one, item) for item in zzr]
        ret = [future.result() for future in futures.as_completed(to_do)]
    return ret

if __name__ == '__main__':
    download_many()

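Executor.map is easy to use because it resembles the built-in map, and it has one notable property: results are returned in the same order in which the calls were started. Often, though, it is preferable to collect each result as soon as it is ready, regardless of submission order.

To do that, combine Executor.submit with futures.as_completed.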
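The ordering difference between the two styles can be seen in a small sketch without any network access. It assumes a hypothetical fake_download that sleeps for a given delay and returns it; with_map and with_as_completed are likewise illustrative names, not functions from the article.

```python
import time
from concurrent import futures


def fake_download(delay):
    """Hypothetical stand-in for download_one: sleeps, then returns its own delay."""
    time.sleep(delay)
    return delay


def with_map(delays):
    # executor.map yields results in the order the calls were submitted.
    with futures.ThreadPoolExecutor(len(delays)) as executor:
        return list(executor.map(fake_download, delays))


def with_as_completed(delays):
    # as_completed yields each future as soon as it finishes.
    with futures.ThreadPoolExecutor(len(delays)) as executor:
        to_do = [executor.submit(fake_download, d) for d in delays]
        return [future.result() for future in futures.as_completed(to_do)]


if __name__ == '__main__':
    delays = [0.4, 0.05, 0.2]
    print('map:          ', with_map(delays))
    print('as_completed: ', with_as_completed(delays))
```

With delays this far apart, the map version should return the values in submission order, while the as_completed version should typically return the shortest delay first.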

Finally we come to coroutines, covered here with gevent and asyncio.

gevent

Example 7

from gevent import monkey
monkey.patch_all()

import gevent
from get_photos import out_wrapper, download_one, user_conf

@out_wrapper
def download_many():
    zzr = user_conf()
    jobs = [gevent.spawn(download_one, item) for item in zzr]
    gevent.joinall(jobs)

if __name__ == '__main__':
    download_many()

asyncio

Example 8

import uuid
import asyncio

import aiohttp
from get_photos import out_wrapper, user_conf, save_flag

async def download_one(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            save_flag(await resp.read(), str(uuid.uuid4()))

@out_wrapper
def download_many():
    urls = user_conf()
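    loop = asyncio.new_event_loop()
    # asyncio.wait() rejects bare coroutines on Python 3.11+, so wrap each one in a task
    to_do = [loop.create_task(download_one(url)) for url in urls]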
    wait_coro = asyncio.wait(to_do)
    res, _ = loop.run_until_complete(wait_coro)
    loop.close()
    return len(res)

if __name__ == '__main__':
    download_many()

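The coroutine versions take about as long as the multithreaded one; the difference is that coroutines run in a single thread. The underlying mechanics are beyond the scope of this article.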
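The single-thread claim can be checked with a short sketch. It assumes Python 3.7 or later (for asyncio.run, which is newer than the Python 3.5 used elsewhere in this article) and uses asyncio.sleep as a stand-in for a non-blocking request; fake_download and run_all are hypothetical names.

```python
import asyncio
import threading
import time


async def fake_download(delay, seen_threads):
    """Hypothetical stand-in for an async request."""
    seen_threads.add(threading.get_ident())  # record which thread runs this coroutine
    await asyncio.sleep(delay)               # non-blocking wait; other coroutines run meanwhile


async def run_all(n, delay):
    """Run n fake downloads concurrently; return elapsed time and the set of thread ids seen."""
    seen_threads = set()
    start = time.perf_counter()
    await asyncio.gather(*(fake_download(delay, seen_threads) for _ in range(n)))
    return time.perf_counter() - start, seen_threads


if __name__ == '__main__':
    elapsed, seen_threads = asyncio.run(run_all(10, 0.1))
    print('elapsed: {:.2f}s, threads used: {}'.format(elapsed, len(seen_threads)))
```

All the waits overlap, so the total time should be close to a single delay, yet only one thread id is ever recorded.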

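asyncio does deserve a few words, though. It joined the standard library in Python 3.4, and Python 3.5 added the async and await keywords for it. You can probably master the multithreading and multiprocessing examples above with a little study, but understanding asyncio takes considerably more time and effort.

Writing programs with threads is also hard, because the scheduler can interrupt a thread at any time. You must hold locks to protect critical sections of the program, so that a multi-step operation is not interrupted midway and data is not left in an invalid state.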
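As an illustration of holding a lock around a multi-step operation, here is a minimal sketch. The Counter class and the run helper are hypothetical and only serve this example; the read-modify-write inside increment is the kind of operation that must not be interrupted halfway.

```python
import threading


class Counter:
    """Hypothetical shared counter protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Reading and then writing is a multi-step operation; the lock keeps
        # another thread from running between the two steps.
        with self._lock:
            current = self.value
            self.value = current + 1


def run(n_threads, n_increments):
    """Increment one shared counter from several threads and return the final value."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == '__main__':
    print(run(4, 10000))
```

Because every increment happens while the lock is held, the final value equals the number of threads times the number of increments per thread.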

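Coroutines, by contrast, are protected against interruption by default: we must explicitly yield before the rest of the program can run. With coroutines there is no need to hold locks to synchronize operations across multiple threads; coroutines synchronize themselves, because only one of them is running at any moment. To give up control, use yield or yield from (or await) to hand control back to the scheduler.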
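A small sketch of that idea, assuming Python 3.7 or later for asyncio.run: two coroutines update shared state without any lock, and control changes hands only at the explicit await. The worker and main names and the state dictionary are hypothetical.

```python
import asyncio


async def worker(name, state, log):
    """Hypothetical coroutine that updates shared state three times."""
    for _ in range(3):
        # There is no await between the read and the write, so no other
        # coroutine can run in between and no lock is needed.
        current = state['count']
        state['count'] = current + 1
        log.append(name)
        await asyncio.sleep(0)  # explicit yield: hand control back to the event loop


async def main():
    state = {'count': 0}
    log = []
    await asyncio.gather(worker('a', state, log), worker('b', state, log))
    return state['count'], log


if __name__ == '__main__':
    count, log = asyncio.run(main())
    print(count, log)
```

The two workers interleave only at the await line, so every read-modify-write completes as a unit and no update is lost.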

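Summary

This article gave a basic introduction to the concurrency-related modules in Python, as a starting point for further exploration. It did not cover the underlying concepts (processes, threads, coroutines, blocking I/O, non-blocking I/O, synchronous I/O, asynchronous I/O, event-driven programming) or how to use asyncio in depth. If you are interested, search for them on Google or Baidu, or leave a comment below so we can discuss together.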


Author: 無名小妖

Reposted from: https://blog.csdn.net/zV3e189oS5c0tSknrBCL/article/details/80681775
