Web Crawlers: Coroutines

阿新 • Published: 2021-07-11

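I. What a coroutine is
A coroutine, also called a micro-thread, is a unit of execution lighter than a thread. Coroutines are not provided by the operating system; they are implemented by programmers in user code. A coroutine is a user-space context-switching technique: put simply, a single thread switches execution back and forth between blocks of code (functions).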
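
To make the definition concrete, here is a minimal sketch of two coroutines taking turns inside one thread. The names worker, run_demo, and log are made up for this illustration, and asyncio.sleep(0) is used only as an explicit "let someone else run" point.

```python
import asyncio
import threading

log = []

async def worker(name, steps):
    for i in range(steps):
        # Record who ran, which step, and on which OS thread.
        log.append((name, i, threading.get_ident()))
        # Hand control back to the event loop so another coroutine can run.
        await asyncio.sleep(0)

async def run_demo():
    await asyncio.gather(worker("A", 2), worker("B", 2))
    return log

records = asyncio.run(run_demo())
order = [(name, i) for name, i, _ in records]
thread_ids = {tid for _, _, tid in records}
print(order)            # the two workers alternate
print(len(thread_ids))  # every step ran on the same thread
```

The two workers alternate step by step, yet only one thread id is ever recorded: the switching happens in user code, not in the operating system scheduler.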

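II. Characteristics of coroutines
1. Coroutines in one thread switch only at explicit suspension points, so shared global variables generally do not need locks. State can still change while a coroutine is suspended at an await, so code that spans one has to allow for that.
2. Coroutines achieve concurrency within a single thread.
3. When a coroutine awaits a non-blocking IO operation, the event loop switches to another coroutine that is ready to run. A blocking call (for example requests.get or time.sleep) does not trigger a switch and stalls every coroutine in the thread.
4. Coroutines handle IO-bound work very well, but CPU-bound work is not their strength.
5. Coroutines are efficient because a switch is closer to a function call than to a thread context switch, so it costs far less. Thread overhead, by contrast, grows with the number of threads.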
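
The IO-bound point can be illustrated with a rough timing sketch. Here asyncio.sleep stands in for a non-blocking network wait, and fake_io and run_concurrently are hypothetical names. Three waits of 0.2 seconds should finish in roughly 0.2 seconds in total rather than 0.6, because the waits overlap.

```python
import asyncio
import time

async def fake_io(delay):
    # asyncio.sleep stands in for a non-blocking network wait.
    await asyncio.sleep(delay)
    return delay

async def run_concurrently():
    start = time.perf_counter()
    # All three waits are in flight at the same time.
    results = await asyncio.gather(fake_io(0.2), fake_io(0.2), fake_io(0.2))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(run_concurrently())
print(results, round(elapsed, 2))
```

Replacing asyncio.sleep with time.sleep in fake_io would make the waits run back to back, since a blocking call never yields to the event loop.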

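III. How coroutines work

A coroutine has its own registers, context, and stack. When the scheduler switches away from a coroutine, that state is saved elsewhere; when it switches back, the saved state is restored and execution resumes from where the coroutine last left off. In Python, this state is the coroutine's frame: its local variables and its current position in the code.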
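
A plain generator is enough to observe this save-and-restore behaviour. The sketch below (counter is an illustrative name) suspends at yield and resumes with its local variable total intact.

```python
def counter():
    total = 0  # local state lives in the generator's own frame
    while True:
        # Suspend here and hand `total` to the caller;
        # execution resumes at this line on the next send().
        received = yield total
        total += received

gen = counter()
first = next(gen)      # runs up to the first yield
second = gen.send(5)   # resumes with `total` preserved, then adds 5
third = gen.send(7)    # resumes again, then adds 7
print(first, second, third)  # 0 5 12
```

Nothing outside the generator stores total; it survives each suspension because the frame itself is kept alive.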

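IV. Processes, threads, and coroutines compared
1. A coroutine is neither a process nor a thread. It is a special kind of function, and it sits at a different level from processes and threads.
2. A process can contain multiple threads, and a thread can contain multiple coroutines.
3. The coroutines within one thread can switch among themselves, but they execute one at a time, and each coroutine runs in only one thread, so coroutines alone cannot take advantage of multiple CPU cores.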
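
Point 3 can be seen in a small sketch: a coroutine that never awaits runs to completion before any other coroutine gets a turn. The names cpu_bound, ticker, and events are assumptions made for this example.

```python
import asyncio

events = []

async def cpu_bound(name):
    # No await inside: nothing else can run until this returns.
    total = sum(range(100_000))
    events.append(f"{name} done")
    return total

async def ticker():
    events.append("tick")
    await asyncio.sleep(0)  # the only suspension point in this demo
    events.append("tock")

async def run_demo():
    await asyncio.gather(cpu_bound("job-1"), ticker(), cpu_bound("job-2"))

asyncio.run(run_demo())
print(events)
```

The second CPU-bound job only starts once ticker suspends, and ticker only finishes after that job has returned, which is why heavy computation is usually moved to other processes rather than left in coroutines.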

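V. Ways to implement coroutines
(1) The greenlet module: one of the earliest third-party coroutine implementations
(2) The yield keyword
(3) The asyncio.coroutine decorator (introduced in Python 3.4; deprecated in 3.8 and removed in 3.11)
(4) The async and await keywords (Python 3.5 and later): convenient and strongly recommended; this article focuses on this approach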
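
As a rough illustration of approach (2), generators can be switched by a hand-written scheduler. This is a toy round-robin loop under assumed names (task, round_robin), not how asyncio is implemented, but it shows the idea that async and await later made first-class.

```python
from collections import deque

def task(name, count):
    for i in range(count):
        yield f"{name}-{i}"  # hand control back to the scheduler

def round_robin(generators):
    """Run the generators in turn until all are exhausted; return the trace."""
    queue = deque(generators)
    trace = []
    while queue:
        gen = queue.popleft()
        try:
            trace.append(next(gen))
        except StopIteration:
            continue  # this generator is finished; do not requeue it
        queue.append(gen)
    return trace

trace = round_robin([task("a", 2), task("b", 3)])
print(trace)  # ['a-0', 'b-0', 'a-1', 'b-1', 'b-2']
```

With async and await, the event loop plays the role of round_robin and await marks the points where a coroutine gives up control.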

VI. Examples

1. Downloading 360 images with coroutines (an introduction to aiohttp, the asynchronous HTTP client used here: https://www.cnblogs.com/fengting0913/p/14926893.html)

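import asyncio
import aiohttp

# async defines a coroutine function; await is what actually runs a coroutine and waits for its result.
# Download one image with a coroutine.
async def download(url):
    # Create a session. "async with" is a single construct: an asynchronous context manager.
    async with aiohttp.ClientSession() as session:
        # Send the request and receive the response.
        async with session.get(url=url) as response:
            content = await response.content.read()
            # Save the image. Ordinary local file IO is synchronous, so there is nothing to await here.
            image_name = url.split('/')[-1]
            with open(image_name, 'wb') as fp:
                fp.write(content)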

async def main():
    # URLs to download.
    url_list = [
        'https://img0.baidu.com/it/u=291378222,233871465&fm=26&fmt=auto&gp=0.jpg',
        'https://img2.baidu.com/it/u=3466049587,2049802835&fm=26&fmt=auto&gp=0.jpg',
        'https://img0.baidu.com/it/u=213410053,396892388&fm=26&fmt=auto&gp=0.jpg',
        'https://img0.baidu.com/it/u=1380950348,3018255149&fm=26&fmt=auto&gp=0.jpg',
        'https://img1.baidu.com/it/u=4110196045,3829597861&fm=26&fmt=auto&gp=0.jpg'
    ]
    # Wrap each coroutine in a Task. Creating the Task schedules it on the event loop,
    # where it waits (initially ready to run) for the loop to execute it.
    tasks = [asyncio.create_task(download(url)) for url in url_list]
    # Wait for every task to finish. While one coroutine is suspended on network IO,
    # the event loop runs the others. Unlike asyncio.wait, gather re-raises a task's exception.
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    # asyncio.run (Python 3.7+) creates an event loop, runs main() until it completes,
    # and then closes the loop.
    asyncio.run(main())
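
When the URL list grows, starting every download at once can overwhelm the target server. One common option, sketched here as an assumption rather than as part of the original example, is to cap the number of in-flight requests with asyncio.Semaphore. To keep the sketch runnable without a network, fake_download uses asyncio.sleep in place of the aiohttp request, and the example.com URLs are placeholders.

```python
import asyncio

async def fake_download(url, semaphore, state):
    # At most `limit` coroutines can be inside this block at once.
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the real network request
        state["active"] -= 1
    # Same file-name rule as the download example: last path segment.
    return url.rsplit("/", 1)[-1]

async def download_all(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    names = await asyncio.gather(
        *(fake_download(url, semaphore, state) for url in urls)
    )
    return names, state["peak"]

urls = [f"https://example.com/img/{i}.jpg" for i in range(5)]
names, peak = asyncio.run(download_all(urls, limit=2))
print(names, peak)
```

In a real crawler, the body of the async with semaphore block would contain the session.get call, and peak would not need to be tracked.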

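2. Collecting article titles from the mini-program community site: a high-concurrency crawler built with asyncio + aiohttp + aiomysql

Approach:
    1. Request the first page of the mini-program community article list and collect all the detail-page links.
    2. Open each article's detail page, collect its "previous article" and "next article" links, request them, and repeat this step.
    3. Extract the title and store it in MySQL (using an asynchronous MySQL connection pool).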
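
The three steps can be sketched with asyncio.Queue before any real networking is involved. Everything here is a stand-in: FAKE_SITE is a hypothetical in-memory site, fetch replaces the HTTP request, and a dict replaces the MySQL table. Unlike a polling loop, queue.join() lets the crawl stop once every discovered page has been processed.

```python
import asyncio

# Hypothetical in-memory "site": page -> (title, linked pages).
FAKE_SITE = {
    "list": ("List page", ["a1", "a2"]),
    "a1": ("Title 1", ["a2"]),
    "a2": ("Title 2", ["a1", "a3"]),
    "a3": ("Title 3", ["a2"]),
}

async def fetch(page):
    await asyncio.sleep(0)  # stand-in for an HTTP request
    return FAKE_SITE[page]

async def worker(queue, seen, titles):
    while True:
        page = await queue.get()
        try:
            title, links = await fetch(page)
            titles[page] = title  # stand-in for the database insert
            for link in links:
                if link not in seen:  # de-duplicate before queueing
                    seen.add(link)
                    queue.put_nowait(link)
        finally:
            queue.task_done()

async def crawl(start, worker_count=2):
    queue = asyncio.Queue()
    seen = {start}
    titles = {}
    queue.put_nowait(start)
    workers = [
        asyncio.create_task(worker(queue, seen, titles))
        for _ in range(worker_count)
    ]
    await queue.join()  # returns once every queued page is processed
    for w in workers:
        w.cancel()      # workers are idle in queue.get(); stop them
    await asyncio.gather(*workers, return_exceptions=True)
    return titles

titles = asyncio.run(crawl("list"))
print(titles)
```

New links are queued before task_done() is called, so the queue's unfinished count never drops to zero while work is still being discovered.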

Asynchronous operations with aiomysql: https://www.yangyanxing.com/article/aiomysql_in_python.html

Introduction to using aiomysql: https://www.cnblogs.com/zwb8848happy/p/8809861.html

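from lxml import etree
import asyncio
import aiohttp
import aiomysql
import re

# Request function: returns the response body as text, or None if the request fails.
async def get_request(url):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=headers) as response:
                if response.status == 200:
                    return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError, UnicodeDecodeError):
        # Network and decoding errors are ignored; the caller receives None.
        pass
    return None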

# Extract article URLs from a page.
# Note: parsing is CPU work, not IO, so an ordinary function is enough.
def get_url(content):
    html = etree.HTML(content)
    a_href_list = html.xpath('//a')
    for a_href in a_href_list:
        # xpath always returns a list.
        href = a_href.xpath('./@href')
        if href and url_pattern.findall(href[0]) and href[0] not in url_set:
            url_set.add(href[0])
            wait_url.append(href[0])

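async def parse_article(url, pool):
    content = await get_request(url)
    if content is None:
        # The request failed, so there is nothing to parse.
        return
    try:
        get_url(content)
        html = etree.HTML(content)
        # xpath returns a list; fall back to an empty title if the page has no <h1> text.
        titles = html.xpath('//h1/text()')
        title = titles[0] if titles else ''
        print(title)
        async with pool.acquire() as connect:
            async with connect.cursor() as cursor:
                insert_into = 'insert into titles(title) values (%s)'
                await cursor.execute(insert_into, (title,))
    except Exception as exc:
        # Report the failure instead of silently discarding it.
        print(f'failed to process {url}: {exc}')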

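# Consumer: takes URLs off the pending list and schedules a task for each one.
# This loop never exits on its own; the crawler runs until it is stopped manually.
async def consumer(pool):
    while True:
        if len(wait_url) == 0:
            # Nothing to do yet; yield to the event loop and check again shortly.
            await asyncio.sleep(0.5)
            continue
        url = wait_url.pop()
        asyncio.ensure_future(parse_article(url, pool))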

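async def main():
    """
    1. Create the asynchronous MySQL connection pool. aiomysql is coroutine-based, so its calls must be awaited.
    2. A connection pool keeps a set number of open connections available. A query takes a connection from the
    pool before it runs and returns it afterwards, which avoids reconnecting to the database for every query and saves considerable resources.
    3. Under high concurrency, an asynchronous connection pool can noticeably improve overall read/write throughput compared with a single connection.
    """
    pool = await aiomysql.create_pool(
        host='127.0.0.1',
        port=3306,
        user='root',
        password='123456',
        db='mina',
        charset='utf8',
        autocommit=True
    )
    content = await get_request(start_url)
    if content is None:
        raise RuntimeError(f'could not fetch {start_url}')
    # Add start_url to the url_set de-duplication set.
    url_set.add(start_url)
    get_url(content)
    await consumer(pool)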

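if __name__ == '__main__':
    start_url = 'https://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1'
    headers = {
        # Adjacent string literals are joined, so no stray indentation ends up in the header value.
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/80.0.3987.163 Safari/537.36'
    }
    # Set of URLs already seen, used for de-duplication.
    url_set = set()
    # List of URLs waiting to be fetched.
    wait_url = []
    # Regular expression that matches article URLs.
    url_pattern = re.compile(r'article-\d+-\d+\.html')
    # asyncio.run (Python 3.7+) creates the event loop and runs main() on it.
    asyncio.run(main())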
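
The get_url function above stores each href exactly as it appears in the page. If a site uses relative links, those strings cannot be requested directly. The sketch below is one possible refinement, under the assumption that links may be relative: it resolves each href against the page URL with urllib.parse.urljoin before filtering and de-duplicating. It uses the standard-library html.parser instead of lxml so that it runs without extra packages; LinkCollector, extract_new_links, and the example.com sample are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Same article-URL pattern as in the listing above.
ARTICLE_PATTERN = re.compile(r"article-\d+-\d+\.html")

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def extract_new_links(html, base_url, seen):
    collector = LinkCollector()
    collector.feed(html)
    new_links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if ARTICLE_PATTERN.search(absolute) and absolute not in seen:
            seen.add(absolute)
            new_links.append(absolute)
    return new_links

sample_html = """
<a href="article-101-1.html">first</a>
<a href="/article-102-1.html">second</a>
<a href="article-101-1.html">duplicate</a>
<a href="/about.html">about</a>
"""
seen = set()
links = extract_new_links(sample_html, "https://example.com/portal.php", seen)
print(links)
```

De-duplicating on the absolute URL also prevents the same article from being queued twice under a relative and an absolute spelling.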
