Key Points on Improving Crawler Performance

阿新 • Published: 2018-01-23

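At its core, a crawler imitates a socket client talking to a server. If there are several URLs to crawl and we use a single thread that processes them serially, each URL has to finish before the next one can start, and throughput turns out to be very poor.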

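The reason: crawling is an IO-bound task. Whenever the thread hits IO it blocks and makes no use of the CPU until the blocking ends, so while it waits the task is effectively stuck. By contrast, if a single thread runs N purely computational tasks, it keeps the CPU busy the whole time, so running compute-bound tasks serially in one thread is no slower than running them concurrently. For IO-bound tasks, however, serial execution is very inefficient. For more on IO models, see: http://www.cnblogs.com/linhaifeng/articles/7454717.html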
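To make the IO-bound argument concrete, here is a minimal sketch that needs no network. It assumes `time.sleep` is a fair stand-in for a blocking request; `fake_download` and the `url-N` names are hypothetical placeholders, not part of the original post.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url, delay=0.2):
    """Stand-in for a network request: sleeping leaves the CPU idle, like blocking IO."""
    time.sleep(delay)
    return url

def run_serial(urls):
    # One task at a time: total time is roughly the sum of all delays.
    start = time.perf_counter()
    results = [fake_download(u) for u in urls]
    return results, time.perf_counter() - start

def run_threaded(urls):
    # One thread per task: the waits overlap, so total time is roughly one delay.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        results = list(pool.map(fake_download, urls))
    return results, time.perf_counter() - start

urls = ['url-%d' % i for i in range(5)]  # hypothetical placeholders, nothing is fetched
serial_results, serial_time = run_serial(urls)
threaded_results, threaded_time = run_threaded(urls)
print('serial   %.2fs' % serial_time)
print('threaded %.2fs' % threaded_time)
```

Replacing the sleep with a pure computation would remove most of the difference, which is the point the paragraph above makes about compute-bound work.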

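The ways to make a crawler more efficient are:

Synchronous calls, asynchronous calls, and callbacks

A synchronous call submits a task and then waits in place until it finishes. Only once the result is available does execution continue to the next line, which is inefficient.

Example code: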

import requests

def parse_page(res):
    print('Parsing %s' % len(res))

def get_page(url):
    print('Downloading %s' % url)
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']
for url in urls:
    res = get_page(url)  # submit one task, then wait in place for its result before the next iteration
    if res is not None:  # get_page returns None for non-200 responses
        parse_page(res)

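Synchronous call

To get around the synchronous call above, we can start multiple threads or processes, so that one task blocking on IO does not hold up the others. This has its own problem, though: a server is still a single machine with finite hardware, so threads or processes cannot be created without limit. When it has to respond to thousands of requests at once (for example 12306, or a campus intranet during course registration), this approach eats up system resources and lowers response efficiency. Threads and processes tend to hang, and what users see is a site too overloaded to reach.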

# IO-bound programs should use multiple threads
import requests
from threading import Thread, current_thread

def parse_page(res):
    print('%s parsing %s' % (current_thread().name, len(res)))

def get_page(url, callback=parse_page):
    print('%s downloading %s' % (current_thread().name, url))
    response = requests.get(url)
    if response.status_code == 200:
        callback(response.text)

if __name__ == '__main__':
    urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']
    for url in urls:
        t = Thread(target=get_page, args=(url,))
        t.start()

Multiple processes or threads: multithreading example code

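Thread pool or process pool + asynchronous calls

One approach is to use a "thread pool" or a "connection pool". A thread pool reduces how often threads are created and destroyed: it maintains a reasonable number of threads and lets idle ones pick up new tasks. A connection pool keeps a cache of connections, reusing existing ones where possible and reducing how often connections are opened and closed. Both techniques lower system overhead and are widely used in large systems such as WebSphere, Tomcat, and various databases.

However, thread pools and connection pools only partly relieve the resource cost of frequent IO calls. A pool always has an upper limit, and when requests far exceed it, a pooled system responds little better than one without a pool. The pool size therefore has to be chosen with the expected load in mind and adjusted as that load changes. Under large-scale request volumes the multithreaded model also hits a bottleneck, which non-blocking interfaces can address relatively effectively.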
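The effect of the pool's upper limit can be seen in a small offline sketch. It assumes a sleep stands in for a request; `fake_request` and `timed_run` are hypothetical helper names, and the sizes are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i, delay=0.1):
    """Stand-in for one blocking request."""
    time.sleep(delay)
    return i

def timed_run(pool_size, n_tasks=8):
    # With fewer workers than tasks, the tasks run in waves, so elapsed
    # time grows as the pool shrinks relative to the load.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(fake_request, range(n_tasks)))
    return results, time.perf_counter() - start

for size in (2, 8):
    results, elapsed = timed_run(size)
    print('pool size %d: %d tasks in %.2fs' % (size, len(results), elapsed))
```

A pool of 2 handles the 8 tasks in about four waves, while a pool of 8 overlaps all of them. Once the load far exceeds the pool size, the waves dominate, as the paragraph above describes.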

# IO-bound programs should use threads, so here we use a thread pool
import requests
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def parse_page(future):
    res = future.result()  # the callback receives the finished Future
    if res is None:  # get_page returns None for non-200 responses
        return
    print('%s parsing %s' % (current_thread().name, len(res)))

def get_page(url):
    print('%s downloading %s' % (current_thread().name, url))
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

if __name__ == '__main__':
    urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']

    pool = ThreadPoolExecutor(50)
    # pool = ProcessPoolExecutor(50)
    for url in urls:
        pool.submit(get_page, url).add_done_callback(parse_page)

    pool.shutdown(wait=True)

Process pool or thread pool: asynchronous calls + callbacks (thread pool example code)

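How to improve performance

In summary, dealing with IO blocking is the ultimate goal when improving crawler performance. IO itself cannot be avoided, and how long it takes depends on the hardware, so the program cannot optimize it away. What can we do? The key is to detect IO blocking at the application level and, the moment it is detected, switch the CPU to another task inside our own program. This keeps the program's blocked time to a minimum and keeps it in the ready state more often. To the operating system the program then looks like one that does little IO, so it is given as much CPU time as possible, and that is how execution efficiency improves.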
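The idea of switching tasks inside one program can be sketched with plain generators. This is only an illustration: it does not detect real IO (a real event loop relies on mechanisms such as select or epoll for that), and `task` and `run_all` are hypothetical names.

```python
from collections import deque

def task(name, steps, log):
    """Generator-based task: each `yield` marks a point where it would block on IO."""
    for i in range(steps):
        log.append('%s step %d' % (name, i))
        yield  # hand control back to the scheduler instead of waiting

def run_all(tasks):
    """Round-robin scheduler: resume each task until its next yield, then move on."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)
            ready.append(current)  # still has work left, queue it again
        except StopIteration:
            pass  # task finished, drop it

log = []
run_all([task('A', 2, log), task('B', 3, log)])
print(log)
# ['A step 0', 'B step 0', 'A step 1', 'B step 1', 'B step 2']
```

A single thread interleaves both tasks, so neither one leaves the program idle while it "waits".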

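The asyncio module

Python 3.4 added the asyncio module to the standard library. It can detect IO for us (network IO only) and switch between tasks at the application level.

Basic usage: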

import asyncio

async def task(task_id, seconds):
    print('%s is start' % task_id)
    await asyncio.sleep(seconds)  # stands in for network IO; while it waits, the event loop switches to other tasks
    print('%s is end' % task_id)

async def main():
    await asyncio.gather(
        task(task_id='Task 1', seconds=3),
        task('Task 2', 2),
        task(task_id='Task 3', seconds=1),
    )

asyncio.run(main())  # asyncio.run requires Python 3.7+

Note: asyncio itself only works at the TCP level and does not speak HTTP, so when we need to send an HTTP request we have to build the HTTP headers ourselves.
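As a small illustration of what building the headers ourselves involves, the sketch below assembles the raw bytes of an HTTP/1.0 request. `build_request`, `www.example.com`, and the `demo-crawler` user agent are hypothetical names chosen for the example.

```python
def build_request(host, path='/', method='GET', headers=None, body=b''):
    """Assemble a raw HTTP/1.0 request: request line, headers, blank line, body."""
    lines = ['%s %s HTTP/1.0' % (method, path), 'Host: %s' % host]
    extra = dict(headers or {})
    if body:
        extra['Content-Length'] = str(len(body))  # tells the server how long the body is
    for key, value in extra.items():
        lines.append('%s: %s' % (key, value))
    head = '\r\n'.join(lines) + '\r\n\r\n'  # a blank line ends the header section
    return head.encode('ascii') + body  # the path must already be percent-encoded ASCII

raw = build_request('www.example.com', '/index.html',
                    headers={'User-Agent': 'demo-crawler'})
print(raw)
# b'GET /index.html HTTP/1.0\r\nHost: www.example.com\r\nUser-Agent: demo-crawler\r\n\r\n'
```

These bytes are what would be written to the TCP stream. Everything a library such as requests normally hides (the request line, `Host`, the terminating blank line) has to be spelled out by hand.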

import asyncio
import uuid
from urllib.parse import quote

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

def parse_page(host, res):
    print('%s parsed result %s' % (host, len(res)))
    with open('%s.html' % uuid.uuid1(), 'wb') as f:
        f.write(res)

async def get_page(host, port=80, url='/', callback=parse_page, ssl=False):
    if ssl:
        port = 443
    print('Downloading %s://%s:%s%s' % ('https' if ssl else 'http', host, port, url))

    # Step 1 (blocking IO): opening the TCP connection blocks, so we await it
    recv, send = await asyncio.open_connection(host=host, port=port, ssl=ssl)

    # Step 2: build the HTTP header; asyncio only sends TCP data, so we assemble the HTTP request ourselves
    request_headers = 'GET %s HTTP/1.0\r\nHost: %s\r\nUser-agent: %s\r\n\r\n' % (url, host, user_agent)
    # request_headers = 'POST %s HTTP/1.0\r\nHost: %s\r\n\r\nname=egon&password=123' % (url, host)
    request_headers = request_headers.encode('utf-8')

    # Step 3 (blocking IO): send the HTTP request
    send.write(request_headers)
    await send.drain()

    # Step 4 (blocking IO): receive the response headers
    while True:
        line = await recv.readline()
        if line in (b'\r\n', b''):  # blank line ends the headers; b'' means the peer closed early
            break
        print('%s Response headers: %s' % (host, line))

    # Step 5 (blocking IO): receive the response body
    text = await recv.read()

    # Step 6: run the callback
    callback(host, text)

    # Step 7: close the socket
    send.close()  # the reader has no close(); closing the writer tears down the connection and the other end follows


async def main():
    await asyncio.gather(
        get_page('www.baidu.com', url='/s?wd=' + quote('美女'), ssl=True),  # percent-encode the non-ASCII query
        get_page('www.cnblogs.com', url='/', ssl=True),
    )

if __name__ == '__main__':
    asyncio.run(main())

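asyncio + hand-built HTTP headers

The aiohttp module

Building HTTP headers by hand is somewhat tedious. The aiohttp module takes care of the HTTP layer for us, so asyncio plus aiohttp is all we need.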

import aiohttp  # pip3 install aiohttp
import asyncio


async def get_page(session, url):
    print('GET:%s' % url)
    async with session.get(url) as response:  # the response is released when the block exits
        data = await response.read()

    print(url, data)
    return 1

async def main():
    urls = [
        'https://www.python.org/doc',
        'https://www.cnblogs.com/linhaifeng',
        'https://www.openstack.org',
    ]
    async with aiohttp.ClientSession() as session:  # one session shares its connection pool across requests
        return await asyncio.gather(*(get_page(session, url) for url in urls))

results = asyncio.run(main())

print('=====>', results)  # [1, 1, 1]

aiohttp + asyncio

import requests
import asyncio

async def get_page(func, *args):
    print('GET:%s' % args[0])
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, func, *args)  # run the blocking call in the default thread pool
    response = await future

    print(response.url, len(response.text))
    return 1

async def main():
    return await asyncio.gather(
        get_page(requests.get, 'https://www.python.org/doc'),
        get_page(requests.get, 'https://www.cnblogs.com/linhaifeng'),
        get_page(requests.get, 'https://www.openstack.org'),
    )

results = asyncio.run(main())

print('=====>', results)  # [1, 1, 1]

Optimized code: requests driven through asyncio's executor
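The executor pattern can also be tried without a network. The sketch below assumes a sleep stands in for `requests.get`; `blocking_fetch` and `fetch_all` are hypothetical names.

```python
import asyncio
import time

def blocking_fetch(url, delay=0.2):
    """Stand-in for a blocking call such as requests.get."""
    time.sleep(delay)
    return 'body of %s' % url

async def fetch_all(urls):
    loop = asyncio.get_running_loop()
    # Each blocking call runs in the default thread pool; the event loop
    # stays free and simply awaits the resulting futures.
    futures = [loop.run_in_executor(None, blocking_fetch, u) for u in urls]
    return await asyncio.gather(*futures)

start = time.perf_counter()
pages = asyncio.run(fetch_all(['a', 'b', 'c']))
elapsed = time.perf_counter() - start
print(pages, '%.2fs' % elapsed)
```

The three calls overlap, so the total is close to one delay instead of three. The concurrency still comes from threads, so this is a bridge for blocking libraries, not a replacement for a natively asynchronous client such as aiohttp.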

The grequests module

grequests wraps gevent coroutines together with the requests module, so coroutines can be used to improve performance.

# pip3 install grequests

import grequests

request_list = [
    grequests.get('https://wwww.xxxx.org/doc1'),
    grequests.get('https://www.cnblogs.com/linhaifeng'),
    grequests.get('https://www.openstack.org')
]


##### Run the requests and get the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)

##### Run the requests and get the list of responses (with exception handling) #####
def exception_handler(request, exception):
    # print(request, exception)
    print('%s Request failed' % request.url)

response_list = grequests.map(request_list, exception_handler=exception_handler)
print(response_list)

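grequests example code

Everything above is underlying machinery: small fry, each with one special skill. Steal all of their tricks and you get the two powerful frameworks below.

The Twisted and Tornado frameworks

Twisted and Tornado handle the asynchronous calls and callbacks automatically. All we have to do is issue the requests.

Twisted: a networking framework that, among many other things, can send asynchronous requests, detect IO, and switch automatically (its skills run deep: it barely has to exert itself before all the small fry above are floored).

Installation: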

# Download Twisted-17.9.0-cp36-cp36m-win_amd64.whl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
# After downloading, put it on the C: drive (any location works)
pip3 install C:\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip3 install twisted
pip3 install pyopenssl

Basic usage example:

from twisted.web.client import getPage  # deprecated in newer Twisted releases; available in the 17.9.0 wheel installed above
from twisted.internet import defer, reactor

def all_done(arg):
    # print(arg)
    reactor.stop()

def callback(res):
    print(res)
    return 1

defer_list = []
urls = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'https://www.python.org',
]
for url in urls:
    obj = getPage(url.encode('utf-8'))
    obj.addCallback(callback)
    defer_list.append(obj)

# Stop the reactor once every request has either succeeded or failed
defer.DeferredList(defer_list).addBoth(all_done)

reactor.run()




# Detailed usage of Twisted's getPage
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()

Basic Twisted usage

Tornado:

Tornado is a very widely used framework.

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop

n = 0  # counter of outstanding requests

def handle_response(future):
    """
    Handle one finished request. A counter is maintained so the IO loop
    can be stopped with ioloop.IOLoop.current().stop() once all are done.
    :param future: the Future returned by AsyncHTTPClient.fetch
    :return:
    """
    global n
    try:
        response = future.result()  # raises if the request failed
        print(response.body)
    except Exception as error:
        print("Error:", error)
    finally:
        n -= 1  # one task finished, decrement the counter
        if n == 0:  # all tasks finished, stop the program
            ioloop.IOLoop.current().stop()

def func():
    global n
    url_list = [
        'http://www.baidu.com',
        'http://www.cnblogs.com',
    ]
    http_client = AsyncHTTPClient()
    for url in url_list:
        print(url)
        n += 1  # one more task outstanding
        # fetch returns a Future; attach the handler to it
        http_client.fetch(HTTPRequest(url)).add_done_callback(handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

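Basic Tornado usage

The ultimate move

Part one above covered how to approach the problem, part two the underlying principles, and part three two martial-arts masters. None of these, however, is what we actually use at work. For real crawling we turn to a peerless master that packages everything a crawler needs: the Scrapy framework.

For details, see the link: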
