When crawling Weibo, be sure to watch your request rate

Basic test script (Python):

import time

import requests

# Weibo mobile search endpoint used as the probe URL
URL = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D61%26q%3D%E7%96%AB%E6%83%85%26t%3D0&page_type=searchall&page=2"
TEST_SECONDS = 5 * 60  # run each rate for 5 minutes


def test_ip_freq(freq):
    """Request URL at a nominal rate of `freq` requests per second and report the outcome."""
    if freq == 0:
        return
    delay = 1 / freq
    t0 = time.time()
    requests_num = 0
    status = "success"
    while True:
        r = requests.get(URL)
        if r.status_code != 200:
            # Any non-200 response is treated as the IP being blocked
            status = "fail"
            break
        requests_num += 1
        if time.time() - t0 > TEST_SECONDS:
            break
        time.sleep(delay)
    elapsed = time.time() - t0
    print("Current request rate: {0}/s, status: {1}, total requests: {2}, elapsed: {3}s, actual rate: {4}".format(freq, status, requests_num, elapsed, requests_num / elapsed))
    return status


# Step the rate up until a request fails
for i in [0.3, 0.35, 0.4, 0.45, 0.5]:
    status = test_ip_freq(i)
    if status == "fail":
        break

# Measure how long the IP stays blocked, polling every 10 seconds
t0 = time.time()
while True:
    r = requests.get(URL)
    if r.status_code == 200:
        break
    time.sleep(10)
print("IP was blocked for {0}s".format(time.time() - t0))

Test results:

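Current request rate: 0.3/s, status: success, total requests: 81, elapsed: 303.2352440357208s, actual rate: 0.2671193457659502
Current request rate: 0.35/s, status: success, total requests: 91, elapsed: 302.8865134716034s, actual rate: 0.30044256166107425
Current request rate: 0.4/s, status: fail, total requests: 53, elapsed: 164.40774130821228s, actual rate: 0.3223692484202544
IP was blocked for 183s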
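
In this single run, an actual rate of about 0.30 requests/s survived five minutes, about 0.32 requests/s got the IP blocked, and the block lasted roughly three minutes. One way to act on that is a small throttle that stays under the limit and pauses after any non-200 response. The following is only a sketch under assumptions: the class name `ThrottledFetcher`, the default `max_rate` of 0.25/s, the 200 s `cooldown`, and the retry count are illustrative choices read off the numbers above, not values confirmed by Weibo, and the real limit may differ by IP, endpoint, or time.

```python
import time


class ThrottledFetcher:
    """Rate-limit calls to fetch(url) and pause after a non-200 response.

    `fetch` is any callable returning (status_code, body). The defaults are
    assumptions taken from one experiment, not documented limits.
    """

    def __init__(self, fetch, max_rate=0.25, cooldown=200.0, max_retries=3,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch
        self.min_interval = 1.0 / max_rate  # seconds between request starts
        self.cooldown = cooldown            # pause after a blocked request
        self.max_retries = max_retries
        self.clock = clock
        self.sleep = sleep
        self._last_start = None

    def _wait_for_slot(self):
        # Sleep until at least min_interval has passed since the last request started
        if self._last_start is not None:
            remaining = self.min_interval - (self.clock() - self._last_start)
            if remaining > 0:
                self.sleep(remaining)
        self._last_start = self.clock()

    def get(self, url):
        for _ in range(self.max_retries + 1):
            self._wait_for_slot()
            status, body = self.fetch(url)
            if status == 200:
                return body
            # Treat any other status as a block and wait it out before retrying
            self.sleep(self.cooldown)
        raise RuntimeError("still blocked after {0} retries".format(self.max_retries))
```

To use it with `requests`, pass a function that calls `requests.get(url, timeout=10)` and returns `(r.status_code, r.text)`. Unlike the test script, which sleeps after each response, this spaces requests by their start times, so the actual rate stays close to `max_rate` regardless of response latency.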

Recommended HTTPS proxy:

Zhima Proxy (芝麻代理): http://h.zhimaruanjian.com/
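
If proxies are used to spread requests over several IPs, `requests` accepts a `proxies` mapping on each call. The sketch below rotates through a list round-robin. The helper names `make_proxy_cycle` and `fetch_via_proxy` are hypothetical, and the proxy addresses must come from whatever the provider actually issues; nothing here reflects a specific provider's API.

```python
import itertools


def make_proxy_cycle(proxy_urls):
    """Return an endless round-robin iterator of requests-style proxies dicts."""
    return itertools.cycle([{"http": u, "https": u} for u in proxy_urls])


def fetch_via_proxy(url, proxies, timeout=10):
    """Fetch url through the given proxies dict; returns (status_code, body)."""
    # Imported here so the rotation helper above has no third-party dependency
    import requests

    r = requests.get(url, proxies=proxies, timeout=timeout)
    return r.status_code, r.text
```

Each call to `next()` on the cycle yields the mapping for the next proxy, which can be passed straight to `fetch_via_proxy`. The per-IP rate observed above presumably still applies to each proxy individually, so rotation complements throttling rather than replacing it.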