
Crawling Free Proxies from the Web in Practice: The Tester Module

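1. Overview

  • After crawling a proxy from the web, for example proxy = '185.78.228.24:8000', how do we check whether it actually works?
  • The criterion for judging a proxy was covered in the storage module article; here is a brief recap. Every proxy carries a score. The highest score marks a proxy that is most likely usable, and the lowest score marks one that is least usable. A newly crawled proxy starts with a default score. When a test shows that a proxy works, its score is set to the highest value. Each time a full test pass finds a proxy unusable, its score is reduced by 1. Once the score drops to 0, the lowest score, the proxy is removed.
  • The detection principle is simple: take the proxy and use it to visit some websites. If a normal response comes back, the proxy is very likely usable. To be more certain, run a second check, that is, visit websites through the proxy two or more times, and treat the proxy as valid only if every request returns normally.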
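
The scoring rule described above can be sketched without Redis. The following is a minimal in-memory illustration, not the storage module's actual code; the class name ProxyScores and the constants INITIAL_SCORE, MAX_SCORE and MIN_SCORE, including their values, are assumptions chosen for the example.

```python
INITIAL_SCORE = 10   # assumed default score for a newly crawled proxy
MAX_SCORE = 100      # assumed highest score: the proxy passed its latest test
MIN_SCORE = 0        # lowest score: the proxy is removed when it gets here


class ProxyScores:
    """In-memory stand-in for the Redis-backed score table (hypothetical)."""

    def __init__(self):
        self.scores = {}

    def add(self, proxy):
        # A new proxy starts at the default score; an existing score is kept.
        self.scores.setdefault(proxy, INITIAL_SCORE)

    def max(self, proxy):
        # A proxy that passes the test jumps straight to the highest score.
        self.scores[proxy] = MAX_SCORE

    def decrease(self, proxy):
        # A failed test costs one point; at the lowest score the proxy is dropped.
        score = self.scores.get(proxy, INITIAL_SCORE) - 1
        if score <= MIN_SCORE:
            self.scores.pop(proxy, None)
        else:
            self.scores[proxy] = score


pool = ProxyScores()
pool.add('185.78.228.24:8000')
pool.max('185.78.228.24:8000')       # test passed: score becomes MAX_SCORE
pool.decrease('185.78.228.24:8000')  # a later test failed: MAX_SCORE - 1
print(pool.scores)
```

Jumping straight to the highest score on success means a proxy that worked recently survives many later failures before it is removed, while a proxy that never works disappears after only a few test passes.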

2. Implementation

Code:

Environment: Python 3.9.1, Redis: 3.5.3

Main module

import asyncio
import sys
import aiohttp
import re
from pyquery import PyQuery as pq
from proxypool.storage.redisclient import redisClient
from aiohttp.client_exceptions import ClientConnectorError, ClientHttpProxyError, \
    ServerDisconnectedError, ClientOSError, ClientResponseError
from asyncio import TimeoutError
from proxypool.untils.parse import bytes_convert_string
from lxml.etree import ParserError, XMLSyntaxError
from proxypool.untils.loggings import Logging
from proxypool.setting import COUNT, REDIS_KEY, TEST_URL, TEST_URL_SWITCH, ip111
from requests.exceptions import ConnectionError
from urllib3.exceptions import MaxRetryError, NewConnectionError
from socket import gaierror


# Exceptions that can occur while testing a proxy; extend the tuple when a new one shows up
Exceptions = (
    ClientConnectorError,
    ClientHttpProxyError,
    ClientOSError,
    ServerDisconnectedError,
    TimeoutError,
    ClientResponseError,
    AssertionError,
    ParserError,
    XMLSyntaxError,
    ConnectionError,
    MaxRetryError,
    NewConnectionError,
    gaierror,
)


class Tester(object):
    """
    Tests the proxies in the pool
    """

    def __init__(self):
        """
        Initialise the Redis client, the local Logging module and the asyncio event loop
        """
        self.redis = redisClient()
        self.logger = Logging()

        # https://github.com/aio-libs/aiohttp/issues/4536#issuecomment
        if sys.version_info[0] == 3 and sys.version_info[1] >= 8 and sys.platform.startswith('win'):
            asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        self.loop = asyncio.get_event_loop()

    async def test(self, proxy):
        """
        Check whether a single proxy is usable:
        1. Request https://www.httpbin.org/ip with and without the proxy and compare the
           reported IPs; if they differ, the proxy is considered to be in effect
        2. Optionally request the ip111.cn check page through the proxy; enabled when the
           ip111 switch in the settings is True; on success, access_proxy = True
        3. Optionally request TEST_URL through the proxy; enabled when TEST_URL_SWITCH is True,
           in which case TEST_URL (a domain address) must also be configured; on success,
           again_access_proxy = True
        """
        # Per-call flags, both False until a check succeeds. They are local variables so that
        # tests of different proxies running concurrently cannot affect each other.
        access_proxy = False
        again_access_proxy = False
        try:
            async with aiohttp.ClientSession() as session:
                if ip111:
                    # Request the IP-echo page through the proxy; the flag is set when the
                    # page contains an IP address. Interface from http://ip111.cn/
                    async with session.get(url='http://sspanel.net/ip.php', proxy=f'http://{proxy}') as response:
                        res_text = await response.text()
                        doc = pq(res_text).text()
                        match = re.search(r'(\d+\.\d+\.\d+\.\d+)', doc)
                        if match:
                            # access_proxy = match.group()
                            access_proxy = True

                # Get the public IP of the current host
                async with session.get(url='https://www.httpbin.org/ip') as response:
                    res_json = await response.json()
                    origin_ip = res_json.get('origin', None)

                # Get the IP that the target site sees when the request goes through the proxy
                async with session.get(url='https://www.httpbin.org/ip', proxy=f'http://{proxy}') as response:
                    res_json = await response.json()
                    _proxy_ip = res_json.get('origin', None)

                # Assertions
                # For example:
                # proxy = 1.2.4.8:1080
                # _proxy_ip = 1.2.4.8, the response of https://www.httpbin.org/ip through the proxy
                # origin_ip = 8.8.8.8, the response of https://www.httpbin.org/ip without the proxy
                # proxy_ip = 1.2.4.8
                assert origin_ip != _proxy_ip
                proxy_ip = proxy.split(":")[0]
                assert proxy_ip == _proxy_ip

                if TEST_URL_SWITCH:
                    # Check whether the proxy can reach the test URL
                    async with session.get(TEST_URL, proxy=f'http://{proxy}') as response:
                        if response.status == 200:
                            again_access_proxy = True

                # Note: when both ip111 and TEST_URL_SWITCH are off, neither flag can be set,
                # so the proxy is decreased here even though the assertions passed
                if access_proxy or again_access_proxy:
                    # Set the highest score
                    self.redis.max(REDIS_KEY, proxy)
                else:
                    # The proxy is unusable: reduce its score
                    self.redis.decrease(REDIS_KEY, proxy)
        except Exceptions:
            # The proxy is unusable: reduce its score
            self.redis.decrease(REDIS_KEY, proxy)
            self.logger.debug(f'proxy {proxy} is invalid')

    @Logging.catch()
    def run(self):
        self.logger.info('starting tester......')
        count = self.redis.get_count(REDIS_KEY)
        self.logger.debug(f'{count} proxies to test')
        cursor = 0
        while True:
            self.logger.debug(f'Testing proxies use cursor {cursor}, count {COUNT}')
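            # Fetch a batch of proxies
            cursor, proxies = self.redis.batch(REDIS_KEY, cursor, COUNT)
            if proxies:
                # Values read from Redis are Python bytes by default; bytes_convert_string converts them to str.
                # Each coroutine is wrapped in a Task because passing bare coroutines to
                # asyncio.wait has been deprecated since Python 3.8.
                tasks = [self.loop.create_task(self.test(bytes_convert_string(proxy[0]))) for proxy in proxies]
                self.loop.run_until_complete(asyncio.wait(tasks))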
            if not cursor:
                break


def runtest():
    # Test a single proxy directly instead of fetching a batch from Redis
    tester = Tester()
    proxy = '185.78.228.24:8000'
    tests = [tester.loop.create_task(tester.test(proxy))]
    tester.loop.run_until_complete(asyncio.wait(tests))
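
The two assertions in test() boil down to a comparison of three values. As a rough sketch, they can be pulled into a pure function that is easy to check in isolation. The function name proxy_hides_origin is hypothetical and not part of the original module, and the guard for missing values is an addition that the original assertions do not have.

```python
def proxy_hides_origin(proxy, origin_ip, proxied_ip):
    """Return True when the request visibly went through the proxy.

    proxy      -- 'host:port' string, e.g. '1.2.4.8:1080'
    origin_ip  -- IP reported by the echo service without a proxy
    proxied_ip -- IP reported by the echo service through the proxy
    """
    # Extra guard (assumption): a missing value counts as a failed check.
    if not origin_ip or not proxied_ip:
        return False
    # The target site must not see the real IP of this host.
    if origin_ip == proxied_ip:
        return False
    # The IP seen by the target site must be the host part of the proxy.
    return proxy.split(':')[0] == proxied_ip


ok = proxy_hides_origin('1.2.4.8:1080', '8.8.8.8', '1.2.4.8')
leaked = proxy_hides_origin('1.2.4.8:1080', '8.8.8.8', '8.8.8.8')
print(ok, leaked)
```

A check this strict also rejects a proxy whose outgoing IP differs from the address it listens on, which is the same behaviour as the second assertion in test().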

Other modules

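# proxypool/untils/parse.py
# Redis returns elements as bytes by default; convert them to str here
def bytes_convert_string(data):
    if data is None:
        return None
    if isinstance(data, bytes):
        return data.decode('utf8')
    # Already a str (or another type): return it unchanged instead of falling through to None
    return data

# proxypool/untils/loggings.py
# A logging wrapper class that every other module can call directly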
from loguru import logger
import time
from pathlib import Path

class Logging():
    """
    Log recording
    """
    _instance = None
    t = time.strftime('%Y_%m_%d')
    dir = Path.cwd().joinpath('log')

    logger.add(f'{dir}/crawl_{t}.log',
               enqueue=True,
               rotation='00:00',
               retention='1 months',
               compression='tar.gz',
               encoding='utf-8',
               backtrace=True)

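    def __new__(cls, *arg, **kwargs):
        # Singleton: create the instance once and reuse it afterwards
        if cls._instance is None:
            # object.__new__ rejects extra arguments, so none are forwarded
            cls._instance = super().__new__(cls)
        return cls._instance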

    def info(self, msg):
        return logger.info(msg)

    def debug(self, msg):
        return logger.debug(msg)

    def error(self, msg):
        return logger.error(msg)

    def exception(self, msg):
        return logger.exception(msg)

    @classmethod
    def catch(cls):
        return logger.catch
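
The main module imports COUNT, REDIS_KEY, TEST_URL, TEST_URL_SWITCH and ip111 from proxypool/setting.py, which is not listed in this article. One possible shape of that file is sketched below. Only the five names come from the import statement; every value is a placeholder assumption, not the project's real configuration.

```python
# proxypool/setting.py (hypothetical values, for illustration only)

# Name of the Redis key that holds the proxies and their scores (assumed).
REDIS_KEY = 'proxies'

# Number of proxies fetched from Redis per batch (assumed).
COUNT = 20

# Switch for the extra check against the IP-echo page.
ip111 = True

# Switch for the extra check against TEST_URL.
TEST_URL_SWITCH = False

# Site requested through the proxy when TEST_URL_SWITCH is True (placeholder).
TEST_URL = 'https://www.example.com'
```

With the tester as listed, at least one of the two switches has to be True for any proxy to reach the highest score, because the final decision looks only at the flags those two checks set.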

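3. Summary

The principle behind the tester module is fairly simple. Once we have a proxy and want to know whether it works, we can test it by visiting a website through it; visiting Baidu directly would do as well. To be confident that the proxy really is usable, run the test more than once.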
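
The advice to test more than once can be sketched as a small helper that accepts a proxy only when every attempt succeeds. The helper name passes_all_rounds and the check callback are assumptions made for illustration; in a real tester the callback would wrap one request sent through the proxy.

```python
import asyncio


async def passes_all_rounds(check, proxy, rounds=2):
    """Run check(proxy) `rounds` times; succeed only if every round does."""
    for _ in range(rounds):
        try:
            if not await check(proxy):
                return False
        except Exception:
            # Any error raised during a round counts as a failed round.
            return False
    return True


async def demo():
    calls = []

    async def flaky_check(proxy):
        # Stand-in for a real request: succeeds once, then fails.
        calls.append(proxy)
        return len(calls) == 1

    async def steady_check(proxy):
        # Stand-in for a real request that always succeeds.
        return True

    flaky = await passes_all_rounds(flaky_check, '185.78.228.24:8000')
    steady = await passes_all_rounds(steady_check, '185.78.228.24:8000')
    return flaky, steady


result = asyncio.run(demo())
print(result)
```

Requiring every round to pass trades some recall for reliability: a proxy that answers only intermittently is treated as unusable.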