Crawling Free Proxies from the Web in Practice: The Tester Module
阿新 • Published: 2021-07-22
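1. Overview
- Suppose we have crawled a proxy from the web, for example proxy = '185.78.228.24:8000'. How do we check whether it actually works?
- The criteria for judging whether a proxy is usable were covered in the storage module article, so here is a brief recap. Each proxy carries a score: the maximum score marks a proxy as most usable, and the minimum score marks it as least usable. A new proxy starts with a default score. When a test shows that the proxy works, its score is set to the maximum. Each time a test round finds the proxy unusable, its score is reduced by 1. Once the score drops to 0, the minimum, the proxy is removed.
- The detection principle is simple: take the proxy and use it to visit some websites. If the response comes back normally, the proxy is very likely usable. To be more certain, we can check more than once: visit websites through the proxy two or more times, and treat the proxy as valid only if every request returns normally.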
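The scoring rule described above can be sketched with a small in-memory stand-in for the Redis store. This is only an illustration: the class name `ScoreBoard` and the constants `MAX_SCORE`, `MIN_SCORE` and `INITIAL_SCORE` are assumptions made for this example, not values taken from the project's settings.

```python
# Hypothetical constants; the real values live in the project's settings module.
MAX_SCORE = 100
MIN_SCORE = 0
INITIAL_SCORE = 10


class ScoreBoard:
    """In-memory stand-in for the Redis-backed store (illustration only)."""

    def __init__(self):
        self.scores = {}

    def add(self, proxy):
        # A newly crawled proxy starts at the default score.
        self.scores.setdefault(proxy, INITIAL_SCORE)

    def max(self, proxy):
        # A proxy that passes the check jumps straight to the top score.
        self.scores[proxy] = MAX_SCORE

    def decrease(self, proxy):
        # A failed check costs one point; at the minimum score the proxy is removed.
        if proxy not in self.scores:
            return
        self.scores[proxy] -= 1
        if self.scores[proxy] <= MIN_SCORE:
            del self.scores[proxy]


board = ScoreBoard()
board.add('185.78.228.24:8000')
board.max('185.78.228.24:8000')       # the check passed
board.decrease('185.78.228.24:8000')  # a later check failed
print(board.scores)
```

Jumping to the maximum on success while subtracting only one point per failure means a proxy that worked recently survives many failed rounds before it is dropped.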
2. Implementation
Code implementation:
Environment: Python 3.9.1, Redis: 3.5.3
Main module
import asyncio
import sys
import aiohttp
import re
from pyquery import PyQuery as pq
from proxypool.storage.redisclient import redisClient
from aiohttp.client_exceptions import ClientConnectorError, ClientHttpProxyError, \
    ServerDisconnectedError, ClientOSError, ClientResponseError
from asyncio import TimeoutError
from proxypool.untils.parse import bytes_convert_string
from lxml.etree import ParserError, XMLSyntaxError
from proxypool.untils.loggings import Logging
from proxypool.setting import COUNT, REDIS_KEY, TEST_URL, TEST_URL_SWITCH, ip111
from requests.exceptions import ConnectionError
from urllib3.exceptions import MaxRetryError, NewConnectionError
from socket import gaierror

# Testing a proxy can fail in many ways; add further exception classes here as they show up
Exceptions = (
    ClientConnectorError,
    ClientHttpProxyError,
    ClientOSError,
    ServerDisconnectedError,
    TimeoutError,
    ClientResponseError,
    AssertionError,
    ParserError,
    XMLSyntaxError,
    ConnectionError,
    MaxRetryError,
    NewConnectionError,
    gaierror,
)


class Tester(object):
    """
    Tests the proxies in the pool
    """

    def __init__(self):
        """
        Set up the redis client, the local Logging module and the asyncio event loop
        """
        self.redis = redisClient()
        self.logger = Logging()
        # https://github.com/aio-libs/aiohttp/issues/4536#issuecomment
        if sys.version_info >= (3, 8) and sys.platform.startswith('win'):
            asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        self.loop = asyncio.get_event_loop()

    async def test(self, proxy):
        """
        Check whether a single proxy is usable:
        1. Request https://www.httpbin.org/ip and assert that the IP seen with the proxy
           differs from the IP seen without it; if they differ, the proxy is considered usable
        2. Optional check against the ip111.cn site (through the proxy), switched on when
           ip111 is True; on success, access_proxy = True
        3. Optional check against TEST_URL (through the proxy), switched on when
           TEST_URL_SWITCH is True; TEST_URL must be configured and is a domain address;
           on success, again_access_proxy = True
        """
        # Per-proxy flags, unusable by default. They are local (not module-level globals)
        # so that one working proxy cannot mark every later proxy as usable.
        access_proxy = False
        again_access_proxy = False
        try:
            async with aiohttp.ClientSession() as session:
                if ip111:
                    # Check the proxy, True if available, Interface from http://ip111.cn/
                    async with session.get(url='http://sspanel.net/ip.php',
                                           proxy=f'http://{proxy}') as response:
                        res_text = await response.text()
                        doc = pq(res_text).text()
                        match = re.search(r'(\d+\.\d+\.\d+\.\d+)', doc)
                        if match:
                            # access_proxy = match.group()
                            access_proxy = True
                # Get the public IP of the current host
                async with session.get(url='https://www.httpbin.org/ip') as response:
                    res_json = await response.json()
                    origin_ip = res_json.get('origin', None)
                # Request the same URL through the proxy
                async with session.get(url='https://www.httpbin.org/ip',
                                       proxy=f'http://{proxy}') as response:
                    res_json = await response.json()
                    _proxy_ip = res_json.get('origin', None)
                # Assertions
                # For example:
                # proxy = 1.2.4.8:1080
                # _proxy_ip = response of https://www.httpbin.org/ip, 1.2.4.8 (with proxy)
                # origin_ip = response of https://www.httpbin.org/ip, 8.8.8.8 (without proxy)
                # proxy_ip = 1.2.4.8
                assert origin_ip != _proxy_ip
                proxy_ip = proxy.split(":")[0]
                assert proxy_ip == _proxy_ip
                if TEST_URL_SWITCH:
                    # Check whether the proxy can reach the test URL
                    async with session.get(TEST_URL, proxy=f'http://{proxy}') as response:
                        if response.status == 200:
                            again_access_proxy = True
                if access_proxy or again_access_proxy:
                    # Set the maximum score
                    self.redis.max(REDIS_KEY, proxy)
                else:
                    # Proxy is unusable, decrease its score
                    self.redis.decrease(REDIS_KEY, proxy)
        except Exceptions:
            # Proxy is unusable, decrease its score
            self.redis.decrease(REDIS_KEY, proxy)
            self.logger.debug(f'proxy {proxy} is invalid')

    @Logging.catch()
    def run(self):
        self.logger.info('starting tester......')
        count = self.redis.get_count(REDIS_KEY)
        self.logger.debug(f'{count} proxies to test')
        cursor = 0
        while True:
            self.logger.debug(f'Testing proxies use cursor {cursor}, count {COUNT}')
            # Fetch a batch of proxies
            cursor, proxies = self.redis.batch(REDIS_KEY, cursor, COUNT)
            if proxies:
                # Values read from redis are Python bytes by default; bytes_convert_string decodes them.
                # The coroutines are wrapped in tasks because asyncio.wait() no longer
                # accepts bare coroutines on newer Python versions.
                tasks = [self.loop.create_task(self.test(bytes_convert_string(proxy[0])))
                         for proxy in proxies]
                self.loop.run_until_complete(asyncio.wait(tasks))
            if not cursor:
                break


def runtest():
    # Test a single proxy without reading it from redis
    tester = Tester()
    proxy = '185.78.228.24:8000'
    tests = [tester.loop.create_task(tester.test(proxy))]
    tester.loop.run_until_complete(asyncio.wait(tests))
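The IP comparison at the heart of the test can be pulled out into a pure function, which makes it easy to try without any network access. The sketch below is an illustration only: `proxy_masks_origin` is a hypothetical helper name that does not exist in the project, and it returns a boolean instead of raising `AssertionError` as the module does.

```python
def proxy_masks_origin(origin_ip, proxied_ip, proxy):
    """Return True when the proxied response shows the proxy's IP rather than ours.

    origin_ip  -- IP reported when requesting without the proxy
    proxied_ip -- IP reported when requesting through the proxy
    proxy      -- proxy address in 'host:port' form
    """
    if not origin_ip or not proxied_ip:
        # A missing 'origin' field means the check cannot succeed.
        return False
    proxy_host = proxy.split(':')[0]
    return origin_ip != proxied_ip and proxied_ip == proxy_host


# Same sample values as the comments in the module above.
print(proxy_masks_origin('8.8.8.8', '1.2.4.8', '1.2.4.8:1080'))  # proxy hides our IP
print(proxy_masks_origin('8.8.8.8', '8.8.8.8', '1.2.4.8:1080'))  # our own IP leaked through
```

Requiring the reported IP to equal the proxy host is a strict rule: a proxy that exits through a different address than the one it listens on would be rejected by it.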
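Other modules
# proxypool/untils/parse.py
# Redis returns elements as bytes by default; convert them to str here
def bytes_convert_string(data):
    if data is None:
        return None
    if isinstance(data, bytes):
        return data.decode('utf8')
    # Already a str (or another type): return it unchanged instead of falling through to None
    return data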
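As a usage illustration, the sketch below decodes a whole batch at once. It assumes each batch entry is a `(member, score)` pair, which is what the `proxy[0]` indexing in the main module suggests; `decode_members` is a hypothetical helper, not part of the project.

```python
def decode_members(batch):
    """Decode the member of each (member, score) pair; str members pass through."""
    result = []
    for member, _score in batch:
        if isinstance(member, bytes):
            member = member.decode('utf-8')
        result.append(member)
    return result


# Sample data shaped like a sorted-set scan result (assumed shape).
sample = [(b'185.78.228.24:8000', 100.0), ('1.2.4.8:1080', 9.0)]
print(decode_members(sample))
```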
# proxypool/untils/loggings.py
# A thin logging wrapper class that every other module can call directly
from loguru import logger
import time
from pathlib import Path


class Logging():
    """
    Log recording
    """
    _instance = None
    t = time.strftime('%Y_%m_%d')
    dir = Path.cwd().joinpath('log')
    logger.add(f'{dir}/crawl_{t}.log', enqueue=True, rotation='00:00', retention='1 months',
               compression='tar.gz', encoding='utf-8', backtrace=True)

    def __new__(cls, *arg, **kwargs):
        # Singleton: create the instance once and reuse it afterwards.
        # object.__new__ takes no extra arguments, so none are forwarded.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def info(self, msg):
        return logger.info(msg)

    def debug(self, msg):
        return logger.debug(msg)

    def error(self, msg):
        return logger.error(msg)

    def exception(self, msg):
        return logger.exception(msg)

    @classmethod
    def catch(cls):
        return logger.catch
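3. Summary
The tester module works on a simple principle: once we have a proxy and want to know whether it is usable, we visit a website through it. Visiting Baidu directly is good enough. To be confident in the result, test each proxy more than once.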
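The idea of testing a proxy more than once can be sketched as a small wrapper around any single check. This is a minimal sketch under stated assumptions: `check_repeatedly` and `fake_check` are hypothetical names, and `fake_check` merely stands in for a real request through the proxy so that the example runs offline.

```python
import asyncio


async def check_repeatedly(check, proxy, rounds=2):
    """Return True only if `check(proxy)` succeeds in every round."""
    for _ in range(rounds):
        try:
            ok = await check(proxy)
        except Exception:
            # Any error during a round counts as a failed check.
            return False
        if not ok:
            return False
    return True


async def fake_check(proxy):
    # Stand-in for a real request through the proxy (hypothetical rule for the demo).
    await asyncio.sleep(0)
    return proxy.endswith(':8000')


print(asyncio.run(check_repeatedly(fake_check, '185.78.228.24:8000')))
```

Stopping at the first failed round keeps the cost of testing a dead proxy low, while a working proxy pays for all rounds.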