[Hands-On] Free Proxies!
阿新 • Published: 2018-12-18
Introduction
As an individual crawler developer, one of the most frustrating problems is proxy IPs. Today we'll build a usable proxy IP pool ourselves.
Requirements
Scrape the working high-anonymity proxies listed on the Xici proxy site (西刺代理).
Key Points
Fetching data: Requests
Parsing data: BeautifulSoup
Database: MongoDB
Main Code
The site's structure is simple, so I won't dwell on analyzing it. Here is part of the code.
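For reference, the snippets below assume roughly these imports (a minimal sketch; the full source linked at the end has the exact setup):

import requests
import pymongo
from bs4 import BeautifulSoup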
Sending the request with Requests:
def get_response(self):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    }
    self.response = requests.get(self.url, headers=headers).text
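get_response is a method, so it hangs off a crawler class that holds the target URL and the fetched HTML. A minimal sketch of what that wrapper might look like — the class name XiciSpider and the URL are my assumptions, not part of the original:

class XiciSpider:
    # Hypothetical wrapper class; the original post only shows its methods.
    def __init__(self, url="https://www.xicidaili.com/nn/"):  # assumed address of the high-anonymity list
        self.url = url
        self.response = None  # filled in by get_response()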
Extracting the target content:
def get_ip_info_list(self):
    soup = BeautifulSoup(self.response, "lxml")
    ip_list = soup.find(id="ip_list")
    ip_detail = ip_list.find_all(name="tr")[1:]   # skip the table header row
    for ip in ip_detail:
        item = {}
        tds = ip.find_all(name="td")
        item['ip'] = tds[1].string
        item['port'] = tds[2].string
        try:
            item['location'] = tds[3].find(name="a").string
        except:
            item['location'] = tds[3].string.strip()
        item['anonymous'] = tds[4].string
        item['type'] = tds[5].string
        item['speed'] = tds[6].find(class_='bar').attrs['title']
        item['connect_time'] = tds[7].find(class_='bar').attrs['title']
        item['alive_time'] = tds[8].string
        item['verify_time'] = tds[9].string
        # Yield only fresh, non-duplicate proxies; stop once entries get too old.
        if self.check_verify_time("20" + item['verify_time'].split(" ")[0]):
            if not check_proxy_duplicate(item):
                yield item
        else:
            return
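check_verify_time is called above but not shown in this excerpt. One plausible sketch, assuming it simply filters out proxies whose last verification date is too old — the date format and the one-day threshold are my assumptions:

from datetime import datetime, timedelta

def check_verify_time(self, verify_date, max_age_days=1):
    # verify_date arrives as a string like "2018-12-18" (the caller prepends
    # "20" to the two-digit year shown on the site).
    verified = datetime.strptime(verify_date, "%Y-%m-%d")
    return datetime.now() - verified <= timedelta(days=max_age_days)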
Checking whether an IP is already stored:
def check_proxy_duplicate(proxy):
    ip = proxy['ip']
    curr = pymongo.MongoClient()
    db = curr['proxy']
    collection = db['proxy']
    ip_exist = collection.find({"ip": ip})
    ip_exist_list = []
    for i in ip_exist:
        ip_exist_list.append(i)
    if ip_exist_list:
        print("%s already exists" % ip)
        return True
    else:
        return False
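The query-then-compare round trip works, but if you would rather let MongoDB enforce uniqueness itself, a unique index on the ip field is an alternative (a sketch, not part of the original code):

def ensure_unique_ip_index():
    # With a unique index, insert_one() raises DuplicateKeyError on repeats,
    # so the pre-insert lookup becomes optional.
    client = pymongo.MongoClient()
    client['proxy']['proxy'].create_index("ip", unique=True)
    client.close()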
Saving to the database:
def save_mongo(item):
    curr = pymongo.MongoClient()
    db = curr['proxy']
    collection = db['proxy']
    collection.insert_one(item)
    curr.close()
    print("%s saved" % (item['ip']))
Checking whether a proxy still works:
def check_proxy_enable(proxy):
    proxy_string = proxy['type'].lower() + "://" + proxy['ip'] + ":" + proxy['port']
    # requests matches the proxies dict against the lowercase URL scheme,
    # so the keys must be 'http'/'https', not 'HTTP'/'HTTPS'.
    if proxy['type'] == "HTTP":
        proxy_for_check = {'http': proxy_string}
    elif proxy['type'] == 'HTTPS':
        proxy_for_check = {'https': proxy_string}
    try:
        # The timeout keeps a dead proxy from stalling the whole check.
        requests.get("http://www.sina.com.cn", proxies=proxy_for_check, timeout=10)
    except:
        del_proxy_from_mongo(proxy)
    else:
        update_proxy_from_mongo(proxy)
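del_proxy_from_mongo and update_proxy_from_mongo are also not shown in this excerpt. A plausible sketch is to drop dead proxies and re-stamp working ones with the time of the last successful check (the check_time field is my assumption):

import datetime

def del_proxy_from_mongo(proxy):
    # Remove a proxy that failed the liveness check.
    client = pymongo.MongoClient()
    client['proxy']['proxy'].delete_one({"ip": proxy['ip']})
    client.close()
    print("%s removed" % proxy['ip'])

def update_proxy_from_mongo(proxy):
    # Mark a working proxy with the time of the last successful check.
    client = pymongo.MongoClient()
    client['proxy']['proxy'].update_one(
        {"ip": proxy['ip']},
        {"$set": {"check_time": datetime.datetime.now()}}
    )
    client.close()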
Crawl Results
Viewing the results in MongoDB:
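For example, a quick way to inspect what landed in the collection from Python:

client = pymongo.MongoClient()
for doc in client['proxy']['proxy'].find().limit(10):  # show the first few stored proxies
    print(doc['ip'], doc['port'], doc['type'])
client.close()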
Source Code
Link: https://pan.baidu.com/s/1MgwmhUKLnpTKnI-HF2JChQ  Extraction code: 4lwi