urllib.error.URLError when running a crawler on macOS
阿新 • Published: 2018-12-08
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1045)>
I hit this error on macOS, but the same code runs fine on Windows 10. Searching around, I found a blog post by 許流星 with a fix. It does solve the problem, but it doesn't explain why the problem happens in the first place. From his post:

Possible cause:
Python 2.7.9 introduced a new behavior: when you open an https URL with urllib.urlopen, the SSL certificate is verified. If the target site uses a self-signed certificate, this raises an error like urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>.
The workaround is as follows:
import ssl
import urllib.request
# This restores the same behavior as before.
context = ssl._create_unverified_context()
response = urllib.request.urlopen("https://no-valid-cert", context=context)
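You can inspect what `_create_unverified_context()` actually changes without making any network request. A quick check, standard library only, showing that the unverified context simply turns certificate and hostname verification off:

```python
import ssl

# The default context verifies certificates (Python 2.7.9+ / 3.4.3+ behavior).
default_ctx = ssl.create_default_context()
print(default_ctx.verify_mode == ssl.CERT_REQUIRED)   # True

# _create_unverified_context() disables both certificate and hostname checks,
# which is why self-signed certificates stop raising URLError.
unverified_ctx = ssl._create_unverified_context()
print(unverified_ctx.verify_mode == ssl.CERT_NONE)    # True
print(unverified_ctx.check_hostname)                  # False
```

Note that the leading underscore marks `_create_unverified_context` as internal API: it is fine for a one-off crawler, but it disables all verification, so avoid it for anything that handles sensitive data.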
With that in place, the code I copied from the web (scraping the League of Legends streamer rankings on Panda TV) runs perfectly:
import re
import ssl
from urllib import request

context = ssl._create_unverified_context()


class Spider:
    # [\w\W]*? matches any character, non-greedy
    url = 'https://www.panda.tv/cate/lol'
    root_pattern = r'<div class="video-info">([\w\W]*?)</div>'
    name_pattern = r'</i>([\w\W]*?)</span>'
    number_pattern = r'<span class="video-number">([\w\W]*?)</span>'

    def __fetch_content(self):
        r = request.urlopen(Spider.url, context=context)
        # read() returns bytes; decode to str
        htmls = r.read()
        htmls = str(htmls, encoding='utf-8')
        return htmls

    def __analysis(self, htmls):
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        # anonymous function (lambda): keep the first match, strip whitespace
        l = lambda anchor: {'name': anchor['name'][0].strip(),
                            'number': anchor['number'][0]}
        return map(l, anchors)

    def __sort(self, anchors):
        # sorted() is ascending by default; reverse=True for descending
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    def __sort_seed(self, anchor):
        # \d+\.?\d* keeps the decimal part of counts like '3.5萬'
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(r[0])
        if '萬' in anchor['number']:  # 萬 = 10,000
            number *= 10000
        return number

    def __show(self, anchors):
        for rank in range(0, len(anchors)):
            print('rank' + str(rank + 1) + ':' + anchors[rank]['name']
                  + ' ' + anchors[rank]['number'])

    def go(self):
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()
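The sort key is the only non-obvious part, and it can be exercised without any network access. A minimal sketch of the same conversion (`viewer_count` is a hypothetical standalone helper mirroring `__sort_seed`):

```python
import re

def viewer_count(text):
    # Parse the leading number; 萬 (10,000) scales it, as in __sort_seed.
    number = float(re.findall(r'\d+\.?\d*', text)[0])
    if '萬' in text:
        number *= 10000
    return number

print(viewer_count('3.5萬'))  # 35000.0
print(viewer_count('876'))    # 876.0
```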
Finally