Detecting a Web Page's Encoding in Python
阿新 • Published: 2019-01-29
There is a kind of thirst that only wine can quench, and that thirst is loneliness.
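Finding data based on the encoding the page returns
For example, if I want a page's title, matching it directly with the regular expression <title>(.*?)</title> is enough. Often, though, requests cannot decode the page correctly because of encoding problems, so the data cannot be retrieved.
Solution: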
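To make the failure concrete, here is a minimal sketch that needs no network access. The sample HTML and the variable names are made up for illustration. It simulates what happens when a GBK page is decoded as ISO-8859-1, which is the fallback requests applies to text responses whose headers carry no charset.

```python
import re

# Hypothetical page body: a GBK-encoded HTML document (sample data, not a real site)
raw = '<html><head><title>你好世界</title></head></html>'.encode('gbk')

# Decoding as ISO-8859-1 never raises an error, it just yields garbled text
wrong = raw.decode('iso-8859-1')
# Decoding with the charset the page was actually written in recovers the text
right = raw.decode('gbk')

pattern = re.compile(r'<title>(.*?)</title>', re.S)
wrong_title = pattern.search(wrong).group(1)
right_title = pattern.search(right).group(1)

print(repr(wrong_title))  # garbled characters
print(right_title)        # the intended title
```

The regular expression matches in both cases because the ASCII tags survive either decoding; only the text between them is damaged.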
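import re
import requests

url = 'example.com'  # placeholder host; replace with the target site
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder request headers

r_port_top = requests.get(url='http://' + url, headers=headers, timeout=5)
port_title = None
# requests reports ISO-8859-1 when a text Content-Type header carries no charset
if r_port_top.encoding == 'ISO-8859-1':
    # Prefer the charset declared inside the HTML itself
    encodings = requests.utils.get_encodings_from_content(r_port_top.text)
    if encodings:
        encoding = encodings[0]
    else:
        # Otherwise let requests guess from the raw bytes
        encoding = r_port_top.apparent_encoding or 'utf-8'
    # Decode the raw bytes to text, replacing undecodable bytes
    encode_content = r_port_top.content.decode(encoding, 'replace')
    match = re.search('<title>(.*?)</title>', encode_content, re.S)
    if match:
        port_title = match.group(1)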
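This approach first determines the page's encoding and then converts it. Sometimes, however, the reported encoding is something else such as utf-8 and the code above does nothing, so next comes the "ultimate" version.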
import random
import re
import requests

url = 'example.com'  # placeholder host; replace with the target site
headerss = ['Mozilla/5.0']  # placeholder pool of User-Agent strings
# Reported encodings that are re-checked against the page content
RECHECK = ('ISO-8859-1', 'GB2312', 'GBK')

try:
    UA = random.choice(headerss)
    headers = {'User-Agent': UA}
    r_port_top = requests.get(url='http://' + url, headers=headers, timeout=5)
    if (r_port_top.encoding or '').upper() in RECHECK:
        # Prefer the charset declared in the HTML, else guess from the bytes
        encodings = requests.utils.get_encodings_from_content(r_port_top.text)
        if encodings:
            encoding = encodings[0]
        else:
            encoding = r_port_top.apparent_encoding
        encode_content = r_port_top.content.decode(encoding, 'replace')
    else:
        # Any other reported encoding is trusted as-is
        encode_content = r_port_top.text
    port_title = re.search('<title>(.*?)</title>', encode_content, re.S).group(1)
except Exception:
    try:
        # Last attempt: match against whatever requests decoded
        port_title = re.search('<title>(.*?)</title>', r_port_top.text, re.S).group(1)
    except Exception:
        port_title = 'Unable to retrieve the site title for now'
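The detect-then-decode steps used above can also be pulled into one function that works on raw bytes, which makes the logic easy to try without a network connection. This is a minimal sketch under stated assumptions, not the author's code: `decode_page` and its parameters are hypothetical names, the `<meta>` lookup is a simplified regular expression rather than a full HTML parser, and the final fallback to utf-8 is a choice made for this example.

```python
import codecs
import re

# Simplified pattern for <meta charset="..."> and
# <meta ... content="text/html; charset=...">
META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?\s*([A-Za-z0-9_\-]+)', re.I)


def decode_page(content, header_encoding=None):
    """Decode raw page bytes, preferring the charset declared in the HTML.

    content: raw response bytes (what requests exposes as .content).
    header_encoding: the encoding taken from the HTTP headers, if any.
    """
    candidates = []
    match = META_CHARSET.search(content)
    if match:
        candidates.append(match.group(1).decode('ascii'))
    if header_encoding:
        candidates.append(header_encoding)
    candidates.append('utf-8')  # assumed last-resort default

    for name in candidates:
        try:
            codecs.lookup(name)  # skip charset names Python does not know
        except LookupError:
            continue
        return content.decode(name, 'replace')
```

With requests, this would be called as `decode_page(r_port_top.content, r_port_top.encoding)`.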
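Detecting and converting directly with chardet
The method above is really too clumsy. chardet solves the page-encoding problem with much less effort.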
# -*- coding: utf-8 -*-
# @Time    : 2018/5/4 0004 8:55
# @Author  : Langzi
# @Blog    : www.langzi.fun
# @File    : get urls.py
# @Software: PyCharm
import re

import chardet
import requests

url = 'https://stackoverflow.com'
d1 = requests.get(url)
print(d1.content)
if isinstance(d1.content, str):
    # Already text, nothing to decode
    a = d1.content
else:
    # Guess the encoding from the raw bytes, then decode them
    codesty = chardet.detect(d1.content)
    # chardet may report None as the encoding; fall back to utf-8 in that case
    a = d1.content.decode(codesty['encoding'] or 'utf-8')
The resulting a is the page content after decoding. At this point a plain re.search('<title>(.*?)</title>', a) is enough to match the title of any site.
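chardet's `detect` returns a dict that carries a `confidence` value alongside `encoding`, and `encoding` can be `None` when detection fails. Below is a hedged sketch of how that result might be handled, followed by the title match. `decode_with_detector`, `extract_title`, the 0.5 threshold and the utf-8 fallback are all assumptions of this example, and the detector is passed in as a parameter so the logic can be tried without the network.

```python
import re


def decode_with_detector(raw, detect, min_confidence=0.5, fallback='utf-8'):
    """Decode bytes using a chardet-style detector.

    detect: a callable such as chardet.detect, returning a dict with
            'encoding' and 'confidence' keys.
    min_confidence and fallback are assumed values; tune them as needed.
    """
    if isinstance(raw, str):
        return raw  # already text
    result = detect(raw)
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # Fall back when detection failed or the guess is weak
    if not encoding or confidence < min_confidence:
        encoding = fallback
    return raw.decode(encoding, 'replace')


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.S | re.I)
    return match.group(1).strip() if match else None
```

In real use this would be called as `extract_title(decode_with_detector(d1.content, chardet.detect))`. The case-insensitive pattern also accepts `<TITLE>` and tags with attributes, which the simpler expression would miss.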
Personal blog: www.langzi.fun
Feel free to get in touch about Python development and security testing.