
Determining Web Page Encoding in Python

There is a kind of thirst that only wine can quench; that thirst is loneliness.

Extracting data based on the encoding a page returns

For example, if I want the title of a page, a simple regex match on <title>(.*?)</title> is enough. But in many cases the requests library cannot decode the response correctly because of encoding problems, so the match fails and nothing is extracted.
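A minimal sketch of the symptom (the URL is only a placeholder, and this assumes requests 2.x): when the HTTP headers do not declare a charset, requests falls back to ISO-8859-1, so r.text of a GBK or UTF-8 page can come out garbled even though apparent_encoding would have guessed the real charset.

    import requests

    r = requests.get('http://example.com', timeout=5)   # placeholder URL
    print(r.encoding)            # charset taken from the HTTP headers, often ISO-8859-1 when none is declared
    print(r.apparent_encoding)   # charset guessed from the response body (chardet / charset_normalizer under the hood)
    print(r.text[:200])          # decoded with r.encoding -- may be mojibake if that fallback is wrong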

Solution:

        r_port_top = requests.get(url=str('http://' + url), headers=headers, timeout=5)
        if r_port_top.encoding == 'ISO-8859-1':
            # prefer the charset declared in the HTML <meta> tag, fall back to the detected one
            encodings = requests.utils.get_encodings_from_content(r_port_top.text)
            if encodings:
                encoding = encodings[0]
            else:
                encoding = r_port_top.apparent_encoding
            encode_content = r_port_top.content.decode(encoding, 'replace').encode('utf-8', 'replace')
            port_title = re.search('<title>(.*?)</title>', encode_content, re.S).group(1)

This approach first checks the page's declared encoding and then converts the content. But it only covers the ISO-8859-1 case; pages that declare other charsets (GB2312, GBK and so on) still slip through, so here is the "ultimate" version.

    try:
        UA = random.choice(headerss)  # headerss: a predefined list of User-Agent strings
        headers = {'User-Agent': UA}
        r_port_top = requests.get(url=str('http://' + url), headers=headers, timeout=5)
        # the handling is identical for every mis-declared charset, so one
        # membership test covers ISO-8859-1, GB2312 and GBK in any letter case
        if (r_port_top.encoding or '').lower() in ('iso-8859-1', 'gb2312', 'gbk'):
            encodings = requests.utils.get_encodings_from_content(r_port_top.text)
            if encodings:
                encoding = encodings[0]
            else:
                encoding = r_port_top.apparent_encoding
            encode_content = r_port_top.content.decode(encoding, 'replace').encode('utf-8', 'replace')
            port_title = re.search('<title>(.*?)</title>', encode_content, re.S).group(1)
        else:
            port_title = re.search('<title>(.*?)</title>', r_port_top.content, re.S).group(1)
    except:
        try:
            port_title = re.search('<title>(.*?)</title>', r_port_top.content, re.S).group(1)
        except:
            port_title = '暫時無法獲取網站標題'  # fallback: "unable to fetch the site title for now"
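For reference, the same detect-then-decode idea can be wrapped in a small helper. This is only a sketch with my own naming (fetch_title is not from the original), assuming Python 3 and requests 2.x, where get_encodings_from_content still exists but is deprecated:

    import re
    import requests

    def fetch_title(url, headers=None, timeout=5):
        """Return the <title> of a page, decoding with the declared or detected charset."""
        r = requests.get(url, headers=headers, timeout=timeout)
        if (r.encoding or '').lower() in ('iso-8859-1', 'gb2312', 'gbk'):
            # charset from the HTML <meta> tag if present, otherwise the detected one
            encodings = requests.utils.get_encodings_from_content(r.text)
            encoding = encodings[0] if encodings else r.apparent_encoding
            html = r.content.decode(encoding, 'replace')
        else:
            html = r.text
        m = re.search(r'<title>(.*?)</title>', html, re.S | re.I)
        return m.group(1).strip() if m else None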

Using chardet to detect and convert directly

Enumerating charsets by hand like that is really clumsy; chardet solves the page-encoding problem with almost no effort.

# -*- coding: utf-8 -*-
# @Time    : 2018/5/4 0004 8:55
# @Author  : Langzi
# @Blog    : www.langzi.fun
# @File    : get urls.py
# @Software: PyCharm
# Python 2 script: reload(sys)/setdefaultencoding, the print statement and the
# unicode type do not exist in Python 3.
import sys
import chardet
import re
import requests

reload(sys)
sys.setdefaultencoding('utf-8')

url = 'https://stackoverflow.com'
d1 = requests.get(url)
print d1.content
if isinstance(d1.content, unicode):
    pass
else:
    # chardet.detect returns a dict such as {'encoding': 'utf-8', 'confidence': 0.99, ...}
    codesty = chardet.detect(d1.content)
    a = d1.content.decode(codesty['encoding'])

The variable a now holds the page content decoded with the detected charset; at this point re.search('<title>(.*?)</title>', a) is enough to match the title of any site.
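The script above targets Python 2; a minimal Python 3 sketch of the same chardet flow (reusing the example URL from above) would look like this:

    import re
    import chardet
    import requests

    url = 'https://stackoverflow.com'
    resp = requests.get(url, timeout=5)
    detected = chardet.detect(resp.content)          # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    html = resp.content.decode(detected['encoding'] or 'utf-8', 'replace')
    m = re.search(r'<title>(.*?)</title>', html, re.S | re.I)
    print(m.group(1).strip() if m else 'no title found')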

Personal blog: www.langzi.fun
Feel free to get in touch to discuss Python development and security testing.