Python: Converting Text File Encoding

While preparing a weekly report recently, I needed to extract data from CSV files, build tables from it, and then generate charts.

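To read a CSV file I normally open it with with open(filename, encoding='UTF-8') as f:. In practice, though, some of the CSV files are not UTF-8 encoded, so the program raises an error at run time. Each time this happened I had to manually re-save the file as UTF-8 and run the program again, so I decided to detect and convert the encoding inside the program itself.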

Basic idea: first check whether the file is UTF-8 encoded; if it is not, convert it to UTF-8, and only then process it.
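As a rough illustration of the "check first" step, the sketch below tests whether bytes decode as strict UTF-8 using only the standard library. This is not the chardet-based detection used in the rest of this post; the helper names is_utf8_bytes and is_utf8_file and the sample strings are assumptions made for the example.

```python
def is_utf8_bytes(data):
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf-8')  # errors='strict' is the default
    except UnicodeDecodeError:
        return False
    return True


def is_utf8_file(path):
    """Read the whole file in binary mode and apply the same check."""
    with open(path, 'rb') as f:
        return is_utf8_bytes(f.read())


print(is_utf8_bytes('中文'.encode('utf-8')))  # valid UTF-8
print(is_utf8_bytes('中文'.encode('gbk')))    # GBK bytes are not valid UTF-8 here
```

A strict decode only answers "is this UTF-8 or not"; it cannot say what the actual encoding is when the answer is no, which is where a detector such as chardet comes in.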

Python's chardet library can report the encoding of a piece of text:

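The detect function takes a single non-Unicode string argument (bytes in Python 3) and returns a dictionary, for example {'encoding': 'utf-8', 'confidence': 0.99}. The dictionary contains the detected encoding and the confidence of that detection.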

import chardet

def get_encode_info(file):
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']
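The confidence value in the returned dictionary can be used to reject weak guesses. The following is a minimal sketch of that idea; the helper name detect_encoding and the 0.5 threshold are assumptions chosen for illustration, not something chardet prescribes.

```python
import chardet


def detect_encoding(data, min_confidence=0.5):
    """Return the detected encoding name, or None if chardet is unsure.

    `min_confidence` is an arbitrary threshold chosen for this sketch.
    """
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    if result['encoding'] is None or result['confidence'] < min_confidence:
        return None
    return result['encoding']


sample = ('編碼檢測測試,' * 50).encode('utf-8')
print(chardet.detect(sample))
print(detect_encoding(sample))
```

Returning None for a low-confidence guess lets the caller decide what to do (skip the file, ask the user, fall back to a default) instead of decoding with a possibly wrong codec.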

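This performs acceptably on small files, but it becomes slow as files grow. My local CSV file is close to 200 KB, and the slowdown is already clearly noticeable. For this case chardet provides the UniversalDetector object: create a UniversalDetector, then call its feed method repeatedly with each block of text. If the detector reaches its minimum confidence threshold, it sets detector.done to True. Once you have exhausted the source text, call detector.close(), which performs some final calculations in case the detector did not reach its minimum confidence threshold earlier. The result is a dictionary containing the auto-detected character encoding and the confidence level (the same as what chardet.detect returns).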

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        for line in f.readlines():
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']
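The function above splits the input on line breaks, which means f.readlines() still loads the whole file into memory first. A possible variant, sketched here under the assumption that fixed-size chunks are acceptable input for feed, reads the file incrementally instead. The name detect_file_encoding and the 4096-byte chunk size are illustrative choices.

```python
from chardet.universaldetector import UniversalDetector


def detect_file_encoding(path, chunk_size=4096):
    """Feed the file to chardet in fixed-size chunks, stopping early if possible."""
    detector = UniversalDetector()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(chunk_size) until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()  # finalize the result even if `done` was never set
    return detector.result['encoding']
```

Usage would look like detect_file_encoding('test.csv'), where the file name is a placeholder. The return value can still be None when chardet cannot make a guess.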

The encoding conversion then ran into a problem: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode)   # --> this line raises the error
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)
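The 'charmap' in the message suggests that the detected encoding was a single-byte code page in which some byte values have no character assigned. The post does not say which encoding was reported, so the snippet below assumes Windows-1252 (cp1252) purely to reproduce the same kind of error on a small input.

```python
data = b'abc\x90def'  # 0x90 has no mapping in Python's cp1252 codec

message = None
try:
    data.decode('cp1252')  # strict decoding, like decode(original_encode) above
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The error therefore means that the byte sequence is not fully valid in the encoding passed to decode, which can happen when the detector's guess is only approximately right.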

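This happens because the bytes cannot be decoded cleanly with the given encoding; the fix is to pass one more argument, errors. The official documentation says: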

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

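In other words, the byte array is decoded into a string (using UTF-8 unless another encoding is given), and the error handling scheme can be changed. The default is 'strict', which may raise a UnicodeError; switching to 'ignore' or 'replace' avoids the exception.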

So changing the line file_decode = file_content.decode(original_encode) to file_decode = file_content.decode(original_encode, 'ignore') resolves the error.
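To make the difference between the three built-in handlers concrete, here is a small sketch on a deliberately broken input. The sample bytes are made up for the example.

```python
raw = b'caf\xff'  # 0xff can never appear in valid UTF-8

for handler in ('strict', 'ignore', 'replace'):
    try:
        print(handler, '->', repr(raw.decode('utf-8', handler)))
    except UnicodeDecodeError as exc:
        print(handler, '->', 'UnicodeDecodeError:', exc.reason)

ignored = raw.decode('utf-8', 'ignore')    # the bad byte is dropped silently
replaced = raw.decode('utf-8', 'replace')  # the bad byte becomes U+FFFD
```

Note that 'ignore' discards the undecodable bytes without a trace, while 'replace' leaves a visible U+FFFD marker in the output. Which one is preferable depends on whether silently losing a few characters is acceptable for the data at hand.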

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        for line in f.readlines():
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode,'ignore')
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)

if __name__ == "__main__":
    filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
    encode_info = get_encode_info(filename)
    # chardet may return None when it cannot guess; skip the conversion in that case
    if encode_info and encode_info.lower() != 'utf-8':
        convert_encode2utf8(filename, encode_info, 'utf-8')
    encode_info = get_encode_info(filename)
    print(encode_info)

Reference: https://chardet.readthedocs.io/en/latest/usage.html
