Converting Text File Encodings in Python: An Implementation Walkthrough

A Xin • Published: 2020-01-09

While preparing a weekly report recently, I needed to extract data from CSV files, build tables from it, and then generate charts.

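To read a CSV file I almost always used with open(filename,encoding ='UTF-8') as f:. In practice, however, some CSV files turned out not to be UTF-8 encoded, which made the program fail at run time. Each time I had to change the file's encoding to UTF-8 by hand and run the program again, so I decided to detect and convert the encoding directly in the program.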

Basic idea: first check whether the file is UTF-8 encoded; if it is not, convert it to UTF-8 and then process it.
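The first half of that idea, checking whether a file is already UTF-8, can be sketched with the standard library alone by attempting a strict decode. This is a swapped-in technique, not the chardet approach used in the rest of this article, and the helper name is_valid_utf8 is hypothetical. It only answers "is this valid UTF-8?"; it cannot tell which encoding the file uses otherwise.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("週報".encode("utf-8")))  # valid UTF-8 bytes
    print(is_valid_utf8(b"caf\xe9"))              # Latin-1 bytes, not valid UTF-8
```

Pure ASCII data also passes this check, since ASCII is a subset of UTF-8.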

Python's chardet library can report a file's encoding:

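The detect function takes a single argument, a non-Unicode (byte) string, and returns a dictionary (for example: {'encoding': 'utf-8','confidence': 0.99}) containing the detected encoding and the confidence of the detection.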

import chardet
def get_encode_info(file):
  with open(file,'rb') as f:
    return chardet.detect(f.read())['encoding']

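Performance is acceptable for small files, but detection becomes slow once the file grows. My local CSV file is close to 200k, and the slowdown is already clearly noticeable. chardet also provides the UniversalDetector object for this case: create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches its minimum confidence threshold, it sets detector.done to True.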

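Once you have exhausted the source text, call detector.close(), which performs some final calculations in case the detector never reached its minimum confidence threshold. The result, available in detector.result, is a dictionary containing the detected character encoding and confidence (the same as what the chardet.detect function returns).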

from chardet.universaldetector import UniversalDetector
def get_encode_info(file):
  with open(file,'rb') as f:
    detector = UniversalDetector()
    for line in f.readlines():
      detector.feed(line)
      if detector.done:
        break
    detector.close()
    return detector.result['encoding']
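One possible refinement, offered as a sketch rather than as part of the original solution: f.readlines() loads the whole file into memory before the first feed call, so reading fixed-size chunks keeps memory use flat and lets detection stop early. The function name detect_in_chunks and the chunk_size default are hypothetical; the detector argument is assumed to behave like chardet's UniversalDetector (feed, done, close, result).

```python
def detect_in_chunks(path, detector, chunk_size=64 * 1024):
    """Feed `path` to `detector` in fixed-size chunks and return its result dict.

    `detector` is assumed to expose feed(), done, close() and result,
    as chardet's UniversalDetector does.
    """
    with open(path, "rb") as f:
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            detector.feed(chunk)
    detector.close()  # finalize even if the threshold was never reached
    return detector.result
```

With chardet installed, it would be called as detect_in_chunks(filename, UniversalDetector())['encoding'], using the same import as above.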

During the encoding conversion I ran into a problem: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
  with open(file,'rb') as f:
    return f.read()
def write_file(content,file):
  with open(file,'wb') as f:
    f.write(content)
def convert_encode2utf8(file,original_encode,des_encode):
  file_content = read_file(file)
  file_decode = file_content.decode(original_encode)  # <-- the error is raised here
  file_encode = file_decode.encode(des_encode)
  write_file(file_encode,file)

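This happens because the byte string could not be decoded cleanly; an additional argument, errors, is needed. The official documentation says: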

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

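In other words, the byte array is decoded into a string (using UTF-8 by default), and the error handling scheme can be changed. The default is 'strict', which may raise UnicodeError; changing it to 'ignore' or 'replace' resolves the error. Note that 'ignore' silently drops the bytes that cannot be decoded and 'replace' substitutes U+FFFD for them, so the converted file may differ from the original.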

So change the line file_decode = file_content.decode(original_encode) to file_decode = file_content.decode(original_encode,'ignore').
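To make the difference between the error handlers concrete, here is a small standalone sketch. It assumes cp1252 as the source encoding, only because the 'charmap' error above is typical of a Windows code page; the article does not state which encoding was actually detected. The sample bytes are made up for illustration.

```python
raw = b"ab\x90cd"  # 0x90 has no mapping in Python's cp1252 codec

try:
    raw.decode("cp1252")  # errors="strict" by default
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

ignored = raw.decode("cp1252", "ignore")    # drops the undecodable byte
replaced = raw.decode("cp1252", "replace")  # substitutes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

'replace' leaves a visible marker where data was lost, which can make damaged cells easier to spot later than with 'ignore'.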

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
 with open(file,'rb') as f:
   detector = UniversalDetector()
   for line in f.readlines():
     detector.feed(line)
     if detector.done:
       break
   detector.close()
   return detector.result['encoding']

def read_file(file):
  with open(file,'rb') as f:
    return f.read()

def write_file(content,file):
  with open(file,'wb') as f:
    f.write(content)

def convert_encode2utf8(file,original_encode,des_encode):
  file_content = read_file(file)
  file_decode = file_content.decode(original_encode,'ignore')
  file_encode = file_decode.encode(des_encode)
  write_file(file_encode,file)

if __name__ == "__main__":
  filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
  encode_info = get_encode_info(filename)
  # chardet may return None when it cannot detect an encoding
  if encode_info and encode_info.lower() != 'utf-8':
    convert_encode2utf8(filename,encode_info,'utf-8')
  encode_info = get_encode_info(filename)
  print(encode_info)
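The code above overwrites the original file in place. A more cautious variant, sketched here under the assumption that keeping the source file untouched is acceptable, writes the UTF-8 version to a separate path. The function name convert_to_utf8_copy and its parameters are hypothetical and not part of the original script.

```python
def convert_to_utf8_copy(src, dst, src_encoding, errors="strict"):
    """Decode `src` with `src_encoding` and write it to `dst` as UTF-8.

    newline="" keeps the original line endings unchanged on both sides.
    """
    with open(src, "r", encoding=src_encoding, errors=errors, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:  # stream line by line instead of reading everything
            fout.write(line)
```

Passing errors="ignore" or errors="replace" reproduces the behavior discussed earlier, while the default "strict" surfaces a wrong encoding guess immediately.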

Reference: https://chardet.readthedocs.io/en/latest/usage.html

