python os.walk filename ‘ascii’ codec can’t decode

阿新 • Published: 2018-12-27

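Following up on the previous article, "Chinese directory names and the os.path.join problem in Python", this time the problem is an 'ascii' codec can't decode error on file names returned by os.walk.

Code: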

import os

for root, dirs, files in os.walk(startpath):
    for f in files:
        # join the directory being walked (root) with the file name
        path = '{}/{}'.format(root, f)

I can reproduce the os.listdir() behavior: os.listdir(unicode_name) returns undecodable entries as bytes on Python 2.7:

>>> import os
>>> os.listdir(u'.')
[u'abc', '<--\x8b-->']

Notice: the second name is a bytestring despite listdir()'s argument being a Unicode string.

A big question remains, however: how can this be solved without resorting to a hack?

Python 3 handles filename bytes that cannot be decoded with the filesystem's character encoding via the surrogateescape error handler (os.fsencode/os.fsdecode). See PEP 383: Non-decodable Bytes in System Character Interfaces:

>>> os.listdir(u'.')
['abc', '<--\udc8b-->']

Notice: both strings are Unicode (Python 3), and the surrogateescape error handler was used for the second name. To get the original bytes back:

>>> os.fsencode('<--\udc8b-->')
b'<--\x8b-->'
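
The same round trip can be reproduced without touching the filesystem. The sketch below is a minimal illustration, assuming Python 3 and a UTF-8 filesystem encoding. It calls the codec with the surrogateescape handler directly instead of going through os.fsdecode/os.fsencode, so the result does not depend on the machine's locale.

```python
# Python 3 only: emulate what os.fsdecode/os.fsencode do on a UTF-8 system.
raw = b'<--\x8b-->'  # 0x8b is a stray continuation byte, invalid as UTF-8

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('utf-8', 'surrogateescape')

print(ascii(text))      # '<--\udc8b-->'
print(restored == raw)  # True
```

Because the escaped name maps back to the exact original bytes, it can still be passed to open() or os.stat() even though it cannot be printed cleanly.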

In Python 2, use Unicode strings for filenames on Windows (Unicode API), OS X (utf-8 is enforced) and use bytestrings on Linux and other systems.

A clumsy workaround:

for f in files:
    try:
        n = posixpath.join(rcontext, f)
    except UnicodeDecodeError:
        n = posixpath.join(rcontext, f.decode('utf-8'))
    # ...

This issue stems from two fundamental problems. The first is the fact that the Python 2.x default encoding is 'ascii', while the default Linux encoding is 'utf8'. You can verify these encodings via:

import sys
sys.getdefaultencoding()     # Python's default encoding
sys.getfilesystemencoding()  # encoding used for file names on this OS

When the os module functions that return directory contents, namely os.walk and os.listdir, are given a unicode path, they try to decode each filename with the filesystem encoding. Names that decode cleanly (ASCII-only names normally do) are returned as unicode objects; the others are returned unchanged as str. The result is therefore a list containing a mix of unicode and str objects, and it is the str objects that cause problems down the line. Because they contain non-ASCII bytes and Python has no way of knowing which encoding they use, they cannot be decoded automatically into unicode.

Therefore, a common operation such as os.path.join(dir, file), where dir is unicode and file is an encoded str, will fail if file is not pure ASCII, because Python falls back to the default 'ascii' codec to convert it. The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.
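
To see why the join fails, it helps to make Python 2's implicit conversion explicit. When a unicode object and a str are combined, Python 2 decodes the str with the default 'ascii' codec. The sketch below performs that decode by hand on a hypothetical file name (written as a byte literal so that it behaves the same on Python 2 and Python 3), then decodes it with the encoding the bytes actually use.

```python
# Hypothetical file name: "cafe.txt" with an e-acute, stored as UTF-8 bytes.
name = b'caf\xc3\xa9.txt'

# This is the step Python 2 performs silently when str meets unicode.
try:
    name.decode('ascii')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(failed)   # True
print(message)  # 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

# Decoding explicitly with the right encoding avoids the implicit step.
decoded = name.decode('utf-8')
```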

That’s the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:

filename.decode('windows-1252')

If a valid unicode version results, you probably have the correct encoding. You can further verify by calling print on the unicode version and checking that the correct filename is rendered.
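
A caveat worth keeping in mind: windows-1252 assigns a character to almost every byte value, so a successful decode on its own is weak evidence. The following sketch uses two hypothetical byte strings to show why looking at the rendered name matters.

```python
latin_bytes = b'caf\xe9'     # "cafe" with e-acute, encoded as windows-1252
utf8_bytes = b'caf\xc3\xa9'  # the same name, encoded as UTF-8

# Correct guess: the bytes decode to the intended name.
good = latin_bytes.decode('windows-1252')
print(good == u'caf\xe9')      # True

# Wrong guess: decoding still succeeds, but yields mojibake
# (A-tilde followed by a copyright sign instead of e-acute).
bad = utf8_bytes.decode('windows-1252')
print(bad == u'caf\xc3\xa9')   # True

# UTF-8 is far stricter, so it rejects the windows-1252 bytes outright.
try:
    latin_bytes.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
print(is_utf8)                 # False
```

This strictness is the reason a "try UTF-8 first, then fall back" order tends to work better than the reverse.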

One last wrinkle. In a Linux system with files of Windows origin, it is possible or even probable to have a mix of windows-1252 and utf8 encodings. There are two means of dealing with this mixture. The first and preferable is to run:

$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest

where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing.

The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:

def decodeName(name):
    if isinstance(name, str):  # Python 2 byte string; leave unicode ones alone
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError:
            name = name.decode('windows-1252')
    return name

The function tries a utf8 decoding first. If it fails, then it falls back to the windows-1252 version. Use this function after an os call returning a list of files:

for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
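
One design point is worth making explicit: the decoded name is suitable for display, logging, or storage, but the file itself is still stored under its original bytes, so it is safer to keep the raw path for opening it. The sketch below is a variation on decodeName under that assumption; decode_name and walk_decoded are hypothetical names, not part of the original code. It checks for bytes (an alias of str on Python 2) so that it runs on both Python 2 and Python 3, and it adds a last-resort fallback because a few byte values are undefined in windows-1252.

```python
import os

def decode_name(name, encodings=('utf-8', 'windows-1252')):
    """Return text for a file name, trying each encoding in turn."""
    if not isinstance(name, bytes):  # already text, leave it alone
        return name
    for encoding in encodings:
        try:
            return name.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: mark undecodable bytes with U+FFFD instead of raising.
    return name.decode('utf-8', 'replace')

def walk_decoded(top):
    """Yield (raw_path, display_name) for every file under top.

    raw_path keeps the names exactly as os.walk returned them, so it can
    be passed to open(); display_name is only meant for showing to users.
    """
    for root, dirs, files in os.walk(top):
        for f in files:
            yield os.path.join(root, f), decode_name(f)
```

On Python 2, passing a byte string for top, for example walk_decoded('/srv/files') with a hypothetical path, keeps every path component a str, so os.path.join never triggers an implicit ASCII decode.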

I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:

I highly recommend it for anyone struggling with unicode issues.

>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)

^ This is the difference between a byte string (plain_string) and a unicode string.

>>> s = "Hello!"
>>> u = unicode(s, "utf-8")

^ Converting to unicode and specifying the encoding.
