[Original] Python encoding: working with Chinese character encodings
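Setup:
In the site-packages directory under Python's lib directory, create a file named sitecustomize.py, for example:
C:\Python27\lib\site-packages\sitecustomize.py
Enter the following content, then save and close the file.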
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('iso-8859-1')  # replace with the encoding under test
Restart the Python IDE after every change.
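A note on why the call lives in sitecustomize.py: in Python 2, site.py removes sys.setdefaultencoding once startup has finished, so an interactive session cannot normally call it. The check below is only a sketch and changes nothing in the interpreter. On Python 3 the setter does not exist at all.

```python
import sys

# After startup the setter is normally gone: Python 2's site.py removes it,
# and Python 3 never defines it.
has_setter = hasattr(sys, 'setdefaultencoding')

# The getter is always available and reports the value currently in effect.
current = sys.getdefaultencoding()

print(has_setter)
print(current)
```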
The results are as follows:
1. iso-8859-1
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>> s=u'我是中國人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖйúÈË
>>> s='我是中國人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中國人
>>>
2. ascii (the default encoding; no sitecustomize.py is needed)
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> s=u'我是中國人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖйúÈË
>>> s='我是中國人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中國人
>>>
3. UTF-8
>>> import sys
>>> sys.getdefaultencoding()
'UTF-8'
>>> s=u'我是中國人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖйúÈË
>>> s='我是中國人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中國人
>>>
4. gb2312
>>> import sys
>>> sys.getdefaultencoding()
'gb2312'
>>> s=u'我是中國人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖйúÈË
>>> s='我是中國人'
>>> s
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
我是中國人
>>>
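Notice that the output is identical in all four cases. That is frustrating: what is this setting for, then?
According to the book, once the default encoding is set, it can be used like this: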
>>> s=u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb' # corresponds exactly to '我是中國人'
>>> s
u'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> print s
ÎÒÊÇÖйúÈË
>>>
But the result is the same under all four encodings, which left me even more at a loss.
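One reason the four runs look identical: Python 2 consults the default encoding only when it converts implicitly between str and unicode, and nothing in the sessions above triggers such a conversion. Note also that u'\xce\xd2...' is not the Chinese text itself. It is the ten GB2312 bytes read as ten Latin-1 characters, which is why print shows ÎÒÊÇ.... The sketch below uses explicit decode calls instead of any default; the variable names are illustrative only.

```python
# The ten bytes shown in the sessions above (GB2312 for a five-character phrase).
raw = b'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'

# Decoding with the codec the bytes were written in gives five characters.
text = raw.decode('gb2312')

# Decoding as Latin-1 maps each byte to one character: ten characters of mojibake.
mojibake = raw.decode('latin-1')

# Mojibake of this kind is reversible: re-encode as Latin-1, then decode properly.
repaired = mojibake.encode('latin-1').decode('gb2312')

print(len(text))
print(len(mojibake))
print(repaired == text)
```

Calling decode with an explicit codec name makes the result independent of whatever sys.getdefaultencoding() returns.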
I then tried to read an XML file containing Chinese text, and it failed. The XML file is:
<?xml version="1.0" encoding="gb2312"?>
<preface>
<title>我是中國人</title>
</preface>
In the Python IDE:
>>> from xml.dom import minidom
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
xmldoc=minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>>
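I had no idea what was wrong. Encoding names are normally case-insensitive, but just to try, I changed the encoding in the XML declaration to GB2312 and read the file again: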
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>>
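Huh, why? No exception this time.
So be it. It seemed the lesson was to write encoding names in uppercase in XML, HTML, or anywhere else an encoding is declared!
But the next problem appeared right away when I tried to print what had been read: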
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>> title=xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\u6445\u646e\ufae0$\u6563\u726f\u876e\u5f73\u1a50\u01fe'
>>> print title
攄摮$散牯蝮彳ᩐǾ
[Note] Garbled text... again!
What about converting the text to another encoding before printing?
>>> converttitle=title.encode('GB2312')
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
converttitle=title.encode('GB2312')
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u646e' in position 1: illegal multibyte sequence
>>> converttitle=title.encode('UTF-8')
>>> converttitle
'\xe6\x91\x85\xe6\x91\xae\xef\xab\xa0$\xe6\x95\xa3\xe7\x89\xaf\xe8\x9d\xae\xe5\xbd\xb3\xe1\xa9\x90\xc7\xbe'
>>> print converttitle
攄摮$散牯蝮彳ᩐǾ
>>> a='我是中國人'
>>> a
'\xce\xd2\xca\xc7\xd6\xd0\xb9\xfa\xc8\xcb'
>>> converttitle=title.encode('gb2312')
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
converttitle=title.encode('gb2312')
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u646e' in position 1: illegal multibyte sequence
>>> converttitle=title.encode('ascii')
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
converttitle=title.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> converttitle=title.encode('iso-8859-1')
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
converttitle=title.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
>>>
What now? Only UTF-8 encodes without an error, and the output is still garbled. This needs further investigation.
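When output is garbled like this, it can help to inspect the raw bytes of the file rather than the parsed result. The helper below is a hypothetical diagnostic, not part of any library: it reports which of a few candidate codecs can decode the bytes without an error. A clean decode does not prove the codec is the right one, but a failure rules it out.

```python
def decodable_with(raw, candidates=('ascii', 'utf-8', 'gb2312')):
    """Return the candidate codec names that decode raw without an error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this codec
        ok.append(name)
    return ok


phrase = u'\u6211\u662f\u4e2d\u56fd\u4eba'  # the five-character test phrase
gb_bytes = phrase.encode('gb2312')
utf8_bytes = phrase.encode('utf-8')

print(decodable_with(gb_bytes))
print(decodable_with(utf8_bytes))
```

To apply it to a file, read the file in binary mode first, for example with open(path, 'rb').read().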
One more test: with the XML still declaring GB2312, I set the default encoding in sitecustomize.py to iso-8859-1, ascii, utf-8, and GB2312 in turn, and read the XML under each setting.
>>> import sys
>>> sys.getdefaultencoding()
'GB2312'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: unknown encoding: line 1, column 30
>>>
>>> import sys
>>> sys.getdefaultencoding()
'UTF-8'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
>>>
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
>>>
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc = minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: unknown encoding: line 1, column 30
>>>
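[Note] Notice that UTF-8 and ASCII raise ExpatError: not well-formed (invalid token): line 3, column 10. Wasn't that the earlier upper/lowercase problem? And why do GB2312 and iso-8859-1 raise ExpatError: unknown encoding: line 1, column 30? Let me remove the configured encoding and try again: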
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> from xml.dom import minidom
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc=minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: unknown encoding: line 1, column 30
>>>
Wrong, all wrong. This overturns the earlier upper/lowercase explanation. Now everything is a mess: both gb2312 and GB2312 give unknown encoding.
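The unknown encoding error is consistent with the parser itself not handling gb2312. This is an assumption about the expat build behind minidom, not something verified in these sessions. One possible workaround, sketched below with a hypothetical helper name, is to decode the bytes yourself, rewrite the declaration, and hand the parser UTF-8. The sketch assumes the declaration is written exactly as encoding="gb2312".

```python
from xml.dom import minidom

# Sample document, built in memory and encoded as GB2312 to mimic the file on disk.
gb_xml = (
    u'<?xml version="1.0" encoding="gb2312"?>\n'
    u'<preface>\n'
    u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n'
    u'</preface>\n'
).encode('gb2312')


def parse_gb2312(raw_bytes):
    """Hypothetical helper: transcode GB2312 XML to UTF-8 before parsing."""
    text = raw_bytes.decode('gb2312')
    # Keep the declaration truthful about the bytes the parser will receive.
    text = text.replace(u'encoding="gb2312"', u'encoding="utf-8"', 1)
    return minidom.parseString(text.encode('utf-8'))


doc = parse_gb2312(gb_xml)
title = doc.getElementsByTagName('title')[0].firstChild.data
print(len(title))
```

The parser then only ever sees UTF-8, so the result no longer depends on which encodings it supports.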
Another try: keep sitecustomize.py but make it do nothing except import sys, with the line below it commented out with #. The IDE output:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> from xml.dom import minidom
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
xmldoc=minidom.parse('./mytest/russiansample.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1921, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 3, column 10
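Isn't this the same error as in the earlier upper/lowercase case? Sigh, so changing the case did nothing. Then how on earth do I read this XML? Was the one successful read earlier just a coincidence?
I was thoroughly confused. Then I copied an XML file from QQ, SSOConfig.xml, and edited it as follows: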
<?xml version="1.0" encoding="utf-8" ?>
<i18n>
<StringBundle>
地區資訊,目前只需要一個, SSOPlatform不需要地區資訊
</StringBundle>
</i18n>
>>> xmldoc=minidom.parse('./mytest/SSOConfig.xml')
>>>
Finally it works! I then copied this content into russiansample.xml:
<?xml version="1.0" encoding="utf-8" ?>
<i18n>
<StringBundle>
地區資訊,目前只需要一個, SSOPlatform不需要地區資訊
</StringBundle>
<preface>
<title>
我是中國人
</title>
</preface>
</i18n>
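This XML is fine, right? I thought so, yet I got the same exception. Identical content, and it fails just because the file name is different? I refused to believe it. Finally I found the problem, and it is an important one: the encoding the file itself is saved in! This had tormented me for a long time. Try it: open russiansample.xml, choose Save As, change the encoding from the default ANSI to UTF-8, then save and overwrite the file.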
>>> xmldoc=minidom.parse('./mytest/russiansample.xml')
>>> xmldoc.getElementsByTagName('title')
[<DOM Element: title at 0x21f3328>]
>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\n\t\t\u5730\u533a\u4fe1\u606f\uff0c\u76ee\u524d\u53ea\u9700\u8981\u4e00\u4e2a, SSOPlatform\u4e0d\u9700\u8981\u5730\u533a\u4fe1\u606f\n\t'
>>> print title
地區資訊,目前只需要一個, SSOPlatform不需要地區資訊
>>> c=title.encode('gb2312')
>>> c
'\n\t\t\xb5\xd8\xc7\xf8\xd0\xc5\xcf\xa2\xa3\xac\xc4\xbf\xc7\xb0\xd6\xbb\xd0\xe8\xd2\xaa\xd2\xbb\xb8\xf6, SSOPlatform\xb2\xbb\xd0\xe8\xd2\xaa\xb5\xd8\xc7\xf8\xd0\xc5\xcf\xa2\n\t'
>>> print c
地區資訊,目前只需要一個, SSOPlatform不需要地區資訊
>>>
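Success at last, and no transcoding is needed before printing. I am done experimenting. One last remark: the encoding a file is saved in matters a great deal, especially on Windows!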
Tested personally: when the XML declares encoding UTF-8, the file itself must also be saved as UTF-8. The XML can then be read directly in the IDE.
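The mismatch between the declared encoding and the bytes on disk can be reproduced without an editor. The sketch below writes the same text twice to a temporary file, once as GB2312 (roughly what ANSI means on a Simplified Chinese Windows) and once as UTF-8, and parses each. The function name and sample text are illustrative, not taken from the sessions above.

```python
import io
import os
import tempfile
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The declaration claims utf-8; whether that is true depends on how the file is saved.
XML_TEXT = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<preface>\n'
    u'<title>\u6211\u662f\u4e2d\u56fd\u4eba</title>\n'
    u'</preface>\n'
)


def save_and_parse(text, file_encoding):
    """Save text in file_encoding, parse it, and return the title or the error."""
    fd, path = tempfile.mkstemp(suffix='.xml')
    os.close(fd)
    try:
        with io.open(path, 'w', encoding=file_encoding, newline='') as f:
            f.write(text)
        try:
            doc = minidom.parse(path)
        except ExpatError as exc:
            return 'ExpatError: %s' % exc
        return doc.getElementsByTagName('title')[0].firstChild.data
    finally:
        os.remove(path)


mismatched = save_and_parse(XML_TEXT, 'gb2312')  # bytes contradict the declaration
matched = save_and_parse(XML_TEXT, 'utf-8')      # bytes agree with the declaration

print(mismatched)
print(len(matched))
```

Only the saved encoding differs between the two calls, which isolates it as the cause of the parse error.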
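Also: QQ seems to declare and save all of its XML files as utf-8, while Baidu seems to use gb2312 for most of its files.
Since the one time the upper/lowercase change let me read the gb2312 XML, I have never managed to reproduce it: not even garbled output, only exceptions. Use utf-8 whenever possible; it avoids nearly all XML encoding problems.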
9/15/2013 17:57:47
All rights reserved. When reposting, please include a link to this article. Thank you!