Unicode is a standard that assigns a numeric code point to every character; UTF-8 is one way of encoding those code points as bytes.
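To make the distinction concrete, here is a minimal sketch that takes one character from the session below and encodes its code point three different ways. The variable names are only illustrative, and the string is written with a `\u` escape so the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
ch = u'\u554a'  # the character 啊, the first character of '啊哈哈'

# The code point is a property of the character, independent of any encoding.
code_point = ord(ch)

# The same code point becomes different byte sequences under different encodings.
utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-be')
utf32 = ch.encode('utf-32-be')

print(hex(code_point))                                    # 0x554a
print('%d %d %d' % (len(utf8), len(utf16), len(utf32)))   # 3 2 4
```

The UTF-8 result is the same `\xe5\x95\x8a` that starts the byte string in the session below.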

Encoding in Python 2

In [1]: a = '啊哈哈'
In [2]: a
Out[2]: '\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'
In [4]: type(a)
Out[4]: str
In [5]: len(a)
Out[5]: 9
In [6]: b = u'姚赫赫'
In [7]: type(b)
Out[7]: unicode
In [8]: len(b)
Out[8]: 3
In [9]: a.decode('utf-8')
Out[9]: u'\u554a\u54c8\u54c8'
In [10]: b
Out[10]: u'\u59da\u8d6b\u8d6b'
In [11]: b.encode('utf-8')
Out[11]: '\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'
In [12]: c = '姚赫赫'
In [13]: c
Out[13]: '\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'
In [14]: import sys
In [15]: sys.getdefaultencoding()
Out[15]: 'ascii'
In [16]: b + c
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-16-c6b7c7e5694f> in <module>()
----> 1 b + c
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
In [17]: import sys
In [18]: relaod(sys)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-f73449e725b6> in <module>()
----> 1 relaod(sys)
NameError: name 'relaod' is not defined
In [19]: reload(sys)
<module 'sys' (built-in)>
In [20]: sys.setdefaultencoding('utf-8')
In [21]: b + c
Out[21]: u'\u59da\u8d6b\u8d6b\u59da\u8d6b\u8d6b'
In [22]: type(b + c)
Out[22]: unicode
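The session above makes `b + c` work by changing the process-wide default encoding. An alternative that avoids touching global state is to decode the byte string explicitly before concatenating. The sketch below is only an illustration of that idea: the names `text`, `raw`, `ascii_failed` and `combined` are made up for the example, and the literals are written with escapes and a `b''` prefix so the code runs the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'\u59da\u8d6b\u8d6b'                    # 姚赫赫 as text
raw = b'\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'   # the same text as UTF-8 bytes

# Decoding as ASCII fails the same way the implicit conversion in b + c did.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding explicitly as UTF-8 makes the concatenation well defined.
combined = text + raw.decode('utf-8')
```

Here `combined` equals `u'\u59da\u8d6b\u8d6b\u59da\u8d6b\u8d6b'`, the same value as `Out[21]`, without calling `sys.setdefaultencoding`.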

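In Python 2, a = '啊哈哈' has type str, which is an encoded byte sequence, so len(a) is the number of bytes. b has type unicode, which stores a text string, so len(b) is the number of characters.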


str -> decode('utf-8') -> unicode
unicode -> encode('utf-8') -> str
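The two conversions above are inverses of each other, which a short round trip can check. This is a minimal sketch with illustrative variable names. It uses a `u''` literal with escapes so that it also runs on Python 3, where the two types are called str and bytes instead of unicode and str.

```python
# -*- coding: utf-8 -*-
text = u'\u554a\u54c8\u54c8'     # 啊哈哈 as text (unicode in Python 2)
data = text.encode('utf-8')      # text -> bytes (str in Python 2)
restored = data.decode('utf-8')  # bytes -> text

print(len(text))         # 3 characters
print(len(data))         # 9 bytes, three per character in UTF-8
print(restored == text)  # True
```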


