1. Python: Converting Between Unicode and Plain Strings

1.1. Problem

You need to deal with data that doesn't fit in the ASCII character set.

1.2. Solution

Unicode strings can be encoded as plain strings in a variety of ways, according to whichever encoding you choose:

    # Python 2 syntax: u"..." literals and the unicode() built-in.
    # Convert Unicode to a plain Python string: "encode"
    unicodestring = u"Hello world"
    utf8string = unicodestring.encode("utf-8")
    asciistring = unicodestring.encode("ascii")
    isostring = unicodestring.encode("ISO-8859-1")
    utf16string = unicodestring.encode("utf-16")

    # Convert a plain Python string to Unicode: "decode"
    plainstring1 = unicode(utf8string, "utf-8")
    plainstring2 = unicode(asciistring, "ascii")
    plainstring3 = unicode(isostring, "ISO-8859-1")
    plainstring4 = unicode(utf16string, "utf-16")

    # All four decodings recover the same Unicode string.
    assert plainstring1 == plainstring2 == plainstring3 == plainstring4
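The recipe's code targets Python 2. For readers on Python 3, where str is already the Unicode type and bytes is the byte-string type, the same round trip can be sketched roughly as follows. This is an adaptation and not part of the original recipe; the variable names simply mirror the ones used above.

```python
# Python 3 sketch: str holds Unicode text, bytes holds the encoded form.
unicodestring = "Hello world"

# Encode: text -> bytes
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")

# Decode: bytes -> text
plainstring1 = utf8string.decode("utf-8")
plainstring2 = asciistring.decode("ascii")
plainstring3 = isostring.decode("ISO-8859-1")
plainstring4 = utf16string.decode("utf-16")

# All four decodings recover the same text.
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
```

For pure ASCII text like this, the UTF-8, ASCII, and ISO-8859-1 byte strings are identical; only the UTF-16 result differs.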

1.3. Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don't need to know everything about Unicode to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only 256 distinct values, these environments are limited to 256 characters. Unicode, on the other hand, defines well over 100,000 characters. A single byte therefore cannot represent every Unicode character, and many characters take more than one byte once encoded, so you need to make the distinction between characters and bytes.
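The difference between characters and bytes is easy to see by counting both for the same text. The snippet below is a small illustration written for Python 3 (an assumption; the recipe itself uses Python 2), and the sample word is chosen only for demonstration.

```python
# Python 3 sketch: count characters versus bytes for the same text.
text = "caf\u00e9"                 # 4 characters; the last is U+00E9 (e with acute)
utf8_bytes = text.encode("utf-8")

char_count = len(text)             # number of Unicode characters
byte_count = len(utf8_bytes)       # number of bytes after UTF-8 encoding

print(char_count, byte_count)      # 4 5
```

The three ASCII letters take one byte each, while the accented letter takes two bytes in UTF-8.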

Standard Python 2 strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string." In this recipe we will call them byte strings, to remind you of their byte-orientedness.

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
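As a rough illustration of encoding on the way out and decoding on the way in, the sketch below uses io.BytesIO as a stand-in for any byte-oriented object such as a binary file or a socket. It is written for Python 3, and the choice of UTF-8 and of the sample text are assumptions made for the example.

```python
import io

text = "na\u00efve"                # contains U+00EF, a non-ASCII character

# io.BytesIO stands in for a byte-oriented sink (a binary file, a socket).
sink = io.BytesIO()
sink.write(text.encode("utf-8"))   # encode: characters -> bytes

# Reading gives bytes back; decode them to recover the characters.
raw = sink.getvalue()
restored = raw.decode("utf-8")     # decode: bytes -> characters

assert restored == text
```

The same encoding has to be used in both directions; decoding with a different one may raise an error or silently produce the wrong characters.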

There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed as a parameter to the encode method when encoding and to the unicode built-in (or the decode method) when decoding. Here are a few you should know about:

  • The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for East Asian text, where most characters take three bytes each.
  • The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for East Asian ones. An older, fixed-width variant of UTF-16 is sometimes known as UCS-2.
  • The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; each can support only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.
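The trade-offs in the list above can be checked with a short experiment. The following is a sketch for Python 3 (an assumption); encoded_size is a hypothetical helper introduced only for this example, and "utf-16-le" is used instead of "utf-16" so that the byte-order mark does not inflate the counts.

```python
import codecs

western = "Hello world"                 # 11 ASCII characters
eastern = "\u65e5\u672c\u8a9e"          # 3 CJK characters

def encoded_size(text, encoding):
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))

# (UTF-8 size, UTF-16 size) for each sample
sizes = {
    "western": (encoded_size(western, "utf-8"), encoded_size(western, "utf-16-le")),
    "eastern": (encoded_size(eastern, "utf-8"), encoded_size(eastern, "utf-16-le")),
}
print(sizes)  # {'western': (11, 22), 'eastern': (9, 6)}

# Encoding names are case-insensitive: both spellings resolve to one codec.
same_codec = codecs.lookup("UTF-8").name == codecs.lookup("utf-8").name

# ISO-8859-1 cannot represent characters outside its 256-character set.
try:
    eastern.encode("ISO-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

UTF-8 is half the size of UTF-16 for the ASCII sample, while UTF-16 is smaller for the CJK sample, and the ISO-8859-1 attempt fails with UnicodeEncodeError.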

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
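When data does arrive in another encoding, one common approach is to decode it with the encoding it was written in and then re-encode it as UTF-8. The sketch below assumes Python 3 and assumes the incoming bytes are known to be ISO-8859-1; the byte string itself is made up for the example.

```python
# Bytes as another application might have produced them, in ISO-8859-1.
legacy_bytes = b"caf\xe9"

# Decode with the encoding the data was actually written in...
text = legacy_bytes.decode("ISO-8859-1")

# ...then re-encode as UTF-8 for storage or transmission.
utf8_bytes = text.encode("utf-8")

print(utf8_bytes)  # b'caf\xc3\xa9'
```

The encoding of incoming data generally cannot be detected reliably from the bytes alone, so it has to be known from the application or protocol that produced them.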