Converting Between Unicode Strings and Strings in Other Encodings in Python

1.1. Problem

You need to deal with data that doesn't fit in the ASCII character set.

1.2. Solution

Unicode strings can be encoded as plain (byte) strings in a variety of ways, depending on the encoding you choose:

# Python 2 syntax: u"..." is a Unicode string, unicode() is the built-in Unicode type.

# Convert Unicode to a plain Python string: "encode"
unicodestring = u"Hello world"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")

# Convert a plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")

# All four round trips yield the same Unicode string.
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
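
The recipe uses Python 2 syntax. As a rough sketch only, and assuming Python 3 (where `str` is the Unicode type and `bytes` is the byte string type, which is not something the original recipe covers), the same round trip might look like this. The variable names are illustrative.

```python
# Python 3 sketch (assumption): str holds Unicode text, bytes holds encoded data.
unicode_string = "Hello world"

# Encode: Unicode text -> bytes
utf8_bytes = unicode_string.encode("utf-8")
ascii_bytes = unicode_string.encode("ascii")
iso_bytes = unicode_string.encode("ISO-8859-1")
utf16_bytes = unicode_string.encode("utf-16")

# Decode: bytes -> Unicode text, using the same encoding name each time
decoded = [
    utf8_bytes.decode("utf-8"),
    ascii_bytes.decode("ascii"),
    iso_bytes.decode("ISO-8859-1"),
    utf16_bytes.decode("utf-16"),
]

# Every round trip returns the original text.
assert all(text == unicode_string for text in decoded)
```

Because the text is pure ASCII, the UTF-8, ASCII and ISO-8859-1 byte strings are identical here; only the UTF-16 bytes differ.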

1.3. Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don't need to know everything about Unicode to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only 256 distinct values, these environments are limited to 256 characters. Unicode, on the other hand, defines more than a hundred thousand characters. A Unicode character therefore cannot, in general, be stored in a single byte, so you need to make the distinction between characters and bytes.
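
To make the byte/character distinction concrete, here is a small sketch, assuming Python 3 (where `str` is the Unicode type). The sample strings and variable names are illustrative, not taken from the recipe. It counts characters in the text and bytes in its UTF-8 encoding.

```python
# Assumes Python 3. Escapes are used so the source file's own encoding does not matter.
accented = "na\u00efve caf\u00e9"   # 10 characters, two of them non-ASCII
chinese = "\u4e2d\u6587"            # 2 Chinese characters

accented_chars = len(accented)                   # counts characters
accented_bytes = len(accented.encode("utf-8"))   # counts bytes after encoding

chinese_chars = len(chinese)
chinese_bytes = len(chinese.encode("utf-8"))

print(accented_chars, accented_bytes)  # 10 12
print(chinese_chars, chinese_bytes)    # 2 6
```

The character count stays the same whatever encoding is chosen; only the byte count depends on it.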

In Python 2, standard Python strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string". In this recipe we will call them byte strings, to remind you of their byte-orientedness.

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
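
The encode-on-write, decode-on-read pattern can be sketched as follows. This assumes Python 3 and uses `io.BytesIO` as a stand-in for a byte-oriented object such as a binary file or a socket; the sample text and variable names are hypothetical.

```python
import io

message = "Gr\u00fc\u00dfe"   # 5 characters, two of them non-ASCII

# io.BytesIO plays the role of a byte-oriented destination (binary file, socket).
stream = io.BytesIO()

# A byte-oriented object refuses Unicode text that has not been encoded.
try:
    stream.write(message)
    rejected = False
except TypeError:
    rejected = True

# Encode before writing: characters -> bytes.
stream.write(message.encode("utf-8"))

# Decode after reading: bytes -> characters.
stream.seek(0)
raw = stream.read()
restored = raw.decode("utf-8")

assert restored == message
```

The choice of encoding is made only at the boundary; the rest of the program can keep working with Unicode text.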

There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few you should know about:
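
As a quick check that encoding names are case-insensitive, the following sketch (assuming Python 3 and the standard `codecs` module) encodes the same text under several spellings of one name. Note that `utf8` is a registered alias rather than a mere case variant; the list of spellings is illustrative.

```python
import codecs

text = "caf\u00e9"
spellings = ["utf-8", "UTF-8", "Utf-8", "utf8"]

# Each spelling selects the same codec, so the encoded bytes are identical.
encoded = {name: text.encode(name) for name in spellings}
distinct_results = set(encoded.values())

# codecs.lookup() resolves each spelling to the codec's canonical name.
canonical_names = {codecs.lookup(name).name for name in spellings}

assert len(distinct_results) == 1
assert len(canonical_names) == 1
```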

  • The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for East Asian texts, whose characters typically take three bytes each.
  • The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for East Asian ones. A closely related, older fixed-width encoding that covers only the first 65,536 code points is known as UCS-2.
  • The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; each supports only a particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.
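
The trade-offs in the list above can be observed directly. The sketch below assumes Python 3; the sample strings and the helper `encoded_size` are hypothetical. It compares byte counts for Western and East Asian text and shows ISO-8859-1 rejecting characters outside its repertoire.

```python
western = "Hello world"                        # 11 ASCII characters
eastern = "\u3053\u3093\u306b\u3061\u306f"     # 5 Japanese hiragana characters

def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

# "utf-16-le" is used so that the 2-byte byte order mark is not counted.
sizes = {
    "western": (encoded_size(western, "utf-8"), encoded_size(western, "utf-16-le")),
    "eastern": (encoded_size(eastern, "utf-8"), encoded_size(eastern, "utf-16-le")),
}
print(sizes)  # {'western': (11, 22), 'eastern': (15, 10)}

# Pure ASCII text encodes to identical bytes in ASCII, UTF-8 and ISO-8859-1.
ascii_compatible = (
    western.encode("ascii") == western.encode("utf-8") == western.encode("iso-8859-1")
)

# ISO-8859-1 has no Japanese characters, so encoding them fails.
try:
    eastern.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```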

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
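
When another application hands you data in a legacy encoding, one common approach is to decode it once and re-encode it as UTF-8. The following is a minimal sketch assuming Python 3; the byte string stands in for data received from elsewhere, and it must be known (not guessed) to be ISO-8859-2.

```python
# Hypothetical input: the Polish word for "turtle" as ISO-8859-2 (Latin-2) bytes.
legacy_bytes = b"\xbf\xf3\xb3w"

# Decode with the encoding the data was actually written in.
text = legacy_bytes.decode("iso-8859-2")

# Re-encode as UTF-8 for storage or transmission.
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong ISO-8859 variant raises no error but yields wrong characters.
garbled = legacy_bytes.decode("iso-8859-1")

assert utf8_bytes.decode("utf-8") == text
assert garbled != text
```

Since every byte value is valid in ISO-8859-1, a wrong guess goes unnoticed, which is why the source encoding has to be known in advance.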