Strings, bytes, runes and characters in Go

23 October 2013

Introduction

The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. Building on that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but to use them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, the difference between a string and a string literal, and other even more subtle distinctions.

One way to approach this topic is to think of it as an answer to the frequently asked question, "When I index a Go string at position n, why don't I get the nth character?" As you'll see, this question leads us to many details about how text works in the modern world.

What is a string?

Let's start with some basics.

In Go, a string is in effect a read-only slice of bytes. If you're at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we'll assume here that you have.

It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
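
As a minimal sketch of the two claims above (a string is read-only, and it may hold bytes that are not valid text), the program below repeats the sample constant and inspects it. The helper firstByteAfterCopyEdit is a hypothetical name introduced only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sample repeats the constant from the text so the sketch is self-contained.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// firstByteAfterCopyEdit (hypothetical helper) converts a non-empty s to a
// byte slice, overwrites the first byte of that copy, and returns the first
// byte of the original string. The conversion yields a copy, so the string
// itself is unchanged.
func firstByteAfterCopyEdit(s string) byte {
	b := []byte(s)
	b[0] = 'X'
	return s[0]
}

func main() {
	fmt.Println(len(sample))                           // 8: len counts bytes
	fmt.Println(utf8.ValidString(sample))              // false: not valid UTF-8
	fmt.Printf("%x\n", firstByteAfterCopyEdit(sample)) // bd
	// An assignment such as sample[0] = 'X' would not compile,
	// because string contents cannot be modified in place.
}
```

To modify the contents, work on the []byte copy and convert back to a string when done.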

Printing strings

Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce ugly output. The simple print statement

    fmt.Println(sample)

produces this mess (whose exact appearance varies with the environment):

��=� ⌘

To find out what that string really holds, we need to take it apart and examine the pieces. There are several ways to do this. The most obvious is to loop over its contents and pull out the bytes individually, as in this for loop:

    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }

As implied up front, indexing a string accesses individual bytes, not characters. We'll return to that topic in detail below. For now, let's stick with just the bytes. This is the output from the byte-by-byte loop:

bd b2 3d bc 20 e2 8c 98

Notice how the individual bytes match the hexadecimal escapes that defined the string.

A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte.

    fmt.Printf("%x\n", sample)

Compare its output to that above:

bdb23dbc20e28c98

A nice trick is to use the "space" flag in that format, putting a space between the % and the x. Compare the format string used here to the one above,

    fmt.Printf("% x\n", sample)

and notice how the bytes come out with spaces between, making the result a little less imposing:

bd b2 3d bc 20 e2 8c 98

There's more. The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous.

    fmt.Printf("%q\n", sample)

This technique is handy when much of the string is intelligible as text but there are peculiarities to root out; it produces:

"\xbd\xb2=\xbc ⌘"

If we squint at that, we can see that buried in the noise is one ASCII equals sign, along with a regular space, and at the end appears the well-known Swedish "Place of Interest" symbol. That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes after the space (hex value 20): e2 8c 98.

If we are unfamiliar with or confused by strange values in the string, we can use the "plus" flag to the %q verb. This flag causes the output to escape not only non-printable sequences, but also any non-ASCII bytes, all while interpreting UTF-8. The result is that it exposes the Unicode values of properly formatted UTF-8 that represents non-ASCII data in the string:

    fmt.Printf("%+q\n", sample)

With that format, the Unicode value of the Swedish symbol shows up as a \u escape:

"\xbd\xb2=\xbc \u2318"

These printing techniques are good to know when debugging the contents of strings, and will be handy in the discussion that follows. It's worth pointing out as well that all these methods behave exactly the same for byte slices as they do for strings.

Here's the full set of printing options we've listed, presented as a complete program:

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

    fmt.Println("Println:")
    fmt.Println(sample)

    fmt.Println("Byte loop:")
    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }
    fmt.Printf("\n")

    fmt.Println("Printf with %x:")
    fmt.Printf("%x\n", sample)

    fmt.Println("Printf with % x:")
    fmt.Printf("% x\n", sample)

    fmt.Println("Printf with %q:")
    fmt.Printf("%q\n", sample)

    fmt.Println("Printf with %+q:")
    fmt.Printf("%+q\n", sample)
}

[Exercise: Modify the examples above to use a slice of bytes instead of a string. Hint: Use a conversion to create the slice.]

[Exercise: Loop over the string using the %q format on each byte. What does the output tell you?]

UTF-8 and string literals

As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation. Let's look at a more controlled example to see how that happens.

Here's a simple program that prints a string constant with a single character three different ways, once as a plain string, once as an ASCII-only quoted string, and once as individual bytes in hexadecimal. To avoid any confusion, we create a "raw string", enclosed by back quotes, so it can contain only literal text. (Regular strings, enclosed by double quotes, can contain escape sequences as we showed above.)

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    fmt.Printf("\n")
}

The output is:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98

which reminds us that the Unicode character value U+2318, the "Place of Interest" symbol ⌘, is represented by the bytes e2 8c 98, and that those bytes are the UTF-8 encoding of the hexadecimal value 2318.
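
To make the relationship between the value 2318 and the bytes e2 8c 98 concrete, here is a rough sketch that encodes the code point by hand using the three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) and compares the result with utf8.EncodeRune from the standard library. The function encode3 is a hypothetical helper written for this illustration; it only covers the three-byte range and does not validate its input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 (hypothetical helper) hand-encodes a code point in the range
// U+0800..U+FFFF as three UTF-8 bytes. It does not reject surrogates or
// values outside that range.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),         // 1110xxxx: top 4 bits
		0x80 | (byte(r>>6) & 0x3F), // 10xxxxxx: middle 6 bits
		0x80 | (byte(r) & 0x3F),    // 10xxxxxx: low 6 bits
	}
}

func main() {
	manual := encode3('⌘')

	// The library routine, for comparison.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, '⌘')

	fmt.Printf("% x\n", manual[:]) // e2 8c 98
	fmt.Printf("% x\n", buf[:n])   // e2 8c 98
}
```

In real programs, utf8.EncodeRune (or simply converting a rune to a string) is the thing to use; the manual version is only there to show where the bytes come from.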

It may be obvious or it may be subtle, depending on your familiarity with UTF-8, but it's worth taking a moment to explain how the UTF-8 representation of the string was created. The simple fact is: it was created when the source code was written.

Source code in Go is defined to be UTF-8 text; no other representation is allowed. That implies that when, in the source code, we write the text

`⌘`

the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we're just dumping the data the editor placed in the file.

In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.

Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes.

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
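
The distinction can be checked with utf8.ValidString. The sketch below is illustrative only: describe is a hypothetical helper, and the three strings are chosen to show a raw literal, a literal with byte-level escapes, and a string value produced by slicing in the middle of an encoded symbol.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical helper) reports whether s is valid UTF-8.
func describe(s string) string {
	if utf8.ValidString(s) {
		return "valid UTF-8"
	}
	return "not valid UTF-8"
}

func main() {
	raw := `⌘`            // raw literal: exactly the UTF-8 source text
	escaped := "\xbd\xb2" // byte-level escapes: arbitrary bytes
	sliced := raw[:2]     // string value cut in the middle of an encoding

	fmt.Printf("%+q: %s\n", raw, describe(raw))         // "\u2318": valid UTF-8
	fmt.Printf("%+q: %s\n", escaped, describe(escaped)) // "\xbd\xb2": not valid UTF-8
	fmt.Printf("%+q: %s\n", sliced, describe(sliced))   // "\xe2\x8c": not valid UTF-8
}
```

The last case shows that even a string derived from a well-formed literal can stop being valid UTF-8 once it is sliced at an arbitrary byte position.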

Code points, characters, and runes

We've been very careful so far in how we use the words "byte" and "character". That's partly because strings hold bytes, and partly because the idea of "character" is a little hard to define. The Unicode standard uses the term "code point" to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more information about that code point, see its Unicode page.)

To pick a more prosaic example, the Unicode code point U+0061 is the lower case Latin letter 'A': a.

But what about the lower case grave-accented letter 'A', à? That's a character, and it's also a code point (U+00E0), but it has other representations. For example we can use the "combining" grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
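
A short sketch of that example, using \u escapes so the two spellings are unambiguous in the source. The helper hexAndCount is a hypothetical name used only here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// hexAndCount (hypothetical helper) shows the bytes of s in hexadecimal
// together with the number of code points they encode.
func hexAndCount(s string) string {
	return fmt.Sprintf("% x (%d code points)", s, utf8.RuneCountInString(s))
}

func main() {
	precomposed := "\u00e0" // à as the single code point U+00E0
	combining := "a\u0300"  // a (U+0061) followed by the combining grave accent (U+0300)

	fmt.Println(hexAndCount(precomposed)) // c3 a0 (1 code points)
	fmt.Println(hexAndCount(combining))   // 61 cc 80 (2 code points)

	// String comparison is byte-wise, so the two spellings are not equal.
	fmt.Println(precomposed == combining) // false
}
```

Both strings typically render as the same character on screen, yet they differ in bytes, in code point count, and under ==.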

The concept of character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee that a given character is always represented by the same code points, but that subject takes us too far off the topic for now. A later blog post will explain how the Go libraries address normalization.

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.

The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression

'⌘'

is rune with integer value 0x2318.
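
This is easy to confirm with the %T verb, which prints a value's type. A minimal sketch follows; describeRune is a hypothetical helper introduced for the illustration.

```go
package main

import "fmt"

// describeRune (hypothetical helper) formats the type, hexadecimal value,
// and printed form of a rune.
func describeRune(r rune) string {
	return fmt.Sprintf("%T %#x %c", r, r, r)
}

func main() {
	r := '⌘' // a rune constant; r takes the default type rune

	// %T prints int32 because rune is an alias for that type.
	fmt.Println(describeRune(r)) // int32 0x2318 ⌘

	// No conversion is needed between rune and int32.
	var n int32 = r
	fmt.Println(n) // 8984
}
```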

To summarize, here are the salient points:

  • Go source code is always UTF-8.
  • A string holds arbitrary bytes.
  • A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
  • Those sequences represent Unicode code points, called runes.
  • No guarantee is made in Go that characters in strings are normalized.

Range loops

Besides the axiomatic detail that Go source code is UTF-8, there's really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

We've seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. Here's an example using yet another handy Printf format, %#U, which shows the code point's Unicode value and its printed representation:

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
    const nihongo = "日本語"
    for index, runeValue := range nihongo {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }
}

The output shows how each code point occupies multiple bytes:

U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
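
The same loop gives one way to answer the question from the introduction. The sketch below contrasts indexing, which yields a byte, with a walk that counts code points; nthRune is a hypothetical helper, not a library function, and it is only one of several possible approaches.

```go
package main

import "fmt"

// nthRune (hypothetical helper) returns the n-th code point of s, counting
// from zero, by walking the string with a for range loop. The boolean
// reports whether s has that many code points.
func nthRune(s string, n int) (rune, bool) {
	count := 0
	for _, r := range s {
		if count == n {
			return r, true
		}
		count++
	}
	return 0, false
}

func main() {
	const nihongo = "日本語"

	// Indexing yields the byte at position 1, which is the second byte
	// of the encoding of 日.
	fmt.Printf("%x\n", nihongo[1]) // 97

	// Walking by code point yields the second symbol.
	r, ok := nthRune(nihongo, 1)
	fmt.Printf("%c %v\n", r, ok) // 本 true
}
```

Note that the walk takes time proportional to the position, because the width of each encoded code point is only known once it has been decoded.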

[Exercise: Put an invalid UTF-8 byte sequence into the string. (How?) What happens to the iterations of the loop?]

Libraries

Go's standard library provides strong support for interpreting UTF-8 text. If a for range loop isn't sufficient for your purposes, chances are the facility you need is provided by a package in the library.

The most important such package is unicode/utf8, which contains helper routines to validate, disassemble, and reassemble UTF-8 strings. Here is a program equivalent to the for range example above, but using the DecodeRuneInString function from that package to do the work. The return values from the function are the rune and its width in UTF-8-encoded bytes.

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
    const nihongo = "日本語"
    for i, w := 0, 0; i < len(nihongo); i += w {
        runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
        fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
        w = width
    }
}

Run it to see that it performs the same. The for range loop and DecodeRuneInString are defined to produce exactly the same iteration sequence.
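
One way to check that claim mechanically is to record both iteration sequences and compare them. The following is a sketch only; the step type and the viaRange, viaDecode, and same functions are hypothetical names introduced for this comparison.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// step records one iteration: the byte position and the decoded rune.
type step struct {
	pos int
	r   rune
}

// viaRange collects the sequence produced by a for range loop.
func viaRange(s string) []step {
	var out []step
	for i, r := range s {
		out = append(out, step{i, r})
	}
	return out
}

// viaDecode collects the sequence produced by repeated calls to
// utf8.DecodeRuneInString, advancing by the reported width.
func viaDecode(s string) []step {
	var out []step
	for i := 0; i < len(s); {
		r, width := utf8.DecodeRuneInString(s[i:])
		out = append(out, step{i, r})
		i += width
	}
	return out
}

// same reports whether two sequences are identical.
func same(a, b []step) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	const nihongo = "日本語"
	fmt.Println(same(viaRange(nihongo), viaDecode(nihongo))) // true
}
```

Any other string can be substituted for nihongo to repeat the comparison.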

Look at the documentation for the unicode/utf8 package to see what other facilities it provides.
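
As a small taste, the sketch below exercises a few of those routines: RuneCountInString, RuneLen, ValidString, and DecodeLastRuneInString. The summary function is a hypothetical helper that merely bundles three of the results.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// summary (hypothetical helper) returns the byte length of s, the number
// of code points it encodes, and whether it is valid UTF-8.
func summary(s string) (byteLen, runeCount int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	const nihongo = "日本語"

	b, r, ok := summary(nihongo)
	fmt.Println(b, r, ok) // 9 3 true

	// Number of bytes needed to encode a single rune.
	fmt.Println(utf8.RuneLen('語')) // 3

	// Decode from the end of the string instead of the start.
	last, width := utf8.DecodeLastRuneInString(nihongo)
	fmt.Printf("%c %d\n", last, width) // 語 3
}
```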

Conclusion

To answer the question posed at the beginning: Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of "character" is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.

There's much more to say about Unicode, UTF-8, and the world of multilingual text processing, but it can wait for another post. For now, we hope you have a better understanding of how Go strings behave and that, although they may contain arbitrary bytes, UTF-8 is a central part of their design.
