In-depth introduction to bufio.Scanner in Golang

阿新 • Published: 2019-01-15

Go ships with a package for buffered I/O, a technique for optimizing read and write operations. For writes, data is stored temporarily before it is passed on (to a disk or a socket, for example) and is only transmitted once a certain size has been reached. This triggers fewer write operations, which matters because each one boils down to a syscall, and syscalls can be expensive when made frequently. For reads, it means retrieving more data in a single operation. This also reduces the number of syscalls and can use the underlying hardware more efficiently, for example by reading data in whole disk blocks. This post focuses on the Scanner provided by the bufio package. It helps to process a stream of data by splitting it into tokens and removing the space between them:

"foo  bar   baz"

If we’re interested only in the words, the scanner retrieves “foo”, “bar” and “baz” in sequence (source code):

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    input := "foo   bar      baz"
    scanner := bufio.NewScanner(strings.NewReader(input))
    scanner.Split(bufio.ScanWords)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}

Output:

foo
bar
baz

The scanner uses buffered I/O while reading the stream: it takes an io.Reader as an argument.

If you’re dealing with data already in memory, such as a string or a slice of bytes, first check utilities like bytes.Split or strings.Split. When you’re not working with a data stream, it’s probably simpler to rely on those or other goodies from the bytes and strings packages.
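As a small sketch of that advice, the word-splitting example can be written without a scanner when the input is already a string. It uses strings.Fields rather than strings.Split, since Fields collapses runs of whitespace; the words helper is a name made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// words splits an in-memory string on runs of whitespace.
// No scanner or buffering is involved.
func words(input string) []string {
	return strings.Fields(input)
}

func main() {
	for _, w := range words("foo   bar      baz") {
		fmt.Println(w)
	}
}
```

With strings.Split(input, " ") the result would also contain empty strings for each extra space, so Fields is the closer match to what ScanWords does.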

Under the hood, the scanner uses a buffer to accumulate the data it has read. When the buffer is not empty or EOF has been reached, the split function (SplitFunc) is called. So far we’ve seen one of the pre-defined split functions, but it’s possible to set any function with the signature:

func(data []byte, atEOF bool) (advance int, token []byte, err error)

The split function is called with the data read so far and can behave in 3 different ways, distinguished by the values it returns.

1. Give me more data!

It says that the data passed in is not enough to produce a token. This is done by returning 0, nil, nil. When that happens, the scanner tries to read more data. If the buffer is full, the scanner doubles its size before reading. Let’s see how it works (source code):

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    input := "abcdefghijkl"
    scanner := bufio.NewScanner(strings.NewReader(input))
    split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
        return 0, nil, nil
    }
    scanner.Split(split)
    buf := make([]byte, 2)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)
    for scanner.Scan() {
        fmt.Printf("%s\n", scanner.Text())
    }
}

Output:

false	2	ab
false	4	abcd
false	8	abcdefgh
false	12	abcdefghijkl
true	12	abcdefghijkl

The split function above is very simple and greedy: it always asks for more data. The scanner tries to read more, but it also makes sure the buffer has enough space. In our case we start with a buffer of size 2:

buf := make([]byte, 2)
scanner.Buffer(buf, bufio.MaxScanTokenSize)

After the split function is called for the very first time, the scanner doubles the size of the buffer, reads more data and calls the split function a 2nd time. After the 2nd call the scenario is exactly the same. This is visible in the output: the first call to split gets a slice of size 2, then 4, 8 and finally 12, since there is no more data.

The default initial size of the buffer is 4096 bytes.

The atEOF parameter is worth discussing here. It is designed to tell the split function that no more data will be available. This can happen either when EOF is reached or when a read call returns an error. If either happens, the scanner will never try to read again. The flag can be used, for example, to return an error (because of an incomplete token), which causes scanner.Scan() to return false and stops the whole process. The error can be checked later using the Err method (source code):

package main

import (
    "bufio"
    "errors"
    "fmt"
    "strings"
)

func main() {
    input := "abcdefghijkl"
    scanner := bufio.NewScanner(strings.NewReader(input))
    split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
        if atEOF {
            return 0, nil, errors.New("bad luck")
        }
        return 0, nil, nil
    }
    scanner.Split(split)
    buf := make([]byte, 12)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)
    for scanner.Scan() {
        fmt.Printf("%s\n", scanner.Text())
    }
    if scanner.Err() != nil {
        fmt.Printf("error: %s\n", scanner.Err())
    }
}

Output:

false	12	abcdefghijkl
true	12	abcdefghijkl
error: bad luck

The atEOF parameter can also be used to process what is left inside the buffer. One of the pre-defined split functions, which scans input line by line, behaves exactly this way. For input like:

foo
bar
baz

there is no \n at the end of the last line, so when the ScanLines function cannot find a newline character it simply returns the remaining characters as the last token (source code):

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    input := "foo\nbar\nbaz"
    scanner := bufio.NewScanner(strings.NewReader(input))
    // Not actually needed since ScanLines is the default split function.
    scanner.Split(bufio.ScanLines)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}

Output:

foo
bar
baz
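The same pattern can be reused in a custom split function. The following is a minimal sketch, not the actual ScanLines source, of a function that splits on commas and uses atEOF to hand back whatever is left as the last token. The names scanCommas and splitCommas are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanCommas is a split function that returns comma-separated fields.
func scanCommas(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil // nothing left to process
	}
	if i := bytes.IndexByte(data, ','); i >= 0 {
		// Skip over the comma but keep it out of the token.
		return i + 1, data[:i], nil
	}
	if atEOF {
		// No comma and no more data: the remainder is the final token.
		return len(data), data, nil
	}
	return 0, nil, nil // request more data
}

// splitCommas runs a scanner with scanCommas over input.
func splitCommas(input string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(scanCommas)
	var fields []string
	for scanner.Scan() {
		fields = append(fields, scanner.Text())
	}
	return fields
}

func main() {
	for _, f := range splitCommas("foo,bar,baz") {
		fmt.Println(f)
	}
}
```

Without the atEOF branch the last field would be lost, because no comma follows it.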

2. Token found

This happens when the split function is able to detect a token. It returns the number of bytes to move forward in the buffer and the token itself. Two values are returned because the length of the token doesn’t always equal the number of bytes to move forward. If the input is “foo foo foo” and the goal is to detect words (ScanWords), the split function also skips over the spaces in between:

(4, "foo")
(4, "foo")
(3, "foo")
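The pairs above can be reproduced by calling bufio.ScanWords by hand, without a scanner. This is only a sketch: the steps helper is hypothetical, and it passes atEOF as true on every call because the whole input is already in memory.

```go
package main

import (
	"bufio"
	"fmt"
)

// steps feeds input to bufio.ScanWords directly and records each
// (advance, token) pair the split function returns.
func steps(input string) []string {
	data := []byte(input)
	var out []string
	for len(data) > 0 {
		advance, token, err := bufio.ScanWords(data, true)
		if err != nil || advance == 0 {
			break // guard against errors and against looping forever
		}
		if token != nil {
			out = append(out, fmt.Sprintf("(%d, %q)", advance, token))
		}
		data = data[advance:] // this is what the scanner does with advance
	}
	return out
}

func main() {
	for _, s := range steps("foo foo foo") {
		fmt.Println(s)
	}
}
```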

Let’s see it in action. This function looks only for contiguous occurrences of the string foo (source code):

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "strings"
)

func main() {
    input := "foofoofoo"
    scanner := bufio.NewScanner(strings.NewReader(input))
    split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        // Check the length first so a short or empty slice is never read past its end.
        if len(data) >= 3 && bytes.Equal(data[:3], []byte{'f', 'o', 'o'}) {
            return 3, []byte{'F'}, nil
        }
        if atEOF {
            return 0, nil, io.EOF
        }
        return 0, nil, nil
    }
    scanner.Split(split)
    for scanner.Scan() {
        fmt.Printf("%s\n", scanner.Text())
    }
}

Output:

F
F
F

3. Error

If the split function returns an error, the scanner stops (source code):

package main

import (
    "bufio"
    "errors"
    "fmt"
    "strings"
)

func main() {
    input := "abcdefghijkl"
    scanner := bufio.NewScanner(strings.NewReader(input))
    split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        return 0, nil, errors.New("bad luck")
    }
    scanner.Split(split)
    for scanner.Scan() {
        fmt.Printf("%s\n", scanner.Text())
    }
    if scanner.Err() != nil {
        fmt.Printf("error: %s\n", scanner.Err())
    }
}

Output:

error: bad luck

There is one special error which doesn’t stop the scanner immediately.

ErrFinalToken

The scanner offers an option to signal a so-called final token. It’s a special token which doesn’t break the loop (Scan still returns true), but subsequent calls to Scan return false immediately (source code):

func (s *Scanner) Scan() bool {
    if s.done {
        return false
    }
    ...

It was proposed in #11836 and can be used to stop scanning when a special token is found (source code):

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "strings"
)

func split(data []byte, atEOF bool) (advance int, token []byte, err error) {
    advance, token, err = bufio.ScanWords(data, atEOF)
    if err == nil && token != nil && bytes.Equal(token, []byte{'e', 'n', 'd'}) {
        return 0, []byte{'E', 'N', 'D'}, bufio.ErrFinalToken
    }
    return
}

func main() {
    input := "foo end bar"
    scanner := bufio.NewScanner(strings.NewReader(input))
    scanner.Split(split)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    if scanner.Err() != nil {
        fmt.Printf("Error: %s\n", scanner.Err())
    }
}

Output:

foo
END

Neither io.EOF nor ErrFinalToken is considered a “true” error: the Err method returns nil if either of the two caused the scanner to stop.
