In-depth introduction to bufio.Scanner in Golang

Go ships with a package for buffered I/O, a technique for optimizing read and write operations. For writes, data is stored temporarily before being passed on (to disk or a socket, for example) until a certain size is reached. This triggers fewer write operations, each of which boils down to a syscall that can be expensive when done frequently. For reads, it means retrieving more data in a single operation. This also reduces the number of syscalls and can use the underlying hardware more efficiently, for example by reading data in disk blocks. This post focuses on the Scanner provided by the bufio package.

It helps process a stream of data by splitting it into tokens and removing the space between them:

"foo  bar   baz"

If we’re interested only in words, the scanner helps retrieve “foo”, “bar” and “baz” in sequence:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "foo bar baz"
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output:

foo
bar
baz

Scanner uses buffered I/O while reading the stream — it takes io.Reader as an argument.

If you’re dealing with data in memory, such as a string or a slice of bytes, first check utilities like bytes.Split and strings.Split. When not working with a data stream, it’s probably simpler to rely on those or other goodies from the bytes or strings packages.

Under the hood, the scanner uses a buffer to accumulate the data it reads. When the buffer is not empty or EOF has been reached, the split function (SplitFunc) is called. So far we’ve seen one of the pre-defined split functions, but it’s possible to set any function with the signature:

func(data []byte, atEOF bool) (advance int, token []byte, err error)

The split function is called with the data read so far and can behave in 3 different ways, distinguished by the returned values…

1. Give me more data!

This says that the data passed in is not enough to produce a token. It’s done by returning 0, nil, nil. When that happens, the scanner tries to read more data. If the buffer is full, the scanner doubles its size before reading. Let’s see how it works:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		return 0, nil, nil
	}
	scanner.Split(split)
	buf := make([]byte, 2)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

Output:

false	2	ab
false	4	abcd
false	8	abcdefgh
false	12	abcdefghijkl
true	12	abcdefghijkl

The split function above is very simple and greedy: it always asks for more data. The scanner will try to read more, while also making sure the buffer has enough space. In our case we start with a buffer of size 2:

	buf := make([]byte, 2)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)

After the split function is called for the very first time, the scanner doubles the size of the buffer, reads more data and calls the split function a second time. The same happens after the second call. This is visible in the output: the first call to split gets a slice of size 2, then 4, 8 and finally 12, since there is no more data.

The default buffer size is 4096 bytes.

The atEOF parameter deserves a closer look. It tells the split function that no more data will be available. This happens either when EOF is reached or when a read call returns an error. Once either happens, the scanner will never try to read again. The flag can be used, for example, to return an error (because of an incomplete token), which causes scanner.Scan() to return false and stops the whole process. The error can be checked later using the Err method:

package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		if atEOF {
			return 0, nil, errors.New("bad luck")
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	buf := make([]byte, 12)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("error: %s\n", scanner.Err())
	}
}

Output:

false	12	abcdefghijkl
true	12	abcdefghijkl
error: bad luck

The atEOF parameter can also be used to process what is left in the buffer. One of the pre-defined split functions, which scans input line by line, behaves exactly this way. For input like:

foo
bar
baz

there is no \n at the end of the last line, so when ScanLines cannot find a newline character it simply returns the remaining characters as the last token:

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "foo\nbar\nbaz"
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Not actually needed since it’s the default split function.
	scanner.Split(bufio.ScanLines)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output:

foo
bar
baz

2. Token found

This happens when the split function was able to detect a token. It returns the number of bytes to move forward in the buffer and the token itself. Two values are returned because the length of the token doesn’t always equal the number of bytes to move forward. If the input is “foo foo foo” and the goal is to detect words (ScanWords), the split function also skips over the spaces in between:

(4, "foo")
(4, "foo")
(3, "foo")

Let’s see it in action. This function looks only for contiguous occurrences of the string foo:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	input := "foofoofoo"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// HasPrefix is safe even when fewer than 3 bytes are available,
		// unlike slicing data[:3] directly.
		if bytes.HasPrefix(data, []byte("foo")) {
			return 3, []byte{'F'}, nil
		}
		if atEOF {
			return 0, nil, io.EOF
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

Output:

F
F
F

3. Error

If the split function returns an error, the scanner stops:

package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		return 0, nil, errors.New("bad luck")
	}
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("error: %s\n", scanner.Err())
	}
}

Output:

error: bad luck

There is one special error which doesn’t stop the scanner immediately…

ErrFinalToken

The scanner offers a way to signal a so-called final token. It’s a special token which doesn’t break the loop (Scan still returns true), but subsequent calls to Scan return false immediately:

func (s *Scanner) Scan() bool {
	if s.done {
		return false
	}
	...

It was proposed in issue #11836 and can be used to stop scanning when a special token is found:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func split(data []byte, atEOF bool) (advance int, token []byte, err error) {
	advance, token, err = bufio.ScanWords(data, atEOF)
	if err == nil && token != nil && bytes.Equal(token, []byte{'e', 'n', 'd'}) {
		return 0, []byte{'E', 'N', 'D'}, bufio.ErrFinalToken
	}
	return
}

func main() {
	input := "foo end bar"
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("Error: %s\n", scanner.Err())
	}
}

Output:

foo
END

Neither io.EOF nor ErrFinalToken is considered a “true” error: the Err method returns nil if either of the two caused the scanner to stop.
