Using Parser Combinators in Go

Let’s parse in Golang!

How would you write a parser for the following calculator grammar?

Digit        := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
Number       := Digit+
Multiplicand := Number
              | "(" ^ Expression ^ ")"
Addend       := Multiplicand
              | Multiplicand ^ "*" ^ Addend
Expression   := Addend
              | Addend ^ "+" ^ Expression

There are many options: ANTLR, Yacc, the shunting-yard algorithm etc. Very often they involve code generation or an algorithm that looks nothing like the grammar. If you’re interested in learning an easier and more flexible way of parsing, then you’ll like this blog post and the next one.

I first read about this technique in the book “ML for the Working Programmer”. I recommend this book to every programmer.

We’re going to explore a very simple technique called parser combinators that can handle many LL grammars. On our way we’re going to get caught by every major pitfall. The grammar example above has been purposely designed in such a way that the naive approach fails. In the next blog post we’ll find out what’s going wrong and how to avoid the pitfalls to get a working parser.

If you’re just interested in the final result of this and the next blog post, then you can look at the calculator example code on GitHub or have a look at the result on the Go Playground.

What’s a Parser?

A parser is a function from input to output.

The input of a parser is a sequence of characters. Our characters are plain runes, i.e. Unicode code points.

The output of a parser consists of any result along with the rest of the input. In case parsing fails, the result will be nil. In production code we would use error messages explaining how to fix the input instead of a nil result.
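To make this concrete, here is a minimal sketch of what such types could look like. The names Result, Parser and AnyRune are our own assumptions for illustration and are not necessarily what the library uses.

```go
package main

import "fmt"

// Result is a hypothetical output type: the parsed value plus the unread input.
// A nil Value signals that parsing failed.
type Result struct {
	Value interface{}
	Rest  []rune
}

// Parser is a function from input to output.
type Parser func(input []rune) Result

// AnyRune is a trivial example parser: it consumes exactly one rune,
// whatever it is, and fails only on empty input.
func AnyRune(input []rune) Result {
	if len(input) == 0 {
		return Result{Value: nil, Rest: input}
	}
	return Result{Value: input[0], Rest: input[1:]}
}

func main() {
	var p Parser = AnyRune
	r := p([]rune("héllo"))
	fmt.Printf("value=%c rest=%q\n", r.Value, string(r.Rest))
}
```

Working on []rune rather than on a string means that every parser consumes whole code points and never half of a multi-byte character.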

Now we already know enough to write a simple parser. Let’s start with the parser for the Digit from our example grammar.
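A hand-written version could look roughly like the following sketch. The Result type and the []rune input are assumptions made for this example; only the name ExpectDigit comes from this post.

```go
package main

import "fmt"

// Result is a hypothetical output type; a nil Value means failure.
type Result struct {
	Value interface{}
	Rest  []rune
}

// ExpectDigit parses a single digit without the help of any combinators.
func ExpectDigit(input []rune) Result {
	if len(input) == 0 {
		return Result{Value: nil, Rest: input}
	}
	c := input[0]
	if c == '0' || c == '1' || c == '2' || c == '3' || c == '4' ||
		c == '5' || c == '6' || c == '7' || c == '8' || c == '9' {
		// Success: return the digit and the remaining input.
		return Result{Value: c, Rest: input[1:]}
	}
	// Failure: nil result, input untouched.
	return Result{Value: nil, Rest: input}
}

func main() {
	for _, s := range []string{"7up", "x1", ""} {
		r := ExpectDigit([]rune(s))
		fmt.Printf("%q: ok=%v rest=%q\n", s, r.Value != nil, string(r.Rest))
	}
}
```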

We can run our parser online on The Go Playground.

The parser ExpectDigit is rather ugly! It contains much boiler-plate code. On top of that it looks very different from the grammar. Enough reasons for us to try out another way to implement it.

How to Combine Parsers?

Each operator from the BNF (Backus-Naur form) becomes a method that combines parsers.

+---------+----------------+-------------------------------+
| Grammar |  Method call   |          Description          |
+---------+----------------+-------------------------------+
| p | q   | p.OrElse(q)    | Try parsing with p.           |
|         |                | If parsing fails, try q.      |
| p ^ q   | p.AndThen(q)   | Parse with p.                 |
|         |                | Parse remaining input with q. |
| p+      | p.OnceOrMore() | Repeat parsing with p.        |
|         |                |                               |
| "0"     | Expect('0')    | Succeed iff the input         |
|         |                | starts with this character.   |
| p?      | p.Optional()   | Parse with p or succeed       |
|         |                | without parsing anything.     |
| n/a     | p.Convert(f)   | Apply the function f to the   |
|         |                | result of the parser p.       |
+---------+----------------+-------------------------------+

In the following sections we discuss how to use the parser combinators. For their implementation, see the GitHub repository Parser-Gombinators.

Alternatives

Let’s assume we already have the parser Expect(character rune) that only succeeds if it can find a given character at the beginning of the input. Using this parser, the "0" from the calculator grammar becomes Expect('0') in the implementation. The implementation of ExpectDigit is going to involve Expect('0'), Expect('1') etc. How exactly do we combine them?

The answer is to see the alternative operator | as a method that combines parsers. We call this method OrElse. Using OrElse we can read off the implementation from the grammar.
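Here is a sketch of how this could look. The bodies of Expect and OrElse below are our own minimal guesses, written only so that the example runs; the real implementations live in the repository and may differ.

```go
package main

import "fmt"

// Result is a hypothetical output type; a nil Value means failure.
type Result struct {
	Value interface{}
	Rest  []rune
}

// Parser is a function type, which lets us attach combinator methods to it.
type Parser func(input []rune) Result

// Expect succeeds iff the input starts with the given character.
func Expect(character rune) Parser {
	return func(input []rune) Result {
		if len(input) > 0 && input[0] == character {
			return Result{Value: character, Rest: input[1:]}
		}
		return Result{Value: nil, Rest: input}
	}
}

// OrElse tries p first and falls back to q on the same input if p fails.
func (p Parser) OrElse(q Parser) Parser {
	return func(input []rune) Result {
		if r := p(input); r.Value != nil {
			return r
		}
		return q(input)
	}
}

// Digit := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
var ExpectDigit = Expect('0').
	OrElse(Expect('1')).
	OrElse(Expect('2')).
	OrElse(Expect('3')).
	OrElse(Expect('4')).
	OrElse(Expect('5')).
	OrElse(Expect('6')).
	OrElse(Expect('7')).
	OrElse(Expect('8')).
	OrElse(Expect('9'))

func main() {
	for _, s := range []string{"7up", "x1"} {
		r := ExpectDigit([]rune(s))
		fmt.Printf("%q: ok=%v rest=%q\n", s, r.Value != nil, string(r.Rest))
	}
}
```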

We can read the code aloud. It’s a bit repetitive, but we’ll leave it like this for now. The code is just as repetitive as the grammar.

Repetition

Let’s give the repetition operator + from the grammar the name OnceOrMore. With the help of OnceOrMore the code for the Number parser looks almost like its grammar.
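A possible sketch follows. The OnceOrMore body is an assumption on our side, as are the helpers digit and runesOf, which exist only to keep the example short and printable. We collect the results in a container/list, matching the list of runes mentioned in the next section.

```go
package main

import (
	"container/list"
	"fmt"
)

// Result is a hypothetical output type; a nil Value means failure.
type Result struct {
	Value interface{}
	Rest  []rune
}

type Parser func(input []rune) Result

func Expect(character rune) Parser {
	return func(input []rune) Result {
		if len(input) > 0 && input[0] == character {
			return Result{Value: character, Rest: input[1:]}
		}
		return Result{Value: nil, Rest: input}
	}
}

func (p Parser) OrElse(q Parser) Parser {
	return func(input []rune) Result {
		if r := p(input); r.Value != nil {
			return r
		}
		return q(input)
	}
}

// OnceOrMore applies p until it fails and collects the results in a list.
// It fails if p does not succeed at least once. This sketch assumes that
// p consumes input whenever it succeeds; otherwise the loop never ends.
func (p Parser) OnceOrMore() Parser {
	return func(input []rune) Result {
		results, rest := list.New(), input
		for r := p(rest); r.Value != nil; r = p(rest) {
			results.PushBack(r.Value)
			rest = r.Rest
		}
		if results.Len() == 0 {
			return Result{Value: nil, Rest: input}
		}
		return Result{Value: results, Rest: rest}
	}
}

// digit builds Expect('0').OrElse(Expect('1'))... in a loop to save space.
func digit() Parser {
	p := Expect('0')
	for _, c := range "123456789" {
		p = p.OrElse(Expect(c))
	}
	return p
}

// Number := Digit+
var Number = digit().OnceOrMore()

// runesOf turns the resulting list of runes into a string for printing.
func runesOf(v interface{}) string {
	var out []rune
	for e := v.(*list.List).Front(); e != nil; e = e.Next() {
		out = append(out, e.Value.(rune))
	}
	return string(out)
}

func main() {
	r := Number([]rune("123+4"))
	fmt.Printf("digits=%q rest=%q\n", runesOf(r.Value), string(r.Rest))
}
```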

Converting Results

We haven’t thought about the output of parsers very much until now. So what’s the result of the Number parser above? If we look at the playground example, we’ll find that it’s a list of runes. That’s probably not the best data structure to store numbers. However, given a list of runes, we can easily convert it to a number. We can write a function listToInt(l *list.List) int that does the conversion for us. The method Convert helps us apply the function to the result of a parser. First, we need to change the signature of our converting function to listToInt(l interface{}) interface{}. Secondly, we use Convert as follows.
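The sketch below shows one way this could fit together. Only the names Convert and listToInt and the signature of listToInt come from the text above; the method bodies and the helper digit are assumptions made to get a runnable example.

```go
package main

import (
	"container/list"
	"fmt"
)

// Result is a hypothetical output type; a nil Value means failure.
type Result struct {
	Value interface{}
	Rest  []rune
}

type Parser func(input []rune) Result

func Expect(character rune) Parser {
	return func(input []rune) Result {
		if len(input) > 0 && input[0] == character {
			return Result{Value: character, Rest: input[1:]}
		}
		return Result{Value: nil, Rest: input}
	}
}

func (p Parser) OrElse(q Parser) Parser {
	return func(input []rune) Result {
		if r := p(input); r.Value != nil {
			return r
		}
		return q(input)
	}
}

// OnceOrMore collects one or more results of p in a list.
func (p Parser) OnceOrMore() Parser {
	return func(input []rune) Result {
		results, rest := list.New(), input
		for r := p(rest); r.Value != nil; r = p(rest) {
			results.PushBack(r.Value)
			rest = r.Rest
		}
		if results.Len() == 0 {
			return Result{Value: nil, Rest: input}
		}
		return Result{Value: results, Rest: rest}
	}
}

// Convert applies f to the result of p and passes failures through unchanged.
func (p Parser) Convert(f func(interface{}) interface{}) Parser {
	return func(input []rune) Result {
		r := p(input)
		if r.Value == nil {
			return r
		}
		return Result{Value: f(r.Value), Rest: r.Rest}
	}
}

// listToInt turns a *list.List of digit runes into an int.
func listToInt(l interface{}) interface{} {
	n := 0
	for e := l.(*list.List).Front(); e != nil; e = e.Next() {
		n = n*10 + int(e.Value.(rune)-'0')
	}
	return n
}

// digit builds Expect('0').OrElse(Expect('1'))... in a loop to save space.
func digit() Parser {
	p := Expect('0')
	for _, c := range "123456789" {
		p = p.OrElse(Expect(c))
	}
	return p
}

// Number := Digit+, converted from a list of runes to an int.
var Number = digit().OnceOrMore().Convert(listToInt)

func main() {
	r := Number([]rune("123+4"))
	fmt.Printf("number=%v rest=%q\n", r.Value, string(r.Rest))
}
```

The type assertions inside listToInt are the price of working with interface{}: a wrong input type shows up as a panic at run time instead of a compile error.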

As a side note: in Java, the method Convert would convert a Parser<List<Character>> into a Parser<Integer>. Go had no generics when this post was written (they arrived in Go 1.18) and the combinators here work on interface{}, so we can’t see the effect of Convert in the types.

Concatenation

The last thing we’re missing is the concatenation operator ^ from the grammar. We’ll call it AndThen. It combines two parsers and stores both results in a pair.
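As a rough sketch, AndThen could be written like this. The Pair type and the method body are our own assumptions; the actual pair type in the repository may be named and shaped differently.

```go
package main

import "fmt"

// Result is a hypothetical output type; a nil Value means failure.
type Result struct {
	Value interface{}
	Rest  []rune
}

type Parser func(input []rune) Result

// Pair holds the results of two parsers that ran one after the other.
type Pair struct{ First, Second interface{} }

func Expect(character rune) Parser {
	return func(input []rune) Result {
		if len(input) > 0 && input[0] == character {
			return Result{Value: character, Rest: input[1:]}
		}
		return Result{Value: nil, Rest: input}
	}
}

// AndThen parses with p, then parses the remaining input with q.
// If either parser fails, the combined parser fails on the original input.
func (p Parser) AndThen(q Parser) Parser {
	return func(input []rune) Result {
		first := p(input)
		if first.Value == nil {
			return Result{Value: nil, Rest: input}
		}
		second := q(first.Rest)
		if second.Value == nil {
			return Result{Value: nil, Rest: input}
		}
		return Result{Value: Pair{first.Value, second.Value}, Rest: second.Rest}
	}
}

func main() {
	parens := Expect('(').AndThen(Expect(')'))
	for _, s := range []string{"()x", "(x"} {
		r := parens([]rune(s))
		fmt.Printf("%q: ok=%v rest=%q\n", s, r.Value != nil, string(r.Rest))
	}
}
```

Chaining three parsers with AndThen yields a nested pair, so the result of a ^ b ^ c has the shape Pair{Pair{a, b}, c} in this sketch.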

We have all the combinators we need, at last. It looks like we’re now ready to implement the whole grammar for our calculator. For now we’ll only implement the Multiplicand.
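A sketch of the Multiplicand could look as follows. Since we have no Expression parser yet, the example uses a loudly labelled placeholder (expressionStandIn, which simply reuses Number) so that the code runs; it is not the real Expression rule. All combinator bodies are again our own assumptions, and Number is left unconverted here to keep the listing short.

```go
package main

import (
	"container/list"
	"fmt"
)

// Result is a hypothetical output type; a nil Value means failure.
type Result struct {
	Value interface{}
	Rest  []rune
}

type Parser func(input []rune) Result

type Pair struct{ First, Second interface{} }

func fail(input []rune) Result { return Result{Value: nil, Rest: input} }

func Expect(character rune) Parser {
	return func(input []rune) Result {
		if len(input) > 0 && input[0] == character {
			return Result{Value: character, Rest: input[1:]}
		}
		return fail(input)
	}
}

func (p Parser) OrElse(q Parser) Parser {
	return func(input []rune) Result {
		if r := p(input); r.Value != nil {
			return r
		}
		return q(input)
	}
}

func (p Parser) AndThen(q Parser) Parser {
	return func(input []rune) Result {
		first := p(input)
		if first.Value == nil {
			return fail(input)
		}
		second := q(first.Rest)
		if second.Value == nil {
			return fail(input)
		}
		return Result{Value: Pair{first.Value, second.Value}, Rest: second.Rest}
	}
}

func (p Parser) OnceOrMore() Parser {
	return func(input []rune) Result {
		results, rest := list.New(), input
		for r := p(rest); r.Value != nil; r = p(rest) {
			results.PushBack(r.Value)
			rest = r.Rest
		}
		if results.Len() == 0 {
			return fail(input)
		}
		return Result{Value: results, Rest: rest}
	}
}

// Number := Digit+
func Number() Parser {
	digit := Expect('0')
	for _, c := range "123456789" {
		digit = digit.OrElse(Expect(c))
	}
	return digit.OnceOrMore()
}

// expressionStandIn is a PLACEHOLDER for the Expression rule, not the real thing.
func expressionStandIn() Parser { return Number() }

// Multiplicand := Number | "(" ^ Expression ^ ")"
func Multiplicand() Parser {
	return Number().OrElse(Expect('(').AndThen(expressionStandIn()).AndThen(Expect(')')))
}

func main() {
	for _, s := range []string{"42*3", "(7)+1", "*2"} {
		r := Multiplicand()([]rune(s))
		fmt.Printf("%q: ok=%v rest=%q\n", s, r.Value != nil, string(r.Rest))
	}
}
```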

Conclusion

U̶s̶i̶n̶g̶ ̶p̶a̶r̶s̶e̶r̶ ̶c̶o̶m̶b̶i̶n̶a̶t̶o̶r̶s̶ ̶i̶s̶ ̶s̶t̶r̶a̶i̶g̶h̶t̶-̶f̶o̶r̶w̶a̶r̶d̶ ̶a̶n̶d̶ ̶w̶e̶ ̶l̶e̶a̶v̶e̶ ̶t̶h̶e̶ ̶i̶m̶p̶l̶e̶m̶e̶n̶t̶a̶t̶i̶o̶n̶ ̶o̶f̶ ̶t̶h̶e̶ ̶r̶e̶s̶t̶ ̶o̶f̶ ̶t̶h̶e̶ ̶p̶a̶r̶s̶e̶r̶ ̶f̶o̶r̶ ̶o̶u̶r̶ ̶e̶x̶a̶m̶p̶l̶e̶ ̶g̶r̶a̶m̶m̶a̶r̶ ̶t̶o̶ ̶t̶h̶e̶ ̶r̶e̶a̶d̶e̶r̶.̶ If only it was this simple.

Using parser combinators is not that easy. Until now we only know the basic idea behind them. We have a naive intuition of how to use them but there are some Gotchas that will make the naive approach fail. Read about these and one more Bonus Gotcha in the next blog post!