1. 程式人生 > >Tutorial: Write a Finite State Machine to parse a custom language in pure Python

Tutorial: Write a Finite State Machine to parse a custom language in pure Python

1. Analyze the structure

First a couple simple examples of the POSH Syntax one per line (3 examples):

VB(noise+3)
NNS(acoustics) & RB(not)
(NNS(acoustics) & RB(not)) | JJ(muted)

Let’s not dive too deep into the meaning of my syntax. Instead we are going to focus on the stream of characters and how they may be parsed. A couple quick observations:

  • There is a prefix like “VB” then a parenthesis set.
  • Inside the parenthesis is a subject
  • each of those prefix + ( + subject + ) we can call a Rule
  • then there may be an operator that strings rules together like ‘&’ or ‘|’
  • sometimes there are groups of things also encapsulated by a parenthesis.
  • if I look left to right, character per character, I observe what options from that spot to the next will modify state

2. Draw a state diagram

It’s easiest for me if I simply start with one state and I ask myself, “self, where can I go from here?” For example if start with the state PREFIX while looking at the very first character of the first example the ‘V’ (in “VB(noise+3)”) I see I can only go two places, 1) to another character, in this case the ‘B’ or 2) I can go to the first ‘(‘ parenthesis. If I graph this out it looks like this:

Here I am saying when I go to the next character I am either still in the prefix or I move to the subject state if I ever to see the “(“ character.

We need to do this for every single state. We just keep going character per character and ask ourselves what are the possible options and what are the results of those options. Then end diagram looks like this:

POSH Syntax State Diagram with transitions

Notice I have given each transition a name, each state as well.

A note on coverage: This is where TDD (test driven development) comes it. A later goal will be to try to cover (run a “coverage” report) every transition and every state.

3. List our transitions and states and convert them to code

For the transitions I will give each a short all caps name then a lower case longer name (so I can implement those in Python later):

T_SKIP = transition_skipT_NEW_GROUP = transition_new_groupT_APPEND_CHAR_PRE = transition_append_preT_ADD_OP = transition_add_opT_ADD_OP_NEW_RULE = transition_add_op_new_ruleT_END_GROUP = transition_end_groupT_END_RULE = transition_end_ruleT_APPEND_CHAR_SUBJ = transition_append_subjT_ADD_GROUP_OP = transition_add_op_new_group

Now for the states, give each one a short name and then a longer full text name to help identify them.

S_NEW_GROUP = “STATE: NEW_GROUP”S_END_GROUP = “STATE: END_GROUP”S_PRE = “STATE: PREFIX”S_OP = “STATE: OPERATOR”S_END_RULE = “STATE: END_RULE”S_SUBJ = “STATE: SUBJECT”

4. Create a transition table of state changes

Our transition table must contain one entry for each:

  • start state (src)
  • end state (dst)
  • rule for transition (condition)
  • callback for the transition (callback)

In Python it will look something like this:

# For transition 1{‘src’: S_NEW_GROUP, ‘dst’: S_PRE, ‘condition’: “[A-Za-z|+|-|\d]”, ‘callback’: T_APPEND_CHAR_PRE} 

Now we number all of our possible stage changes so we are sure we don’t miss one.

POSH Syntax State Diagram with transitions

The end table looks something like this:

5. Complete our program class design

We start with three class:

  1. A Rule class: It will hold our Prefix, Suffix, and operator (left hand side)
  2. A RuleGroup: This will have a parent RuleGroup — poor man’s pushdown automata
  3. And the Rule_Parse_FSM class that main goal is to iterate over the input string and handle the transitions while callling the callbacks.

Rule Class:

RuleGroup Class:

And finally the Rule_Parse_FSM class:

A quick overview of the parse logic:

  • Run take the input string character by character and sends it to process next.
  • process takes only the relevant transitions (those with matching states to current state) and sends the character to an evaluator
  • The evaluator (iterate_re_evaluators) takes the cmd logic (which happens to be a very short regex statement) I stress Very short and not confusing, I hope. If it matches it fires a state change
  • The state changer stores a new stage and calls the call backs

Now we need callbacks that do something different depending on the transition from state to to state. We already outlined them but we need to now write each. Here is what we end up with:

It’s very convenient to pass in the fsm_obj which is an instance of Rule_Parse_FSM class. Each transition does something different. Let’s take transition_add_op for example. That is when we hit a operator like the ’&’ or the ‘|’ we find the Rule it belongs to (which will eventually be the complete left hand side of the rule) and assign that operator to the instance.

6. Test our Program

The complete program may be located here.

Running this files gives us:

N -> STATE: NEW_GROUP : STATE: PREFIXN -> STATE: PREFIX : STATE: PREFIX( -> STATE: PREFIX : STATE: SUBJECTh -> STATE: SUBJECT : STATE: SUBJECTe -> STATE: SUBJECT : STATE: SUBJECTl -> STATE: SUBJECT : STATE: SUBJECTl -> STATE: SUBJECT : STATE: SUBJECTo -> STATE: SUBJECT : STATE: SUBJECT) -> STATE: SUBJECT : STATE: END_RULEskip ' ' in STATE: END_RULE& -> STATE: END_RULE : STATE: OPERATORskip ' ' in STATE: OPERATORN -> STATE: OPERATOR : STATE: PREFIXN -> STATE: PREFIX : STATE: PREFIX( -> STATE: PREFIX : STATE: SUBJECTw -> STATE: SUBJECT : STATE: SUBJECTo -> STATE: SUBJECT : STATE: SUBJECTr -> STATE: SUBJECT : STATE: SUBJECTl -> STATE: SUBJECT : STATE: SUBJECTd -> STATE: SUBJECT : STATE: SUBJECT) -> STATE: SUBJECT : STATE: END_RULE
<RuleGroup: {'rules': [<Rule:  NN(hello)>, <Rule: & NN(world)>], 'level': 0, 'rule_count': 2, 'parent': None, 'op': None}>

Now we have completely implemented our state machine in 200 lines of Python code. Yay! ?

Again the final code may be located HERE