File Bottomup.html    Author McKeeman    Copyright © 2007    index

Table-driven Parsing

Read a short history of bottom-up parsing.

The LR(1) Actions

LR(1) parsing is based on precomputing a data structure that holds every possible parsing decision. The data structure is (conceptually) a rectangular matrix, indexed by a current state and the current input token. The entries of the matrix are one of shift, reduce or error. The starting state is 1. The entries of the matrix are used in a loop which repeats until the parse is completed or an error is found and diagnosed. A correct parse will have processed all of the input and end up in the final state corresponding to the grammar goal symbol

Each shift entry causes the parser to report the current token, push the new state and move ahead in the token stream. These shift entries are the DFA to recognize the parse stack, bottom to top, as though it were a sequence.

Each reduce entry represents the recognition of a grammar rule. The reduce entries are the final states of the DFA. When a final state is reached, the parser breaks out of the DFA and executes a non-DFA reduce action. The reducing rule is reported. The top elements of the stack contain states, each corresponding to a symbol the r.h.s. of the reducing rule. The reduce action causes the parser to pop the states matching the r.h.s. off the stack, compute the state-transition for the l.h.s. of the rule, and push this new state on the stack, re-enabling the DFA.

Coding the LR(1) Table

The LR(1) table contains small integer values. Zero represents a syntax error. A shift entry contains the new state. A reduce entry contains the rule to apply. In this implementation, reduce rules are distinguished from shifts by encoding the negative of the rule number.

For example, the simple grammar knuth3.cfg
  rules           numbers
G
  S eof              1
S
  a A d              2
  b A B              3
A 
  c A                4
  c                  5
B
  d                  6

gives rise to the LR(1) transition table
         a      b      c      d     eof     A      B      G      S
 1:      2      3      0      0      0      0      0      0      4
 2:      0      0      5      0      0      6      0      0      0
 3:      0      0      7      0      0      8      0      0      0
 4:      0      0      0      0      9      0      0      0      0
 5:      0      0      5     -5      0     10      0      0      0
 6:      0      0      0     11      0      0      0      0      0
 7:      0      0      7     -5     -5     12     -5      0      0
 8:      0      0      0     13     -7      0     14      0      0
 9:     -1     -1     -1     -1     -1     -1     -1     -1     -1
10:      0      0      0     -4      0      0      0      0      0
11:      0      0      0      0     -2      0      0      0      0
12:      0      0      0     -4     -4      0     -4      0      0
13:      0      0      0      0     -6      0      0      0      0
14:      0      0      0      0     -3      0      0      0      0
and the rule lengths

   rule number 1 2 3 4 5 6
   rule length 2 3 3 2 1 1

Using the LR(1) Tables

Suppose the input text is "a c d eof". Then the following actions take place. The application of a reduction takes two steps; one to apply the rule and one to compute the new state.

in state see new state rule state stack accumulated reports
1 a 2   1 a
2 c 5   2 1 a c
5 d   A = c 5 2 1 a c 5
2 A 6   2 1 a c 5
6 d 11   6 2 1 a c 5 d
11 eof   S = a A d 11 6 2 1 a c 5 d 2
1 S 4   1 a c 5 d 2
4 eof 9   4 1 a c 5 d 2 eof
9   quit G = S eof 9 4 1 a c 5 d 2 eof 1

Building the LR(1) Tables (operational viewpoint).

The LR(1) table construction deals with sets of marked rules, each with a lookahead set. The initial LR(1) set, corresponding to the LR(1) machine start state, contains the goal rule marked in the leftmost position. Its lookahead is empty. For the grammar knuth3 above, that would be the set

      { G = .S eof {} }

where the dot is the mark.

The construction alternates closing a state, then shifting, which creates new states. These new states must be closed. and new shifts computed. As the computation proceeds, the result is a growing set of sets of states of marked rules. The computation continues until there are no more new states.

A state is closed by (recursively) adding marked rules for all definitions of each phrase name immediately to the right of the mark. The newly added rules are always marked at the leftmost position. The lookahead for each new rule is all things that could follow it, which in turn is computed from the heads of its right context.

Shifts from a closed state are computed by moving the mark one place to the right. The shift is on the symbol over which the mark moves. This has the effect of separating the marked rules into small groups, each of which is the initial contents of a new state. The shifted rules carry their lookahead along.

The collection of states and shifts is a DFA that recognizes all possible parse stacks. If the mark has arrived at the right end of a rule, it signifies a final state of the automaton, and therefore a reduce action.

There can be several shifts and several reduce actions for one state. The requirement is that their lookahead sets be disjoint. If they are not, the LR(1) machine contains a shift/reduce conflict or a reduce/reduce conflict.

Read a description of table building.
Follow a worked example to build the LR(1) tables for a very simple expression grammar.
Look at a yacc grammar for X.

Building the LR(1) Tables (mathematical viewpoint).

Read Knuth's 1965 paper, On the Translation of Languages from Left to Right or McKeeman's powerpoint presentation, A New Look at LR(1) or the Dartmouth Undergraduate Compiler Course class notes.


Footnote:

I first learned bottom-up compiling as a research assistant to Niklaus Wirth at Stanford. I attended a presentation of LR(k) by Donald Knuth at Stanford well in advance of the material appearing in print.