CS48 Implementation of Programming Languages

LR(0)

Review

  • S/R sequence, recursive parsers

Automatic Parsing

Given a reasonable grammar, you already know how to build a S/R recursive parser. This technique is called top-down (or sometimes miscalled LL). There is another way (called bottom-up or LR). Both yacc and bison use it.

Simpler than Proposition

The following grammar is about the simplest natural grammar that requires lookahead.
P = D eof     r0
D = D | C     r1
D = C         r2
C = C & B     r3
C = B         r4
B = t         r5

The LR(0) Machine

There is a construction that leads to an LR(0) machine which is capable of producing a S/R sequence. Why this works is discussed later. For the time being, build the LR(0) machine for the grammar above. It will only partially work. It can be extended to be an LR(1) machine, which will work.

The construction is a kind of song with two alternating verses:

  • completion
  • transition
The construction builds a set of sets of marked rules. The starting position is a set containing the first rule (r0), marked (with .) in the leftmost place:

P = .D eof
Then sing the completion verse: add all rules defining any non-terminal immediately to the right of the mark, marking each of them in the leftmost place.
0-------------+
| P = .D eof  |
| D = .D | C  |      +-+
| D = .C      |   =  |0|
| C = .C & B  |      +-+
| C = .B      |
| B = .t      |
+-------------+
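
As a rough illustration, the completion verse can be sketched in Python. The representation is an assumption made for this sketch, not something these notes prescribe: a marked rule is a pair (rule index, mark position), and the names RULES, completion and show are hypothetical.

```python
# Grammar from the notes: (name, lhs, rhs) with rhs as a tuple of symbols.
RULES = [
    ("r0", "P", ("D", "eof")),
    ("r1", "D", ("D", "|", "C")),
    ("r2", "D", ("C",)),
    ("r3", "C", ("C", "&", "B")),
    ("r4", "C", ("B",)),
    ("r5", "B", ("t",)),
]
NONTERMINALS = {lhs for _, lhs, _ in RULES}


def completion(items):
    """Add every rule defining a non-terminal that sits right after a mark.

    An item is (rule_index, mark_position); added items are marked at 0.
    """
    result = list(items)              # a list keeps the order of the boxes
    for rule_index, mark in result:   # the list grows while it is walked
        rhs = RULES[rule_index][2]
        if mark == len(rhs) or rhs[mark] not in NONTERMINALS:
            continue                  # complete rule, or a terminal after the mark
        for j, (_, lhs, _) in enumerate(RULES):
            if lhs == rhs[mark] and (j, 0) not in result:
                result.append((j, 0))
    return result


def show(item):
    """Render an item as text, with a free-standing '.' as the mark."""
    rule_index, mark = item
    _, lhs, rhs = RULES[rule_index]
    return f"{lhs} = " + " ".join(rhs[:mark] + (".",) + rhs[mark:])


if __name__ == "__main__":
    for item in completion([(0, 0)]):
        print(show(item))
```

Calling completion([(0, 0)]) on the starting set returns all six rules marked in the leftmost place, which is the content of set 0.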
In this case, every rule gets included. Usually that is not so. Box the set, and number it (0 in this case). If it has an entry with the "." at the right end of a rule, mark the box with "." corners. This will remind us that the state has a complete rule, and therefore calls for a reduce action. Then four verses of the transition song must be sung.

+-+    D   1-------------+
|0|--+---> | P = D .eof  |
+-+  |     | D = D .| C  |
     |     +-------------+
     |
     | C   2-------------.
     +---> | D = C.      |
     |     | C = C .& B  |
     |     .-------------.
     |
     | B   3-------------.
     +---> | C = B.      |
     |     .-------------.
     |
     | t   4-------------.
     +---> | B = t.      |
           .-------------.

In each case, the completion verse does not need to be sung because there are no non-terminals immediately to the right of the marker in any of the four new sets. Continuing, singing the transition verse in set 1 gives two new sets, 5 and 6. Set 6 has a non-terminal to the right of the marker, so sing the completion verse there.

+-+    D  +-+    eof 5------------.
|0|--+--->|1|--+---->| P = D eof. |
+-+  |    +-+  |     .------------.
     |         |
     |         | |   6------------+
     .         +---->| D = D | .C |
     .               | C = .C & B |
                     | C = .B     |
                     | B = .t     |
                     +------------+

There are 3 transitions from set 6, only one of which yields a new set. Sets 3 and 4 have already shown up on transitions out of set 0. There is one transition out of set 7.

+-+    D  +-+    eof .-.
|0|--+--->|1|--+---->|5|
+-+  |    +-+  |     .-.
     |         |
     |         | |   +-+    C  7------------.
     .         +---->|6|--+--->| D = D | C. |
     .               +-+  |    | C = C .& B |
                          |    .------------.
                          |
                          | B  .-.
                          +--->|3|
                          |    .-.
                          |
                          | t  .-.
                          +--->|4|
                               .-.

The single transition out of set 7, on &, gives set 8, which needs the completion verse.

+-+    D  +-+    eof .-.
|0|--+--->|1|--+---->|5|
+-+  |    +-+  |     .-.
     |         |
     |         | |   +-+    C  .-. &  8------------+
     .         +---->|6|--+--->|7|--->| C = C & .B |
     .               +-+  |    .-.    | B = .t     |
                          |           +------------+
                          |
                          | B  .-.
                          +--->|3|
                          |    .-.
                          |
                          | t  .-.
                          +--->|4|
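                               .-.

Set 8 has two transitions. Only the one on B yields a new set, 9; the one on t leads back to set 4.

+-+    D  +-+    eof .-.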
|0|--+--->|1|--+---->|5|
+-+  |    +-+  |     .-.
     |         |
     |         | |   +-+    C  .-. &  +-+    B  9------------.
     .         +---->|6|--+--->|7|--->|8|--+--->| C = C & B. |
     .               +-+  |    .-.    +-+  |    .------------.
                          |                |    
                          |                | t  .-.
                          | B  .-.         +--->|4|
                          +--->|3|              .-.
                          |    .-.
                          |
                          | t  .-.
                          +--->|4|
                               .-.

Putting it all together... we have the LR(0) machine

+-+    D  +-+    eof .-.
|0|--+--->|1|--+---->|5|
+-+  |    +-+  |     .-.
     |         |
     |         | |   +-+    C  .-. &  +-+    B  .-.
     |         +---->|6|--+--->|7|--->|8|--+--->|9|
     |               +-+  |    .-.    +-+  |    .-.
     | C  .-. &  +-+      |                |    
     +--->|2|--->|8|      |                | t  .-.
     |    .-.    +-+      | B  .-.         +--->|4|
     |                    +--->|3|              .-.
     | B  .-.             |    .-.
     +--->|3|             |
     |    .-.             | t  .-.
     |                    +--->|4|
     | t  .-.                  .-.
     +--->|4|
          .-.
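
Alternating the two verses until no new set appears can be sketched as a worklist loop. The Python below is an illustration under assumptions: items are (rule index, mark position) pairs, the function names are hypothetical, and the numbering of sets depends on the order in which transitions are explored, so it need not match the figure.

```python
RULES = [
    ("P", ("D", "eof")),        # r0
    ("D", ("D", "|", "C")),     # r1
    ("D", ("C",)),              # r2
    ("C", ("C", "&", "B")),     # r3
    ("C", ("B",)),              # r4
    ("B", ("t",)),              # r5
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def completion(kernel):
    """Completion verse: pull in the rules for non-terminals after a mark."""
    items = set(kernel)
    work = list(kernel)
    while work:
        rule, mark = work.pop()
        rhs = RULES[rule][1]
        if mark < len(rhs) and rhs[mark] in NONTERMINALS:
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[mark] and (j, 0) not in items:
                    items.add((j, 0))
                    work.append((j, 0))
    return frozenset(items)


def transition(items, symbol):
    """Transition verse: move the mark over `symbol`, then complete."""
    moved = {(r, m + 1) for r, m in items
             if m < len(RULES[r][1]) and RULES[r][1][m] == symbol}
    return completion(moved)


def build_machine():
    """Return (list of item sets, {(set number, symbol): set number})."""
    states = [completion({(0, 0)})]
    edges = {}
    for n, items in enumerate(states):      # `states` grows during the loop
        symbols = sorted({RULES[r][1][m] for r, m in items
                          if m < len(RULES[r][1])})
        for symbol in symbols:
            target = transition(items, symbol)
            if target not in states:
                states.append(target)
            edges[(n, symbol)] = states.index(target)
    return states, edges


if __name__ == "__main__":
    states, edges = build_machine()
    print(len(states), "sets,", len(edges), "transitions")
```

For this grammar the sketch finds ten sets and thirteen transitions, the same counts as in the figure.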

The LR(0) machine can be applied to an input string. If the grammar is not LR(0) (as was claimed above for this one), the machine will fail in some circumstances. That failure can often be worked around with two cheap tricks:
  1. apply 'see' from the table if you can
  2. otherwise replace the longest rhs (to the left of the last state digit) with its lhs.
At any moment in time, any symbol that has been 'seen' plus everything to the left of it is called the parse stack.

Every time a terminal symbol s is passed, there is a side effect of calling shift(s). Every time a rule r is applied, there is a side effect of calling reduce(r). The fact that there is no "see" entry for any "apply" row indicates that this is an LR(0) machine (no lookahead). This LR(0) machine with the "cheap tricks" and a recursive parser yield the same S/R sequence.

In tabular form
(lookahead to be added later):

in      see     go to   apply
state           state   rule
0       D       1
0       C       2
0       B       3
0       t       4
1       eof     5
1       |       6
2       &       8
2                       r2
3                       r4
4                       r5
5                       r0
6       C       7
6       B       3
6       t       4
7       &       8
7                       r1
8       B       9
8       t       4
9                       r3

Applied to the input t | t & t eof, the machine proceeds as follows. Each line shows the parse stack, the remaining input, and the action (if any) that produced the line:

stack                 input  action
0             t | t & t eof
0t4             | t & t eof  shift(t)
0B              | t & t eof  reduce(r5)
0B3             | t & t eof
0C              | t & t eof  reduce(r4)
0C2             | t & t eof
0D              | t & t eof  reduce(r2)
0D1             | t & t eof
0D1|6             t & t eof  shift(|)
0D1|6t4             & t eof  shift(t)
0D1|6B              & t eof  reduce(r5)
0D1|6B3             & t eof
0D1|6C              & t eof  reduce(r4)
0D1|6C7             & t eof
0D1|6C7&8             t eof  shift(&)
0D1|6C7&8t4             eof  shift(t)
0D1|6C7&8B              eof  reduce(r5)
0D1|6C7&8B9             eof
0D1|6C                  eof  reduce(r3)
0D1|6C7                 eof
0D                      eof  reduce(r1)
0D1                     eof
0D1eof5                      shift(eof)
0P                           reduce(r0) and quit
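
The table and the two cheap tricks can be exercised with a small driver. This is a sketch in Python under assumptions: the dictionaries GOTO, APPLY and RULES are a hand transcription of the table above, the function name parse is hypothetical, and the stack holds only state numbers (the symbols between them are implied by the states).

```python
# (state, symbol) -> next state: the "see" / "go to state" columns.
GOTO = {
    (0, "D"): 1, (0, "C"): 2, (0, "B"): 3, (0, "t"): 4,
    (1, "eof"): 5, (1, "|"): 6,
    (2, "&"): 8,
    (6, "C"): 7, (6, "B"): 3, (6, "t"): 4,
    (7, "&"): 8,
    (8, "B"): 9, (8, "t"): 4,
}
# state -> rule: the "apply rule" column.
APPLY = {2: "r2", 3: "r4", 4: "r5", 5: "r0", 7: "r1", 9: "r3"}
# rule -> (lhs, length of rhs)
RULES = {
    "r0": ("P", 2), "r1": ("D", 3), "r2": ("D", 1),
    "r3": ("C", 3), "r4": ("C", 1), "r5": ("B", 1),
}


def parse(tokens):
    """Return the S/R sequence for `tokens`, or raise SyntaxError."""
    states = [0]
    actions = []
    pos = 0
    while True:
        state = states[-1]
        look = tokens[pos] if pos < len(tokens) else None
        if look is not None and (state, look) in GOTO:
            # Trick 1: follow a "see" entry whenever there is one.
            states.append(GOTO[(state, look)])
            actions.append(f"shift({look})")
            pos += 1
        elif state in APPLY:
            # Trick 2: otherwise replace the rhs on the stack with its lhs.
            rule = APPLY[state]
            lhs, length = RULES[rule]
            del states[-length:]
            actions.append(f"reduce({rule})")
            if rule == "r0":
                return actions            # reduce(r0) and quit
            states.append(GOTO[(states[-1], lhs)])
        else:
            raise SyntaxError(f"stuck in state {state} at input position {pos}")


if __name__ == "__main__":
    print(" ".join(parse(["t", "|", "t", "&", "t", "eof"])))
```

For the input t | t & t eof this sketch produces fifteen actions, starting with shift(t) and ending with shift(eof) and reduce(r0).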

Lookahead

There are various ways to fill in the "see" column for lookahead of 1. One can pick lookahead out of the LR(0) machine. Or there is a way to pick lookahead out of the grammar. The lookahead is examined only when deciding to apply a rule. Once the lookahead is tabulated, the two cheap tricks above can be forgotten. In this case, only states 2 and 7 have LR(0) shift/reduce conflicts. For other reduce cases, there is no shift, so any lookahead will do. Given the correct lookahead symbols, the final table is:

in      see     go to   apply
state           state   rule
0       D       1
0       C       2
0       B       3
0       t       4
1       eof     5
1       |       6
2       &       8
2       |               r2
2       eof             r2
3                       r4
4                       r5
5                       r0
6       C       7
6       B       3
6       t       4
7       &       8
7       |               r1
7       eof             r1
8       B       9
8       t       4
9                       r3
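
One way to pick lookahead out of the grammar is to compute FOLLOW sets, as in the SLR(1) technique. This is offered as an illustration and is not necessarily the method this course has in mind. The Python sketch below relies on the fact that this grammar has no empty rules, and the names first_sets and follow_sets are hypothetical.

```python
RULES = [
    ("P", ["D", "eof"]),        # r0
    ("D", ["D", "|", "C"]),     # r1
    ("D", ["C"]),               # r2
    ("C", ["C", "&", "B"]),     # r3
    ("C", ["B"]),               # r4
    ("B", ["t"]),               # r5
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def first_sets():
    """Terminals that can begin each non-terminal (no empty rules assumed)."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def follow_sets():
    """Terminals that can appear immediately after each non-terminal."""
    first = first_sets()
    follow = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(rhs):
                    nxt = rhs[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    new = follow[lhs]     # sym ends the rule: inherit from lhs
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


if __name__ == "__main__":
    for name, symbols in sorted(follow_sets().items()):
        print(name, sorted(symbols))
```

FOLLOW(D) comes out as | and eof, which are the lookahead symbols listed for r2 in state 2 and for r1 in state 7.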

Knuth originally proposed LR(k) for a lookahead of k symbols for each reduction. For languages in use in 1965, LR(1) tables were too large for the memories of ordinary computers (for example, a max of 16K 48-bit words on a Burroughs B5000). LR(2) tables were far larger. The field quickly developed memory-efficient versions of LR(1) technology. A technique called LALR(1) (used by yacc and bison) dominated.

Anyway, language designers now accept as a constraint that the user should have to look ahead no more than one symbol to figure out what the program text means, and the way to check that property is to pass the grammar through a parser generator like yacc or bison. Java and ISO C are LR(1) but are often implemented recursively.

Nowadays, however, Knuth's original LR(1) tables fit easily into memory. So, for this course, we avoid the clever optimizations such as LALR(1), and use LR(1) directly. The LR(1) machine construction is very similar to the LR(0) construction just completed. But it is not much fun to do by hand.

Tom Pennello implemented LR(*), which tries to be LR(0), and only at the points where it fails, tries LR(1), then LR(2), until either it succeeds or the user gets tired of waiting (LR(*) might take forever). This turned out to be a dead end for programming languages because the main operator/operand structure rarely profits from lookahead of more than 1.

Failure of LR(k)

Construct the LR(0) machine for the following (ambiguous) grammar

P = X
X = X X
X = x
You cannot pick the lookahead correctly no matter what means you use, because there are two correct answers for xxx and a deterministic parser will get only one of them.

Said differently, if a language is LR(k) for any k, it is unambiguous. Nice property.
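
The ambiguity can be made concrete by counting derivation trees. The following is a minimal Python sketch, with the hypothetical function count_parses, that counts the distinct trees deriving a string of n x's from X under the rules X = X X and X = x.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(n):
    """Number of distinct derivation trees of X for a string of n x's (n >= 1)."""
    if n == 1:
        return 1                      # only X = x applies
    # X = X X: the left X covers the first k symbols, the right X the rest.
    return sum(count_parses(k) * count_parses(n - k) for k in range(1, n))


if __name__ == "__main__":
    for n in range(1, 6):
        print("x" * n, count_parses(n))
```

count_parses(3) is 2: the two correct answers for xxx group the string as (xx)x and x(xx).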

Why this all works

It turns out that the parse stack is a regular language. The LR(0) machine is a DFA that recognizes it. The two-verse song is actually the application of NFA-to-DFA transformation. One can write down the NFA directly from the grammar. Of course there is a lot more detail. See Chapter 5 of my online book.

Roughly, the story goes as follows. Make an NFA grammar from the original cfg. The terminals and non-terminals of the cfg are the terminals of the new grammar. Make a new bunch of non-terminals of the form [X=Y.Z], one for each marked rule. The new NFA grammar corresponding to the cfg for P above is:

[P=.Deof] = D[P=D.eof]
[P=D.eof] = eof[P=Deof.]
[P=Deof.] = 
[P=.Deof] = [D=.D|C] 
[P=.Deof] = [D=.C] 
[D=.D|C]  = D[D=D.|C]
[D=D.|C]  = |[D=D|.C]
[D=D|.C]  = C[D=D|C.]
[D=D|C.]  =
[D=.C]    = C[D=C.]
[D=C.]    =
[D=.D|C]  = [D=.D|C]
[D=.D|C]  = [D=.C]
[D=D|.C]  = [C=.C&B]
[D=D|.C]  = [C=.B]
[D=.C]    = [C=.C&B]
[D=.C]    = [C=.B]
[C=.C&B]  = C[C=C.&B]
[C=C.&B]  = &[C=C&.B]
[C=C&.B]  = B[C=C&B.]
[C=C&B.]  = 
[C=.C&B]  = [C=.C&B]
[C=.C&B]  = [C=.B]
[C=C&.B]  = [B=.t]
[C=.B]    = B[C=B.]
[C=B.]    = 
[C=.B]    = [B=.t]
[B=.t]    = t[B=t.]
[B=t.]    = 

Apply the NFA to DFA transformation. That is the LR(0) machine.
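
The same route can be sketched in Python: treat each marked rule as an NFA state, read the NFA moves off the grammar, and apply the usual subset construction. This is an illustration under assumptions, not the book's presentation: marked rules are (rule index, mark position) pairs, a label of None stands for an empty move, and all function names are hypothetical.

```python
RULES = [
    ("P", ("D", "eof")),        # r0
    ("D", ("D", "|", "C")),     # r1
    ("D", ("C",)),              # r2
    ("C", ("C", "&", "B")),     # r3
    ("C", ("B",)),              # r4
    ("B", ("t",)),              # r5
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def nfa_moves(item):
    """Yield (label, next_item) pairs; label None is an empty move."""
    rule, mark = item
    rhs = RULES[rule][1]
    if mark == len(rhs):
        return                              # complete rule: no moves
    symbol = rhs[mark]
    yield symbol, (rule, mark + 1)          # e.g. [P=.Deof] = D[P=D.eof]
    if symbol in NONTERMINALS:
        for j, (lhs, _) in enumerate(RULES):
            if lhs == symbol:
                yield None, (j, 0)          # e.g. [P=.Deof] = [D=.C]


def epsilon_closure(items):
    """All NFA states reachable from `items` by empty moves alone."""
    closed = set(items)
    work = list(items)
    while work:
        for label, nxt in nfa_moves(work.pop()):
            if label is None and nxt not in closed:
                closed.add(nxt)
                work.append(nxt)
    return frozenset(closed)


def subset_construction():
    """Return ({DFA state: number}, {(number, symbol): number})."""
    start = epsilon_closure({(0, 0)})
    dfa_states = {start: 0}
    dfa_edges = {}
    work = [start]
    while work:
        current = work.pop()
        by_label = {}
        for item in current:
            for label, nxt in nfa_moves(item):
                if label is not None:
                    by_label.setdefault(label, set()).add(nxt)
        for label, targets in by_label.items():
            target = epsilon_closure(targets)
            if target not in dfa_states:
                dfa_states[target] = len(dfa_states)
                work.append(target)
            dfa_edges[(dfa_states[current], label)] = dfa_states[target]
    return dfa_states, dfa_edges


if __name__ == "__main__":
    dfa_states, dfa_edges = subset_construction()
    print(len(dfa_states), "DFA states,", len(dfa_edges), "transitions")
```

The epsilon closure plays the role of the completion verse and the labelled moves play the role of the transition verse. For this grammar the sketch ends with ten DFA states and thirteen transitions.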


Created: April 12, 2001
Last modified: March 26, 2007