File ChangingParser.html    Author McKeeman    Copyright © 2007    index

Changing the xcom Parser

Preliminaries

In MATLAB, in the xcom directory, type (or cut from here and paste to the MATLAB command line)

>> help Parser.m
>> xcom -parseTrace -parseDump x:=1
or, to run the Parser standalone
>> load cfg cfg               % prepare to run Parser standalone
>> EOL = 10;                  % ASCII newline
>> lx = Lexer(cfg, ['x:=1;' EOL 'y:=x']);
>> pr = Parser(lx)
The public fields and methods are visible, so try
>> pr.dump()

Read the general description of parsing.

LR Parsing

The LR(1) parser is implemented in 30 or so lines in method LR1 of Parser.m. There is no reason to change these lines. Ever. The LRerror method, on the other hand, can always use improvement.

Recursive Parsing

The recursive recognition methods in Parser.m reflect the IOG for X. If you change the CFG, then the change needs to be propagated to the IOG. The recursive methods need to reflect the new/changed CFG rules.

Each recursive method calls one or more of accept, reject, step and reduce. Each call of step moves ahead in the token stream, changing variable tok, and reporting the shifted token. A call to accept first checks the current token is what is expected, calls shift if it is, and calls reject if it is not. A call to reject issues a diagnostic and aborts the compilation.

Immediately after a successful test for a token, for example

    if tok == rw.PLUS
the step method is always called to discard the current token and get the next one. Method accept calls step on its own.

Each call of reduce reports a reduction. The calls to reduce occur exactly at the place the output symbols indicate in the IOG. The combined effect is to build the shift/reduce sequence.

The only consumer of the shift/reduce sequence is the tree builder, and it needs no other information.

Hard Cases (the real world)

One can write a CFG that is difficult to parse recursively, even if it satisfies the LR(1) restrictions. On the other hand, the recursive author can cheat as necessary, basing decisions on any information that is available. The bottom-up parser needs correct tables or it fails.

The C if-else construct is ambiguous. Recursive parsers can handle it. LR(1) parsers only work for unambiguous languages, so they fail. Most commercial LR parser generators contain a built-in hack to handle if-else, and give usable tables without even a complaint.

The C declarator and abstract-declarator are very similar constructs, and are hard (but not impossible) to parse recursively. They do not cause the LR(1) parser any trouble at all.

The C typedef mechanism cannot be parsed without applying some magic. The fundamental problem is that the defined type identifier is no longer syntactically an identifier, but rather becomes a type name. Of course every C compiler has to solve this problem. A traditional (but structure-destroying) solution is to let the symbol table send information back upstream to the lexer so that the lexer can deliver a type token instead of an identifier token. There are better ways to deal with typedef.

C macros are that much harder again. Grammars just do not do the job when the source text is changing on the fly.