File ChangingX.html    Author McKeeman    Copyright © 2007

Changing the X CFG

Changing X

The first action in changing the X language is changing the X CFG. The initial X CFG describes an implementable language.

A new construct is added to X by adding one or more rules to the CFG. The new rule(s) must also be implementable, which means that corresponding changes to the rest of the compiler must be possible. Knowing how to design implementable rules is an art that will improve with practice. You should, of course, use a new name for your new CFG. If the Marx Brothers did a language project, they might have called their grammar CHGGZ.cfg.

The simplest way to start is to pick a rule that "is like" the others, then carry through the implementation, as in the warm up exercise. After a while, you can be more adventurous because you will better understand the implications of your choices.

The Original X CFG

The easiest way to get familiar with the X CFG is to run

>> tryCfg();
One of the trials reads and dumps the result of CFG analysis for X.cfg.

Perhaps even better, do the test steps by hand one at a time:

>> grtxt = xread('X.cfg')
You will see the raw text of the current X.cfg. Then try
>> grobj = Cfg(grtxt)
You will see the Cfg object (which is itself not very interesting). The object output does, however, hint at more information. The sizes of V, Vi, Vn, and the reserved word count are explicit. The difference in size between Vi and reserved is explained by the three "special" input symbols id, integer and real, which are not reserved names, but rather "reserved classes" handled later by the lexer. If you want all the data in one go, just type
>> grobj.dump()
otherwise just look at one public object field at a time, as in
>> grobj.Vn

Everyday Cfgs

The format of an everyday cfg is given by three simple rules.

  1. A single symbol starting in column 1 is a phrase name; it determines the current rule; subsequent lines are definitions.
  2. A line starting with a blank (or an empty line) is a phrase definition for the current rule.
  3. The end of file terminates the cfg (watch out for unintended empty lines at the end).
For example, the following 4 (four) definitions are used for stmt, including the empty statement.
stmt

  selection
  iteration
  assignment
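
The three format rules can be sketched in a few lines of MATLAB. This is only an illustration of the format, not the code in Cfg.m; the variable names (txt, textLines, rules, lhs) are made up for the example.

```matlab
% Sketch only: a stand-alone reader for the everyday cfg format.
% Names here are illustrative and are not taken from Cfg.m.
txt = sprintf('stmt\n\n  selection\n  iteration\n  assignment');

textLines = regexp(txt, '\n', 'split');   % one cell per line, empty lines kept
rules = cell(0, 2);                       % each row: {phrase name, rhs symbols}
lhs   = '';                               % current phrase name
for k = 1:numel(textLines)
    ln = textLines{k};
    if ~isempty(ln) && ln(1) ~= ' '
        lhs = strtrim(ln);                % rule 1: a symbol in column 1 names the phrase
    else
        body = strtrim(ln);               % rule 2: anything else is a definition
        if isempty(body)
            rhs = {};                     % an empty line is the empty definition
        else
            rhs = regexp(body, '\s+', 'split');
        end
        rules(end+1, :) = {lhs, rhs};
    end
end                                       % rule 3: end of text ends the cfg

for k = 1:size(rules, 1)
    fprintf('%s: definition with %d symbol(s)\n', rules{k, 1}, numel(rules{k, 2}));
end
```

Because an empty line counts as a definition, a stray newline at the end of txt would add one more empty definition to stmt, which is the pitfall mentioned in rule 3.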

Manufactured Rule Names

The rule names are of particular interest. Try

>> grobj.ruleNames
What you will see is manufactured rule names, one for each rule of the CFG. The Cfg object makes the names by catenating all the characters in the rule, inserting an underbar ('_') between the phrase name and each definition, and using character identifiers for each non-letter. The result is a name that is visually identifiable with a particular grammar rule, and will not change unless that grammar rule itself changes. This is a plus when you are changing rules during development. The trickery, a kind of name mangling, is tucked away in object idCtor.m.
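
To get a feel for this kind of mangling, here is a small stand-alone sketch. It is one possible scheme, not the one in idCtor.m: the rule shown and the character names in charName are invented for the illustration, so compare against the real output of grobj.ruleNames before relying on it.

```matlab
% Sketch only: one possible mangling, NOT the scheme used by idCtor.m.
% The character names below are invented for illustration.
charName = containers.Map({':', '=', '+', ';'}, {'COLON', 'EQ', 'PLUS', 'SEMI'});

lhs = 'assignment';               % hypothetical phrase name
rhs = {'id', ':=', 'expr'};       % hypothetical definition

name = [lhs '_'];                 % underbar between phrase name and definition
body = [rhs{:}];                  % catenate all characters of the definition
for k = 1:numel(body)
    c = body(k);
    if isletter(c)
        name = [name c];                          % letters pass through unchanged
    elseif isKey(charName, c)
        name = [name charName(c)];                % known non-letter: use its name
    else
        name = [name sprintf('X%d', double(c))];  % fallback: character code
    end
end
disp(name)
```

The point of such a scheme is that the result is a legal MATLAB identifier that depends only on the text of the rule, so it stays stable while other rules are added or removed.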

Adding Rule(s) to X.cfg

The typical change to X.cfg adds a few rules. You might want to look at the warm up exercise for an example. The format of X.cfg is straightforward. It is an everyday grammar.

A new reserved word or operator is added by including it in a right-hand side of some rule. The Cfg object classifies it as an input symbol because it does not appear in the left margin (which would make it a phrase name). You must, of course, keep your Vi and Vn separate.
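
The classification can be sketched with MATLAB set operations. The five rules below are made up for the illustration and are not taken from X.cfg; the variable names are likewise hypothetical, and Cfg.m may compute its fields differently.

```matlab
% Sketch only: split the symbols of a tiny made-up grammar into Vn and Vi.
lhs = {'stmt', 'stmt', 'assignment', 'iteration', 'expr'};
rhs = {{'assignment'}, ...
       {'iteration'}, ...
       {'id', ':=', 'expr'}, ...
       {'while', 'expr', 'do', 'stmt', 'od'}, ...
       {'id'}};

Vn = unique(lhs);             % phrase names: everything in the left margin
allRhs = [rhs{:}];            % every symbol used in some definition
Vi = setdiff(allRhs, Vn);     % input symbols: used on the right, never defined
V  = union(Vn, Vi);           % the whole vocabulary

disp(Vn)
disp(Vi)
```

If a symbol you meant as a phrase name shows up in Vi, its definition is missing or its name is misspelled in the left margin.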

After every change, it is a good idea to make a Cfg object and look at the results to confirm that the changes are what you desire.

Adding Good Rule(s) to X.cfg

Conforming to the layout of X.cfg and satisfying Cfg.m is straightforward. Making a usable cfg takes more skill. Read the advice about language design. Start small.

Adding a Data Structure

The most obvious data structure is a vector. Following the C tradition one might imagine writing

  x := y[i];
  y[i+1] := 13.1;
  y := [1.1,2.0,-rand];
Any of these three lines should cause y to be entered into the symbol table as a real vector.

Adding a Phrase Name for a Class of Symbols (e.g. string)

Suppose you want to implement strings, and need 'xyz' to be a token in the same way that 123 is a token. You use the symbol string in the CFG in one or more rules. If you do nothing else, string will be a reserved word since it does not appear on the left. There is a place in Cfg.m where the reserved word table is built. It has exceptions for id, integer and real. Add string to the exception list.
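
The effect of the exception list can be sketched as follows. The list Vi is made up and the variable names are hypothetical; the actual code in Cfg.m is not reproduced here and may look quite different.

```matlab
% Sketch only: reserved words are the input symbols that are not reserved classes.
% Vi is a made-up list; Cfg.m builds the real one from X.cfg.
Vi = {':=', 'do', 'id', 'integer', 'od', 'real', 'string', 'while'};

classes = {'id', 'integer', 'real'};      % the exceptions named in the text
reservedBefore = setdiff(Vi, classes);    % 'string' is still a reserved word here

classes{end+1} = 'string';                % the change: one more exception
reservedAfter = setdiff(Vi, classes);     % 'string' is now a reserved class

disp(ismember('string', reservedBefore))
disp(ismember('string', reservedAfter))
```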

Note: it is common in grammar input languages to use a special form to indicate reserved classes. For example, <ID> or <STRING>. If you like that convention, you can implement it.

In any case, the lexer will have to be changed to classify and collect the information for the new class of symbols. See the treatment of identifiers for an analogous case.

Making the CFG tables

The CFG tables only need to be made when you change X.cfg. If you are just experimenting and do not want to clobber the tables being used by xcom, run

>> makeCfg X.cfg
and be prepared to wait a minute for the LR tables. If you get an error-free run and you do want to use the tables, run
>> makeCfg -saveMat X.cfg
The newly computed tables will be put into a file cfg.mat. xcom uses cfg.mat. xcom will warn if cfg.mat is out of date with respect to X.cfg.
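
If you want to check by hand whether cfg.mat is stale, comparing file timestamps is one rough way to do it. The sketch below uses MATLAB's dir and assumes both files are in the current folder; how xcom itself decides that cfg.mat is out of date is not described here, so the comparison is an assumption.

```matlab
% Sketch only: compare modification times of X.cfg and cfg.mat.
% This is an assumed meaning of "out of date", not xcom's actual check.
src = dir('X.cfg');       % empty struct array if the file is absent
tbl = dir('cfg.mat');

if isempty(src)
    disp('X.cfg not found in the current folder');
elseif isempty(tbl) || tbl.datenum < src.datenum
    disp('cfg.mat is missing or older than X.cfg; rerun makeCfg -saveMat X.cfg');
else
    disp('cfg.mat is at least as new as X.cfg');
end
```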

You do not have to use the LR tables if you do not want to (see the discussion on bottom up vs. top down parsers), but it is a good idea to make a grammar that is LR(1). If you are determined to press ahead without satisfying the LR(1) constraints, run

>> makeCfg -noLR -saveMat X.cfg

Changing the Recursive Parser

See details here.

LR errors

It can be tedious to get a cfg to conform to the LR(1) constraints. At some point I will add better LR(1) failure diagnostics.