|
File ExecutableGrammar.html Author McKeeman Copyright © 2008 index See also my Stanford Talk. Executable GrammarsExecutable IOGI have defined an IOG (input/output grammar) as an extension of a CFG. The name is chosen to emphasize the symmetry of input (what any CFG normally does) and output, which has been added. It is a central requirment that there are distinct sets of input symbols, output symbols, phrase names and meta symbols. The main point of the following is that an IOG is an executable code. It is obvious that grammar reduction rules can be mechanically applied, and that if the reductions lead to a parse, that parse is correct. What is less obvious is that the grammars defined here can be executed a character at a time. GEM is a "grammar executing machine." It is implemented as a C program, applying an IOG g (analogous to a stored program) to input text i and producing corresponding output text o (if any). o = GEM(i, g) ASCII NotationEach grammar g acceptable to GEM is a sequence of rules each of the form p=α; where p is a letter and α is a seqence of symbols. It is pleasant to read typeset versions of grammars, with different fonts for different kinds of symbols. GEM, on the other hand, requires ASCII input. That in turn requires some kind of convention to distinguish the kinds of symbols. I choose to leave phrase names as raw identifiers, quote input symbols with ', and output symbols with ". The output quotes could have been { and } to highlight the connection to YACC but that choice does not make the grammars easy to read. Anything that is not one of the above is a meta symbol. Restrictions on GEM grammar inputThe input and output symbols are restricted to single characters. Thus '1' signifies that the digit 1 is input, and "+" signifies that the character + is output. Left recursion, direct or indirect, is forbidden because it would cause GEM to overflow its stack. Whitespace cannot be used between symbols in grammar rules. A significant task is to use GEM to remove the restrictions above. Top-down CompilerGiven these restrictions, the IOG is a recursive descent parser in the sense that when GEM:
And, in addition, GEM makes user-selected output. Very Simple Examples
GEM Grammar-grammarsThe next example is a self-describing CFG acceptable as input to GEM (except whitespace is used as a kindness to the reader; you may ignore it). Each grammar rule gives one definition for a phrase name. Symbols = and ; are meta symbols. The order of the rules is significant. By convention, the first rule defines the goal symbol of the grammar. The rule below for r is self-describing. One might read it as: phrase name r is defined as a phrase name l (denoting a letter), followed by input symbol =, followed by phrase name f describing a formula, followed by input symbol ;. The grammar itself is also self-describing: g defines a perhaps empty sequence of rules; f defines a perhaps empty sequence of symbols.
In terms of formalities,
The CFG self can be extended to describe an IOG by adding two rules describing output (look for ").
Since neither of the preceding two CFGs makes output, we expect isempty(GEM(self, self)) to be true. The notable result is that GEM does not crash while working very hard to apply self to self. Predefined Character ClassesAs a practical matter, a little more structure is needed. There are eight boring grammars describing letters, digits, and ascii. There is nothing special about them: they can be appended to any interesting grammar but at considerable cost in verbosity. They are analogous to C functions isdigit, isletter, etc. Four of them are designed to recognize (and ignore) any character of its class. The other four are identical, except they pass the character to the output. Their names are
The effect of each CFG rule is to accept (and discard) a character. The effect of each IOG rule is to pass a character from input to output. D, L, d, and l are defined below. The first rule for D reads "if you see a zero, discard it from the input and send a zero to the output." The classes A and a are analogous, for all of ASCII except null.
ScanOne immediate consequence of the predefined phrase names is a convenience. GEM can be used to remove the (disallowed) whitespace from any input.
Since GEM always does first things first, an input blank is discarded without any corresponding output. A similar statement holds for a newline. Characters protected by quote marks are preserved. Everything else (but unmatched quotes) is passed from the input to the output. The whole grammar asciiIOG is appended (but not displayed here). For example all('xyz' == GEM('x y z', nowhite)) is true.Since the first argument to GEM always needs to be deblanked, is it convenient to define a function ∀t.scan(t)=GEM(t, nowhite) Note that a deblanked version of nowhite must be prepared by hand to start the process off. Here it is: g=pg;g=;p=' ';p=' ';p=IAI;p=OAO;p=A;I='''"'";O='"'""";asciiIOG Pretty PrintingThe ugly deblanked version of a grammar can be restored by a pretty printer.
The grammar above is an example of its own output. Similarly prettynowhite = GEM(nowhite, scan(pretty)) restores the nicely formatted version of nowhite above and leaves it in variable prettynowhite. The pretty printer is also a syntax checker for input acceptable by GEM. Grammar InversionThe input language to GEM is entirely symmetrical in the input and output. Systematically interchanging the input and output quotes turns a parser into an unparser. I suspect the inverter works if and only if the IOG produces a 1-1 mapping. On the other hand, I do not know under what conditions an IOG would produce a 1-1 mapping, so the insight is not immediately useful. The inverter fails to produce a useful grammar when applied to nowhite because where to put blanks back in is ambiguous (nowhite is not 1-1). The inverter applied to itself is still an inverter.
Right-associative Sum and DifferenceUsing right recursion, the following grammar defines right associative expressions. In this case the IOG rule numbers are output; the result of executing GEM is therefore the reduce sequence. The display of sum can use whitespace freely since function scan will remove it.
Running GEM('x+x-x', scan(sum)) gives MATLAB string'4443210' If the inverter is applied to sum, the resulting IOG turns the reduce sequence back into the original source.
inversesum = GEM(scan(sum), scan(invert)); is true. ExpressionsNormal left associative expressions can be described with right recursion. The trick is similar to one used in writing parsers in functional languages such as ML. This grammar turns an expression into postfix PFN. If the rule for parentheses is removed, this grammar is invertible.
For example all('xy3+*x7/-' == GEM(scan(postfix),'x*(y+3)-x/7')) is true. There is a similar, even simpler, grammar to make prefix PFN. The Long March towards more Powerful CompilersOne can extend nowhite to accept multiple-character input and output symbols. The trick is to let nowhite single up the characters so that GEM can deal with them. Here is the new version
Transforming Kleene * to IOGIt is non-intuitive to have to avoid left recursion. One solution is to add the Kleene * to the IOG. The new IOG can parse itself.
GEM knows nothing about the Kleene star, so what must be done to make progress is to transform Kleene-style grammars back to the original IOG form. One way is to replace each r* with a new symbol (say R) and add new rules R = rR; R =; Kleene + can be implemented by transforming a+ into aa*. and (with the proverbial hand-wave) so on. GEM itselfThis section introduces a grammar executing machine (GEM) that can execute the restricted form of IOGs described in the introduction. The trick is backtracking. And the trick of the trick is a symmetrical notation that can be executed forward (Parsing) or backward (Backtracking). The effect is to avoid saving state to carry out the backtracking. Most of the actual parsing time is spent in searching the grammar for the next rule to apply. For small grammars, the inefficiencies are not noticable. Running GEMThe implementation of GEM in mxcom is as an object rather than a function. The reason is to allow the bundling of the GEM grammars and GEM functions in the same M file. The examples above can be duplicated as follows:
>> G = gem(); % build the grammar executing machine object
>> scan = @(txt)G.run(txt, G.nowhite);
>> GEM = @(c,i)G.run(i, scan(c)); % always deblank c
>> G.nowhite
>> GEM('x y z', G.nowhite)
>> scan('x y z')
>> GEM(scan(G.self), G.self)
>> GEM(G.nowhite, G.pretty)
>> pretty = @(txt)GEM(txt, G.pretty);
>> pretty(scan(G.pretty))
>> GEM('x+x-x', G.sum)
>> invert = @(txt)GEM(txt, G.invert);
>> GEM(v, invert(g.sum))
>> GEM('x*(y+3)-x/7', G.postfix)
Using GEM can be frustrating. First off, the input grammars are "touchy." A single character error can cause GEM fits. To help, a trace facility is built in. The flag
>> G.run('x y z', G.nowhite, '-gemTrace')
produces voluminous output of GEM's internal flailing. Second, if the input has an error, all GEM knows how to do is backtrack until it has failed every possibility. Then GEM reports the failure and the high-water mark; not much guidance for the needy user. Third, the C code itself is "touchy." I suppose this is what Einstein meant when he said "Everything should be as simple as possible, but no simpler." GEM ImplementationThe translator of GEM is implemented in a C MEX file. Very little of C is used, primarily increment/decrement, dereference, switch and loops. Once the input, output and grammar strings have been suitably initialized, the machine executes in one of three modes.
While searching:
Final NotesSome closely related work called PEGs. There is some history for all of this. |