File ChangingLexer.html    Author McKeeman    Copyright © 2007

Changing the xcom Lexer

Preliminaries

In MATLAB, in the xcom directory, type the following (or copy it from here and paste it into the MATLAB command line):

>> help Lexer.m               % the Lexer help page
>> xcom -lexDump x:=1         % post-run dump of lexemes from xcom
and
>> load cfg cfg               % prepare to run Lexer standalone
>> EOL = 10;                  % ASCII newline
>> lx = Lexer(cfg, ['x:=1;' EOL 'y:=x']) % 2-line, 10-char program
>> lx.dump()
The variable lx above is a struct containing the fields and methods of the Lexer object; each can be accessed separately, as in
>> [ln,col] = lx.getLineCol(7)

Read the general description of lexing.

Generalities

The definitions of whitespace, identifiers, integer literals, and real literals are common across most programming languages. The X comment is conventional except for the leading character, which can be changed in one spot if desired. Reserved words and operators are tabulated by Cfg.m, so nothing special has to be done for them in Lexer.m.

The Lexer identifies the kind of the next symbol by examining its first characters (usually one is enough). There are only a few cases, each behind an if or elseif in the main loop. The cases are independent. An identifier or operator identifier is first delimited. Then, at the bottom of the loop, the symbol is looked up in the reserved word table prepared by Cfg.m. If the symbol is in the reserved word table, the lexeme is reported as reserved; otherwise it is reported as an id or operator id.
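
The shape of such a loop can be sketched as follows. This is an illustration only, not the code in Lexer.m: the variable names, the kind names, the rule that an identifier is a letter followed by letters and digits, and the small reserved table are all assumptions made for the example. Real literals and comments are left out to keep it short.

```matlab
% Hypothetical sketch, not the actual Lexer.m.
src      = 'if x1 := 42';
reserved = {'if', ':='};             % assumed table; xcom takes its table from Cfg.m
isDigit  = @(ch) ch >= '0' && ch <= '9';
lexemes  = {};                       % text of each symbol
kinds    = {};                       % kind of each symbol
p = 1;
while p <= numel(src)
    c = src(p);                      % the first character selects the case
    q = p;                           % q will index the last character of the symbol
    if isletter(c)                   % identifier (assumed: letter, then letters/digits)
        while q < numel(src) && (isletter(src(q+1)) || isDigit(src(q+1)))
            q = q + 1;
        end
        kind = 'id';
    elseif isDigit(c)                % integer literal: a run of digits
        while q < numel(src) && isDigit(src(q+1))
            q = q + 1;
        end
        kind = 'integer';
    elseif isspace(c)                % whitespace: a run of blanks, tabs, newlines
        while q < numel(src) && isspace(src(q+1))
            q = q + 1;
        end
        kind = 'whitespace';
    else                             % operator identifier: a run of other characters
        while q < numel(src) && ~isletter(src(q+1)) ...
                && ~isDigit(src(q+1)) && ~isspace(src(q+1))
            q = q + 1;
        end
        kind = 'operator';
    end
    text = src(p:q);
    % Bottom of the loop: ids and operator ids are looked up in the table.
    if any(strcmp(kind, {'id', 'operator'})) && any(strcmp(text, reserved))
        kind = 'reserved';
    end
    lexemes{end+1} = text;
    kinds{end+1}   = kind;
    p = q + 1;
end
```

Because each case only delimits its own symbol and the table lookup happens once at the bottom, a new case can be added without touching the others.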

White space

The lexer does nothing special with white space or comments -- it presumes that the consumer of the lexeme stream will discard meaningless tokens. In xcom, the parser has jackets around its access to tokens, allowing it to step over white space and comments.
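
A consumer-side jacket of this kind might look like the sketch below. It assumes the lexeme stream is held as two parallel cell arrays and that the meaningless kinds are named 'whitespace' and 'comment'; neither the representation nor the names are taken from the xcom parser.

```matlab
% Hypothetical lexeme stream: parallel cell arrays of text and kind.
lexemes = {'x', ' ', ':=', ' ', '1'};
kinds   = {'id', 'whitespace', 'reserved', 'whitespace', 'integer'};
skip    = {'whitespace', 'comment'};   % kinds the consumer chooses to ignore

p    = 0;                              % index of the current lexeme
seen = {};                             % what the consumer actually looks at
while true
    p = p + 1;                         % advance to the next lexeme...
    while p <= numel(kinds) && any(strcmp(kinds{p}, skip))
        p = p + 1;                     % ...stepping over meaningless ones
    end
    if p > numel(kinds)
        break;                         % end of the stream
    end
    seen{end+1} = lexemes{p};
end
```

The lexeme stream itself is left complete; only the consumer's view of it is filtered.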

Adding a Class of Symbols

The four classes of symbols (id, integer, real, whitespace) in X require explicit code both in Cfg.m and Lexer.m. The class names that appear in the grammar are technically reserved words (not defined by any rule), so Cfg.m unreserves id, integer and real. Lexer.m must have a compensating piece of code to construct tokens for each such class of symbols.
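
The unreserving step amounts to a set difference. The sketch below shows the idea under the assumption that the undefined grammar symbols are available as a cell array; the variable names and the sample symbols are hypothetical and do not reflect how Cfg.m stores its tables.

```matlab
% Hypothetical: symbols used in the grammar but not defined by any rule.
terminals = {'if', 'then', ':=', ';', 'id', 'integer', 'real'};
% Symbol classes that the lexer recognizes by shape rather than by lookup.
classes   = {'id', 'integer', 'real'};
% What remains is the reserved word table; 'stable' keeps the original order.
reserved  = setdiff(terminals, classes, 'stable');
```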

To add another such symbol (say string), both Cfg.m and Lexer.m need to handle the additional case. The change in Cfg.m is trivial. In the Lexer another "elseif" has to be added in the main loop, looking for the initial character of the string. The string delimiters are, of course, part of the string, so as to preserve the invariant that catenation of the sequence of lexemes yields the original source text.
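
A string case could take roughly the following form. This is a sketch under stated assumptions: the double quote as delimiter, the kind name 'string', the absence of escape sequences, and the one-character stand-in for all the existing cases are choices made for the example, not features of X or of Lexer.m.

```matlab
% Hypothetical sketch of a string case in the main loop.
src     = 's:="a b";';
QUOTE   = '"';                         % assumed string delimiter
lexemes = {};
kinds   = {};
p = 1;
while p <= numel(src)
    c = src(p);
    q = p;
    if c == QUOTE                      % the new case: first character is a quote
        off = find(src(p+1:end) == QUOTE, 1);   % distance to the closing quote
        if isempty(off)
            error('unterminated string starting at position %d', p);
        end
        q = p + off;                   % the closing quote belongs to the lexeme
        kind = 'string';
    else
        kind = 'other';                % stand-in for the existing cases (one char each)
    end
    lexemes{end+1} = src(p:q);
    kinds{end+1}   = kind;
    p = q + 1;
end
% The invariant: catenating the lexemes reproduces the source text.
assert(strcmp([lexemes{:}], src));
```

Keeping both quotes inside the lexeme is what makes the final assertion hold; a consumer that wants only the string body can strip the delimiters itself.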