A Simple Compiler in MATLAB
by Bill McKeeman
Updated December 28, 2005
A Simple Compiler in M
Contents
Generalities
Complicated input (such as C or M or XML) is usually best processed using grammar-based tools and algorithms. The result is a fairly regular and maintainable input program and also an internal representation of the input upon which to base subsequent processing. The prototypical example is translating a programming language (such as M) into an executable form (such as pcode or hardware opcodes).
The following code was first presented in the 10-lecture in-house compiler course at MathWorks in 2000.
Grammar
Compilers are always designed around a grammar describing the input language. Here is the grammar for the simple compiler. It describes a sequence of assignment statements. The assignments may contain arithmetic operators, parentheses, variables and numbers.
rule numbers program stmts 1 stmts stmt 1 stmts stmt 2 stmt assign ; 1 assign var = expr 1 var id 1 expr term 1 expr + term 2 expr - term 3 term factor 1 term * factor 2 term / factor 3 term \ factor 4 factor ( expr ) 1 var 2 int 3
Front End
The input is traditionally separated into two phases, lexical analysis and parsing. The lexer analyzes the word and punctuation structure. Significant items, called tokens, are sent to the parser. Other items, such as white space and comments, are discarded.
The parser applies grammar rules and produces a so-called shift-reduce sequence. This comes about as tokens are pushed onto the parser stack (called shifts) and, when the top of the stack contains the the right-hand-side of a grammar rule, the rhs is replaced with the name of the grammar rule (called a reduction). When the input has all been consumed and when parse stack contains the just grammar goal symbol, the parse succeeds.
Except for the discarded lexical information, no information has been lost. That is, the input can be recreated from the shift-reduce sequence.
There are two competing ways of implementing parsers (called bottom-up and top-down). This presentation is about top-down parsers.
Lexer
The two front end components can be tested separately. Function lexer.m is called by lexer_test. The input is an arbitrary test text.
lexer_test
lexing a b1 Cc d_1 0 12 12c ()*/\+-=;
a token code: 2
b1 token code: 2
Cc token code: 2
d_1 token code: 2
0 token code: 3
12 token code: 3
12 token code: 3
c token code: 2
( token code: 4
) token code: 5
* token code: 6
/ token code: 7
\ token code: 8
+ token code: 9
- token code: 10
= token code: 11
; token code: 12
token code: 1
The lexer code itself is
dbtype lexer
1 % FILE: lexer.m
2 % PURPOSE: Simple lexer
3 % CALL: [tcode, tstart, tend] = lexer(src)
4 % SIG: [[uint8], [uint16], [uint16]] = lexer([char])
5 % Accepts M-like source text and produces three arrays
6 % tcode(i) is a unique code for the i-th token.
7 % src(tstart(i):tend(i)) is the text of the i-th token.
8
9 function [tcode, tstart, tend] = lexer(src)
10
11 % Initialize character and token tables.
12 TOK = enum(tokens);
13
14 EOF = uint8(128);
15 isdigit(1:EOF) = false; isdigit('0':'9') = true;
16 isalnum = isdigit; isalnum('a':'z') = true; isalnum('A':'Z') = true;
17
18 ch2code(EOF) = 0; % allocate fast table
19 ch2code('=') = TOK.eq; % for single char tokens
20 ch2code('+') = TOK.add;
21 ch2code('-') = TOK.sub;
22 ch2code('*') = TOK.mul;
23 ch2code('/') = TOK.div;
24 ch2code('\') = TOK.vid;
25 ch2code('(') = TOK.lp;
26 ch2code(')') = TOK.rp;
27 ch2code(';') = TOK.semi;
28
29 % Allocate storage for the token sequences
30 len = uint16(length(src)); % more than enough
31 tcode = zeros(len,1, 'uint8'); % preallocate
32 tstart = zeros(len,1, 'uint16');
33 tend = zeros(len,1, 'uint16');
34
35 ti = uint16(0); % token index
36 ci = uint16(1); % character index
37 src = [src EOF]; % append eof
38
39 % Lex the input
40 while true
41 st = ci;
42 ch = src(ci);
43
44 if ch<1 | ch>EOF
45 error(['unexpected input character ' num2str(ch)]);
46 elseif isspace(ch) % blank
47 ci = ci+1;
48 continue; % ignore whitespace
49 elseif isletter(ch)
50 while isalnum(ch) || ch == '_' % identifier
51 ci = ci+1;
52 ch = src(ci);
53 end
54 tok = TOK.id;
55 elseif isdigit(ch) % number
56 while isdigit(ch)
57 ci = ci+1;
58 ch = src(ci);
59 end
60 tok = TOK.int;
61 elseif ch2code(ch) ~= 0 % single char tokens
62 ci = ci+1;
63 tok = ch2code(ch);
64 elseif ch == EOF % eof
65 tok = TOK.eof;
66 else
67 error(['unexpected input character ''' ch '''']);
68 end
69
70 if ti == numel(tcode) % need more storage
71 tcode(end+end) = 0; % double it
72 tstart(end+end) = 0;
73 tend(end+end) = 0;
74 end
75 ti = ti+1; % make room
76 tcode(ti) = tok; % record token info
77 tstart(ti) = st;
78 tend(ti) = ci-1; % ci starts next token
79 if tok == TOK.eof, break; end % quit
80 end
81
82 % Clean up
83 tcode = tcode(1:ti); % trim results
84 tstart = tstart(1:ti);
85 tend = tend(1:ti);
86
Helper Functions
The lexer depends on two simple helper functions that insure that the lexer and parser use the same tables.
dbtype tokens dbtype enum
1 % FILE: tokens.m
2 % PURPOSE: List of token names for simple lexer
3
4 function names = tokens
5 names = {
6 'eof', ...
7 'id', ... xyz
8 'int', ... 123
9 'lp', ... (
10 'rp', ... )
11 'mul', ... *
12 'div', ... /
13 'vid', ... \
14 'add', ... +
15 'sub', ... -
16 'eq', ... =
17 'semi' ... ;
18 };
1 % FILE: enum.m
2 % PURPOSE: Implement M version of enum
3 % METHOD: Fill struct with names for 1:n
4 % USAGE: >> E = enum({'a', 'b', ...});
5 % >> if E.a == ...
6
7 function res = enum(names)
8 res = struct;
9 for i =1:numel(names)
10 res.(names{i}) = uint8(i); % only 255 of them
11 end
enum(tokens)
ans =
eof: 1
id: 2
int: 3
lp: 4
rp: 5
mul: 6
div: 7
vid: 8
add: 9
sub: 10
eq: 11
semi: 12
Parser
Function parser.m is called by parser_test. In this case the input is syntactically correct. The parser calls the lexer for each token.
parser_test
parsing a=1000; b=(a+((1))); longid=1+2-3*4/5\6;
a=1000; b=(a+((1))); longid=1+2-3*4/5\6;
shift 'a'
rule 7
shift '='
shift '1000'
rule 17
rule 11
rule 8
shift ';'
rule 6
rule 5
rule 3
shift 'b'
rule 7
shift '='
shift '('
shift 'a'
rule 7
rule 16
rule 11
rule 8
shift '+'
shift '('
shift '('
shift '1'
rule 17
rule 11
rule 8
shift ')'
rule 15
rule 11
rule 8
shift ')'
rule 15
rule 11
rule 9
shift ')'
rule 15
rule 11
rule 8
shift ';'
rule 6
rule 5
rule 4
shift 'longid'
rule 7
shift '='
shift '1'
rule 17
rule 11
rule 8
shift '+'
shift '2'
rule 17
rule 11
rule 9
shift '-'
shift '3'
rule 17
rule 11
shift '*'
shift '4'
rule 17
rule 12
shift '/'
shift '5'
rule 17
rule 13
shift '\'
shift '6'
rule 17
rule 14
rule 10
shift ';'
rule 6
rule 5
rule 4
rule 2
The parser code relies on the ability to nest functions in M. The recursive parse routines must share state. GLOBAL could be used but the local variables of the parser are a much more elegant solution, thanks to Mike Karr.
dbtype parser
1 % FILE: parser.m
2 % PURPOSE: Example recursive M subset parser in M
3 % METHOD: Recursive descent
4 %
5 % GRAMMAR: M subset
6 % rule numbers
7 % program
8 % stmts 1
9 % stmts
10 % stmt 1
11 % stmts stmt 2
12 % stmt
13 % assign ; 1
14 % assign
15 % var = expr 1
16 % var
17 % id 1
18 % expr
19 % term 1
20 % expr + term 2
21 % expr - term 3
22 % term
23 % factor 1
24 % term * factor 2
25 % term / factor 3
26 % term \ factor 4
27 % factor
28 % ( expr ) 1
29 % var 2
30 % int 3
31
32 function sr = parser(tc)
33 % sr: shift-reduce sequence
34 % tc: token code sequence
35
36 % Establish nested variables
37 TOK = enum(tokens); % enumerations
38 RULE = enum(rules);
39 toki = uint16(1); % token index
40 sr = ones(100,1,'uint8'); % preallocate sr sequence
41 sri = uint16(0); % sr index
42
43 program; % parse input (this is IT)
44
45 sr = sr(1:sri); % trim exactly
46
47
48 % ----------- nested recursive function definitions ----------------
49
50 function program % program
51 stmts;
52 reduce(RULE.program1);
53 end
54
55 function stmts % stmts
56 stmt;
57 reduce(RULE.stmts1);
58 while tc(toki) ~= TOK.eof
59 stmt;
60 reduce(RULE.stmts2);
61 end
62 end
63
64 function stmt % stmt
65 assign;
66 reduce(RULE.stmt1);
67 end
68
69 function assign % assign
70 var;
71 if tc(toki) == TOK.eq
72 shift; % discard =
73 expr;
74 else
75 parseerror('expected =, ', tc(toki));
76 end
77 if tc(toki) == TOK.semi
78 shift; % discard ;
79 else
80 parseerror('expected ;, ', tc(toki));
81 end
82 reduce(RULE.assign1);
83 end
84
85 function var % var
86 if tc(toki) == TOK.id
87 shift; % record id
88 reduce(RULE.var1);
89 else
90 parseerror('expected var, ', tc(toki));
91 end
92 end
93
94 function expr % expr
95 term;
96 reduce(RULE.expr1);
97 u = tc(toki);
98 while u == TOK.add || u == TOK.sub
99 shift; % discard + or -
100 r = RULE.expr2; % usual case
101 if u == TOK.sub, r = RULE.expr3; end
102 term;
103 reduce(r);
104 u = tc(toki);
105 end
106 end
107
108 function term % term
109 factor;
110 reduce(RULE.term1);
111 t = tc(toki);
112 while t == TOK.mul || t == TOK.div || t == TOK.vid
113 shift; % discard * / \
114 r = RULE.term2; % usual case
115 if t == TOK.div, r = RULE.term3; end
116 if t == TOK.vid, r = RULE.term4; end
117 factor;
118 reduce(r);
119 t = tc(toki);
120 end
121 end
122
123 function factor % factor
124 w = tc(toki);
125 if w == TOK.lp
126 shift; % discard (
127 expr;
128 if tc(toki) == TOK.rp
129 shift; % discard )
130 else
131 parseerror('expected ), ', tc(toki));
132 end
133 reduce(RULE.factor1);
134 elseif w == TOK.id
135 var;
136 reduce(RULE.factor2);
137 elseif w == TOK.int
138 shift; % record int
139 reduce(RULE.factor3);
140 else
141 parseerror('unexpected token, ', w);
142 end
143 end
144
145 function shift % shift
146 appd(RULE.shift); % special shift marker
147 appd(toki); % index of token
148 toki = toki+1; % discard token
149 end
150
151 function reduce(rule) % reduce
152 appd(rule);
153 end
154
155 function appd(data) % append to sr sequence
156 if sri >= length(sr)
157 sr(end+end) = 0; % double allocation
158 end
159 sri = sri+1; % make room
160 sr(sri) = data; % place data
161 end
162
163 function parseerror(msg, found)
164 disp('***Parse error, sr sequence:');
165 rulenames = rules;
166 tokennames = tokens;
167 i = 1;
168 while true % dump sr sequence so far
169 if i >= sri, break; end
170 t = sr(i);
171 if t == RULE.shift % special shift marker
172 i = i+1; % move to token index
173 disp([' shift ' tokennames{tc(sr(i))}]);
174 else
175 disp([' reduce ' rulenames{t}]);
176 end
177 i = i+1;
178 end
179
180 error([msg 'found ' tokennames{found}]);
181 end
182
183 end % of parse
184
More Helper Functions
Like the lexer, the parser needs a helper function to pass information on to the next stage.
dbtype rules
1 % FILE: rules.m
2 % PURPOSE: Rule names for simple parser
3
4 function names = rules
5 names = { ...
6 'shift', ...
7 'program1', 'stmts1', 'stmts2', ...
8 'stmt1', 'assign1', 'var1', ...
9 'expr1', 'expr2', 'expr3', ...
10 'term1', 'term2', 'term3', 'term4', ...
11 'factor1', 'factor2', 'factor3'};
enum(rules)
ans =
shift: 1
program1: 2
stmts1: 3
stmts2: 4
stmt1: 5
assign1: 6
var1: 7
expr1: 8
expr2: 9
expr3: 10
term1: 11
term2: 12
term3: 13
term4: 14
factor1: 15
factor2: 16
factor3: 17
Back End
It is somewhat surprising that the shift/reduce sequence contains the Reverse Polish (pcode) sequence. It is a byproduct of the way grammars are written (left to right association). The generator depends on this, merely needing to extract the sequence and reformat as Intel assembly. The unit test for the generator gives the following outout.
gen_test
a=1000; b=(a+((1))); longid=1+2-3*4/5\6; ENTRY: fld =1000 fstp a fld a fld =1 fadd fstp b fld =1 fld =2 fadd fld =3 fld =4 fmul fld =5 fdiv fld =6 fdivr fsub fstp longid
Here is the generator.
dbtype gen
1 % FILE: gen.m
2 % PURPOSE: Sample x86 code generator for type double
3 % CALL: asm = gen(src, ts, te, sr)
4 % SIG: [char] = gen([char], [uint16], [uint16], [uint16]))
5 % Accepts source text, token markers and shift/reduce sequence
6 % Produces assembly level x86 code
7 % USAGE: See gen_test.m for example of use
8
9 function asm = gen(src, tc, ts, te, sr)
10
11 TOK = enum(tokens); % enumerations
12 RULE = enum(rules);
13
14 sri = uint16(1); % shift/reduce index
15 symi = uint16(0); % symbol stack index
16 EOL = 10; % ASCII newline character
17 asm = ['ENTRY:' EOL]; % the assembly language
18
19 while(true)
20 switch sr(sri) % rule number
21 case RULE.shift
22 sri = sri+1; % move to token val
23 t = tc(sr(sri)); % token code
24 if t == TOK.id || t == TOK.int % significant?
25 symi = symi+1;
26 syms(symi) = sr(sri); % stack token
27 end
28
29 case RULE.program1 % quit
30 break;
31
32 case RULE.stmts1
33 case RULE.stmts2
34 case RULE.stmt1
35
36 case RULE.assign1 % x=exp;
37 v = syms(symi);
38 symi = symi-1;
39 asm = [asm ' fstp ' src(ts(v):te(v)) EOL];
40 case RULE.var1
41
42 case RULE.expr1
43
44 case RULE.expr2 % a+b
45 asm = [asm ' fadd' EOL];
46 case RULE.expr3 % a-b
47 asm = [asm ' fsub' EOL];
48 case RULE.term1
49
50 case RULE.term2 % a*b
51 asm = [asm ' fmul' EOL];
52 case RULE.term3 % a/b
53 asm = [asm ' fdiv' EOL];
54 case RULE.term4 % a\b
55 asm = [asm ' fdivr' EOL];
56 case RULE.factor1
57
58 case RULE.factor2 % abc
59 v = syms(symi);
60 symi = symi-1;
61 asm = [asm ' fld ' src(ts(v):te(v)) EOL];
62 case RULE.factor3 % 123
63 v = syms(symi);
64 symi = symi-1;
65 asm = [asm ' fld =' src(ts(v):te(v)) EOL];
66 otherwise
67 error('unknown rule');
68 end
69 sri = sri + 1; % next shift/reduce
70 end
71
Bottom Line
That's all folks. You could grab these three modules and their helpers and write your own translator today.
bill = imread('bill.jpg'); image(bill); set(gca, 'outerposition', [0,0,.3,.5]) axis equal; axis off