|
This is an answer to one of the regular expression exercises. The C comment, /*...*/ can be described by a regular expression. Do so. Prove that your solution describes all, and only, C comments. Don't peek unless you are done working on the problem!| V | V | V | V | V | V | V | V | V | V | V | V Answer by McKeemanFrom the C standard 6.4.9...Except within a character constant, a string literal, or a comment, the characters /* introduce a comment. The contents of a comment are examined only to identify multibyte characters and to find the characters */ that terminate it. One must be a bit of a C lawyer here. /* introduce a comment. OK, then I infer the /* are not part of the contents of the comment. On the other hand, the terminating */ are a part of the contents because they are found when examining the contents. Because the definition would make no sense otherwise, I also infer that there are no earlier terminating */ in the contents. It is this sloppily worded definition which we are asked to implement and prove. We take advantage of the ill-definition by merely requiring our proofs to be convincing. First, some simplifications. The * in C gets tangled up with the Kleene star of regular expressions and except for / and *, all the other characters in a C comment are equivalent. Therefore we introduce a 3-element universe of
U: / x .
where / stands for itself, x stands for the ASCII
*, and . stands for all the other characters in C.
For example
/x...x/
is a C comment in our artificial universe.
The rest of ASCII can now be used for regular expression meta characters.
Suggestion 1 (by Doug McIlroy) /x~(U*x/U*)x/This can be read as "a C comment is /x followed by anything but a string that contains x/, followed by x/". This suggestion not exactly an answer because "not such and such a string" is not a regular expression. But it is elegant, obviously correct, and leads to other solutions. Suggestion 2 (DFA) The spine of the diagram points down. The transitions on the right point up. State 4 is final. The table of triples gives the same information; it is actually a CFG for a DFA.
go in see
0 1 0 /
/| 2 1 x
1 2 2 /
x| 2 2 .
+--+--+--+ 3 2 x
2 /| .| | 3 3 x
+--+--+ | 2 3 .
x| .| 4 3 /
+--+ |
3 x| |
+--+-----+
/|
4
One reads the diagram as "in state 0, if you see /, eat it and transition to state 1..." This suggestion is not exactly an answer either since a DFA is only equivalent to a regular expression. Suggestion 3 (regular grammar) Using the everyday notation for grammars, we can describe the DFA with rules that have exactly 2 or zero symbols on the RHS. The rules can only be applied to a string that already contains a nonterminal (e.g. a state). So there is a special rule for 0 that is created out of nothing to get things started. 0 1 0 / 2 1 x 2 / 2 . 3 . 3 2 x 3 x 4 3 / [Note to the student] Parse various candidates for comment and non-comment to check the grammatical interpretation of regular expressions. For example /x/ /xx/ /x/x/Here is the simplest parse: /xx/ 0/xx/ 1xx/ 2x/ 3/ 4 Suggestion 4 (back substitute and simplify the regular grammar) Two everyday rules of the form n y n xcan be rewritten n y x*Continuing in this way you get 4 /x(/|.|x+.)*x+/ which, as it turns out, is the best known regular expression describing C comments (first shown to me by Steve Johnson). So, what can be proven? Lemma: the strings described by regular expression 4 start with /x and end with a different x/ (eliminating bogus solution /x/). This follows directly from the meaning of the regular expression.
Lemma: the sequence x/ never occurs in ( / | . | x+ . )* Lemma: the sequence x/ occurs exactly once in the contents ( / | . | x+ . )* x+ /Proof: The end of the contents is ( / | . ) x+ / and the only place x precedes / is at the end. OK, we know that the /x and x/ behave. But are all legal comments permitted? That is the implication of the McIlroy suggestion. Try induction. Consider the set of strings /x alpha x/ where alpha is of length n. We can assume by induction that the set of strings alpha contains every string of length n not containing the pair x/ Extend each alpha by . There are two cases: alpha ends in x or it does not. Assume not. The extended string is recognized by the regular expression since one can add dots one at a time until the final string of x-s. Assume so. Then appending the dot creates a new x+. which moves inside the Kleene star. So much for adding a dot. Now consider / and x. Wasn't that fun? |