This is an answer to one of the regular expression exercises.

The C comment, /*...*/ can be described by a regular expression. Do so. Prove that your solution describes all, and only, C comments.

Don't peek unless you are done working on the problem!

|
V
|
V
|
V
|
V
|
V
|
V
|
V
|
V
|
V
|
V
|
V
|
V

Answer by McKeeman

From the C standard 6.4.9...Except within a character constant, a string literal, or a comment, the characters /* introduce a comment. The contents of a comment are examined only to identify multibyte characters and to find the characters */ that terminate it.

One must be a bit of a C lawyer here. /* introduce a comment. OK, then I infer the /* are not part of the contents of the comment. On the other hand, the terminating */ are  a part of the contents because they are found when examining the contents. Because the definition would make no sense otherwise, I also infer that there are no earlier terminating */ in the contents. It is this sloppily worded definition which we are asked to implement and prove. We take advantage of the ill-definition by merely requiring our proofs to be convincing.

First, some simplifications. The * in C gets tangled up with the Kleene star of regular expressions and except for / and *, all the other characters in a C comment are equivalent. Therefore we introduce a 3-element universe of

    U: / x .
  
where / stands for itself, x stands for the ASCII *, and . stands for all the other characters in C. For example
    /x...x/
  
is a C comment in our artificial universe. The rest of ASCII can now be used for regular expression meta characters.

Suggestion 1 (by Doug McIlroy)

  /x~(U*x/U*)x/
This can be read as "a C comment is /x followed by anything but a string that contains x/, followed by x/". This suggestion not exactly an answer because "not such and such a string" is not a regular expression. But it is elegant, obviously correct, and leads to other solutions.

Suggestion 2 (DFA)

The spine of the diagram points down. The transitions on the right point up. State 4 is final. The table of triples gives the same information; it is actually a CFG for a DFA.

                                   go in see 
    0                               1  0 /
   /|                               2  1 x
    1                               2  2 /
   x|                               2  2 .
    +--+--+--+                      3  2 x
    2 /| .|  |                      3  3 x
    +--+--+  |                      2  3 .
   x|       .|                      4  3 /
    +--+     |
    3 x|     |
    +--+-----+
   /|
    4

One reads the diagram as "in state 0, if you see /, eat it and transition to state 1..." This suggestion is not exactly an answer either since a DFA is only equivalent to a regular expression.

Suggestion 3 (regular grammar)

Using the everyday notation for grammars, we can describe the DFA with rules that have exactly 2 or zero symbols on the RHS. The rules can only be applied to a string that already contains a nonterminal (e.g. a state). So there is a special rule for 0 that is created out of nothing to get things started.

0
1
  0 /
2
  1 x
  2 /
  2 .
  3 . 
3
  2 x
  3 x
  4
  3 /

[Note to the student] Parse various candidates for comment and non-comment to check the grammatical interpretation of regular expressions. For example

/x/
/xx/
/x/x/
Here is the simplest parse:
/xx/
0/xx/
1xx/
2x/
3/
4

Suggestion 4 (back substitute and simplify the regular grammar)

Two everyday rules of the form

n
  y
  n x
can be rewritten
n
  y x*
Continuing in this way you get
4
  /x(/|.|x+.)*x+/

which, as it turns out, is the best known regular expression describing C comments (first shown to me by Steve Johnson).

So, what can be proven?

Lemma: the strings described by regular expression 4 start with /x and end with a different x/ (eliminating bogus solution /x/). This follows directly from the meaning of the regular expression.

Lemma: the sequence x/ never occurs in ( / | . | x+ . )*
Proof: x is always followed by x or .

Lemma: the sequence x/ occurs exactly once in the contents

  ( / | . | x+ . )*  x+ /
Proof: The end of the contents is ( / | . ) x+ / and the only place x precedes / is at the end.

OK, we know that the /x and x/ behave. But are all legal comments permitted? That is the implication of the McIlroy suggestion.

Try induction. Consider the set of strings /x alpha x/ where alpha is of length n. We can assume by induction that the set of strings alpha contains every string of length n not containing the pair

x/

Extend each alpha by

.

There are two cases: alpha ends in x or it does not. Assume not. The extended string is recognized by the regular expression since one can add dots one at a time until the final string of x-s. Assume so. Then appending the dot creates a new x+. which moves inside the Kleene star.

So much for adding a dot. Now consider / and x.

Wasn't that fun?