File CodeOptimization.html    Author McKeeman    Copyright © 2007

Code Optimization

The "As If" Rule

When a compiler, for reasons of efficiency, does things differently than one would expect from a literal reading of the original program, the compiled program is required to get the same results as if nothing had been changed. The outside observer must not be able to detect the difference, except by improved performance.

The "as if" rule is sometimes relaxed for floating point computations. Reordering operands in an expression can affect the roundoff. It is quite easy to write an expression in C that will evaluate to NaN with one order of evaluation, and zero with a different order. The question is whether the user wants exact semantics or faster execution. The compiler needs a user-settable flag.

There is no code optimization in xcom, so there are no examples to examine. In this short treatment, generalities will have to suffice.

Semantic Specification

Making a formal statement about the "as if" rule depends on having a specification of semantics. The usual form of such a specification is the combination of a simple translation/abstract machine pair. The idea is to make the meaning immediately obvious.

In the case of X, the simple translation would target the syntax tree and the abstract machine would be a tree-walking interpreter. If such an abstraction were implemented, it would provide an executable way to test xcom for any input program.
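
To suggest the shape of such a reference interpreter, here is a tree walker for integer expressions in C (a sketch only; the node layout is invented for the example and is not xcom's).

    #include <stdio.h>

    /* The "simple translation" is the syntax tree itself; the "abstract
       machine" is eval(), which walks the tree and returns its value.   */

    enum kind { CONST, ADD, MUL };

    struct node {
        enum kind kind;
        int value;                    /* used when kind == CONST */
        struct node *left, *right;    /* used otherwise          */
    };

    static int eval(const struct node *n) {
        switch (n->kind) {
        case CONST: return n->value;
        case ADD:   return eval(n->left) + eval(n->right);
        case MUL:   return eval(n->left) * eval(n->right);
        }
        return 0;                     /* not reached */
    }

    int main(void) {
        /* the tree for (1 + 2) * 4 */
        struct node one  = { CONST, 1, 0, 0 };
        struct node two  = { CONST, 2, 0, 0 };
        struct node four = { CONST, 4, 0, 0 };
        struct node sum  = { ADD,   0, &one, &two };
        struct node prod = { MUL,   0, &sum, &four };

        printf("%d\n", eval(&prod));  /* prints 12 */
        return 0;
    }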

The Compile Time/Execution Time Tradeoff

Making the executable run faster is not always a good idea. The decision depends on how many times the executable is going to be run and who is waiting while the compiler is working hard to optimize. In the case of a JIT, expensive optimization is rarely justified because compile time is user time. In the case of off-line compilation, such as when using GNU C, more optimization can be included in the default behavior. Compilers almost always supply flags that let the user choose the level of optimization, partly to avoid waiting for long compiles. Usually the debugger does better with unoptimized code, so the user turns optimization off during development. Because xcom itself is meant to be read as an example, xcom does no optimization, thereby simplifying the learning experience.

Close in Time -- Close in Space

The performance bottleneck is almost always access to main memory (sometimes called RAM, for Random Access Memory), and nowadays the more random the access, the slower the code. A cache miss is many times more expensive than executing an instruction. Modern optimization is therefore more about keeping accesses in the cache(s) than about avoiding a few particularly slow instructions. Following randomly distributed memory links is the worst case. Another bad case is a long sequence of calls, each of which is "somewhere else" in the metrics of the cache.

The local variables of a compiled function are usually kept in a frame: a contiguous block of memory accessed via a stack pointer (register). Based on the assumption that the locals will be frequently accessed in a short time interval, the frame should be kept compact, and therefore likely to be held in one or a few cache lines.
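
The following C fragment makes the frame idea concrete (a sketch only; real frames also hold saved registers and return addresses, and the exact layout is the compiler's choice).

    #include <stdio.h>

    /* A hypothetical frame for a function with three int locals.  The
       compiler lays the locals out contiguously and addresses each one
       at a small fixed offset from the frame (stack) pointer.           */
    struct frame {
        int x;        /* frame pointer + 0 */
        int y;        /* frame pointer + 4 */
        int z;        /* frame pointer + 8 */
    };

    int main(void) {
        /* The whole frame is much smaller than a typical 64-byte cache
           line, so one miss brings every local into the cache at once.  */
        printf("frame size: %zu bytes\n", sizeof(struct frame));
        return 0;
    }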

Local Optimization -- Constant Folding

An obvious optimization is constant folding. A programmer seeing

     x := y + 1 + 2 + z;

feels the urge to rewrite it as

     x := y + 3 + z;

The original programmer may have had a good reason to write 1+2, expressed in nearby comments describing the reason for constants 1 and 2. That is to say, what is optimal for computer performance may not be optimal for the human writer or reader.

It is required that the compiler, adding 1 and 2 at compile time, use exactly the same arithmetic that the hardware would have used later. Ensuring consistency takes extra care when the compiler and the executable might run on different machines.

In fact, the rewrite above does not follow the formal semantics. Because the X CFG says '+' is left associative, the meaning is

    x := ((y+1)+2)+z;

The optimizer "knows" that 2's complement addition is computationally associative, so it can go ahead with the constant folding based on the "as if" rule.

It is often the case in complex programs that the programmer cannot eliminate an actually constant expression by hand because the language does not permit enough control over the details. The expectation is that the compiler will take care of the details for us, under the covers, and without any special effort on the part of the programmer.

Constant folding is often implemented by rewriting the intermediate representation. Using the syntax tree for an intermediate representation is not efficient if tree transformations are expected. Instead a more compact abstract syntax tree (AST) is constructed by the parser and used in the downstream processing.

The AST for x:=((y+1)+2)+z;

        :=
       /  \
      x    +
          / \
         +   z
        / \
       +   2
      / \
     y   1
      
    

is transformed into the AST for x:=(y+(1+2))+z;

       :=
      /  \
     x    +
         / \
        +   z
       / \
      y   +
         / \
        1   2
    

and then the node for (1+2) is nipped out and replaced by a node for 3.

      :=
     /  \
    x    +
        / \
       +   z
      / \
     y   3
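
As a sketch of how a constant folder can be implemented (an illustration in C, not xcom code), the pass below rewrites such a tree bottom-up, replacing any '+' node whose children are both constants with a single constant node, doing the addition at compile time in the same 32-bit arithmetic the target would use.

    #include <stdio.h>
    #include <stdint.h>

    enum kind { CONST, VAR, ADD };

    struct node {
        enum kind kind;
        int32_t value;                   /* CONST */
        char name;                       /* VAR   */
        struct node *left, *right;       /* ADD   */
    };

    /* fold the tree bottom-up */
    static void fold(struct node *n) {
        if (n->kind != ADD)
            return;
        fold(n->left);
        fold(n->right);
        if (n->left->kind == CONST && n->right->kind == CONST) {
            n->value = n->left->value + n->right->value;  /* add now, not later */
            n->kind  = CONST;                             /* nip out the '+'    */
        }
    }

    int main(void) {
        /* the reassociated tree for x := (y+(1+2))+z */
        struct node one = { CONST, 1, 0, 0, 0 }, two = { CONST, 2, 0, 0, 0 };
        struct node y   = { VAR,   0, 'y', 0, 0 }, z = { VAR, 0, 'z', 0, 0 };
        struct node c   = { ADD, 0, 0, &one, &two };     /* 1 + 2     */
        struct node b   = { ADD, 0, 0, &y, &c };         /* y + (1+2) */
        struct node a   = { ADD, 0, 0, &b, &z };         /* ... + z   */

        fold(&a);
        printf("folded constant: %d\n", (int)b.right->value);  /* prints 3 */
        return 0;
    }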
    

Local Optimization -- Identities

Consider these two assignments, where expr stands for some expression.

     x := expr^2;
     x := expr*expr;

Which one should the programmer write? What should the compiler do? If the expression does not have any side effects (such as calling rand), the two lines of code express the same meaning. The compiler should "do the right thing", meaning make the same machine language regardless of what is written.

If the expression is large, computing it only once is more efficient. Once computed, it can be multiplied by itself. The optimization is obvious and easy with the first line of code. The second line requires common subexpression elimination (see below).

Suppose instead the choice were between the lines

     x := rand^2;
     x := rand*rand;

The optimizer no longer has a choice since rand has a side effect. What the programmer wrote must be literally obeyed.

Many compilers have a long list of identities which are used to find the simplest form for translation, depending on what the compiler does well. Here is a small sample. The transformations can be applied recursively (for example, x+0=x rewrites to x=x, which rewrites to true) so long as no side effects are lost.

    source text         replacement
    x | x               x
    x | true            true
    x | false           x
    x & x               x
    x & true            x
    x & false           false
    x < x               false
    x <= x              true
    x = x               true
    x + 0               x
    x * 1               x
    x * x               x ^ 2
    if true ? S fi      S
    if false ? S fi     (nothing)

xcom could apply these transformations, but only after the type analysis. For instance, xcom infers x is of type int from the expression x+0.
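
A recursive simplifier for such a table of identities can be quite small. Here is a sketch in C (an illustration only; it applies just two of the rules above, to C-style ints rather than to X).

    #include <stdio.h>

    enum kind { CONST, VAR, ADD, MUL };

    struct node {
        enum kind kind;
        int value;                     /* CONST    */
        struct node *left, *right;     /* ADD, MUL */
    };

    /* rewrite x + 0 -> x and x * 1 -> x, bottom-up */
    static struct node *simplify(struct node *n) {
        if (n->kind == ADD || n->kind == MUL) {
            n->left  = simplify(n->left);
            n->right = simplify(n->right);
            if (n->kind == ADD && n->right->kind == CONST && n->right->value == 0)
                return n->left;        /* x + 0  ->  x */
            if (n->kind == MUL && n->right->kind == CONST && n->right->value == 1)
                return n->left;        /* x * 1  ->  x */
        }
        return n;
    }

    int main(void) {
        /* (x + 0) * 1 simplifies to just x */
        struct node x    = { VAR,   0, 0, 0 };
        struct node zero = { CONST, 0, 0, 0 }, one = { CONST, 1, 0, 0 };
        struct node sum  = { ADD, 0, &x, &zero };
        struct node prod = { MUL, 0, &sum, &one };

        printf("%s\n", simplify(&prod) == &x ? "reduced to x" : "unchanged");
        return 0;
    }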

Because of numerical roundoff and instability, equivalence transformations are much trickier for floating point values.

Local Optimization -- Peephole

Another source of local optimization is found in code selection and transformation at the machine level. For instance, the sequence

    y := 1;
    z := y + 3;

will almost surely fetch the value of y into a register again just after storing it into memory. But the register already has the value of y so the fetch can be eliminated. This is called a peephole optimization because it can be done based on looking at and transforming a small region of code after it has been synthesized. The implementation of peepholes is left to a smart assembler. The compiler gives the assembler a valid sequence of instructions; the assembler knows better and changes the code to run faster.

For this example, the X86 code sequence

      movRC EAX,=1
      movMR y,EAX
      movRM EAX,y
      movRC ECX,=3
      ADDRR EAX,ECX
      movMR z,EAX

becomes

      movRC EAX,=1
      movMR y,EAX
      movRC ECX,=3
      ADDRR EAX,ECX
      movMR z,EAX
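
A peephole pass of this kind is little more than a scan with a two-instruction window. Here is a sketch in C (an illustration; the instruction record and the single rule are invented for the example).

    #include <stdio.h>
    #include <string.h>

    struct insn { const char *op, *dst, *src; };

    /* drop a load of a memory cell into a register when the instruction
       just before it stored that same register into that same cell      */
    static int redundant_load(const struct insn *prev, const struct insn *cur) {
        return strcmp(prev->op, "movMR") == 0 &&   /* store reg -> mem */
               strcmp(cur->op,  "movRM") == 0 &&   /* load  mem -> reg */
               strcmp(prev->dst, cur->src) == 0 && /* same memory cell */
               strcmp(prev->src, cur->dst) == 0;   /* same register    */
    }

    int main(void) {
        struct insn code[] = {
            {"movRC", "EAX", "=1"}, {"movMR", "y",   "EAX"},
            {"movRM", "EAX", "y"},  {"movRC", "ECX", "=3"},
            {"ADDRR", "EAX", "ECX"},{"movMR", "z",   "EAX"},
        };
        int n = sizeof code / sizeof code[0];
        for (int i = 0; i < n; i++) {
            if (i > 0 && redundant_load(&code[i-1], &code[i]))
                continue;                          /* peephole: drop it */
            printf("%s %s,%s\n", code[i].op, code[i].dst, code[i].src);
        }
        return 0;
    }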

Non-local Optimization -- Common Subexpression Elimination

It is possible that a subexpression, say i+1, is used in several places in a program. The optimizer would compile code to compute the subexpression and assign the result to a temporary variable, then use the variable in place of the subexpressions. While one might think the original programmer could do this, it should not be a requirement since it might make the program less readable.

More to the point, many common subexpressions are not accessible to the programmer but can be manipulated by the compiler. The typical case is a complicated set of subscripts. For example

    A(i,j) = B(i,j)*3 + sqrt(A(i,j))

In MATLAB, each subscript not only needs to be evaluated, but also checked for being an exact integer, greater than zero, and no more than the array bound. The result is a lot of repetitive code doing exactly the same thing, over and over. These are common subexpressions, but at a level below the direct control of the programmer. The optimizer has to do the work.
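
The effect of the optimization can be pictured at the source level. In the C analogue below (a sketch; the flattened index i*ncols + j stands in for the subscript bookkeeping), the common subexpression is computed once into a temporary.

    #include <math.h>
    #include <stdio.h>

    /* before: the index expression i*ncols + j is evaluated three times */
    void assign_before(double *a, double *b, int ncols, int i, int j) {
        a[i*ncols + j] = b[i*ncols + j] * 3 + sqrt(a[i*ncols + j]);
    }

    /* after: the optimizer computes it once into a temporary t */
    void assign_after(double *a, double *b, int ncols, int i, int j) {
        int t = i*ncols + j;             /* the common subexpression */
        a[t] = b[t] * 3 + sqrt(a[t]);
    }

    int main(void) {
        double a[4] = {1, 4, 9, 16}, b[4] = {1, 2, 3, 4};
        assign_before(a, b, 2, 1, 0);    /* a[2] = 3*3 + sqrt(9)  = 12 */
        assign_after(a, b, 2, 1, 1);     /* a[3] = 4*3 + sqrt(16) = 16 */
        printf("%g %g\n", a[2], a[3]);   /* prints 12 16 */
        return 0;
    }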

This valuable optimization comes at a cost. The compiler must check that all instances have exactly the same value for each variable (no intervening assignments, no intervening branches, no changes from other threads). Then a place is picked to insert the unique prior evaluation and assignment to the temporary, and the existing subexpressions have to be deleted and replaced by an access to the temporary.

Suppose the unique evaluation of the subexpression throws an exception. The executable must, of course, report the exception. But what source location can be reported for inserted code? Single stepping with a debugger can be equally confusing.

Non-local Optimization -- Hoisting

An expression (common or not) may be constant inside of a loop. The compiler can move the point of evaluation outside the loop, assign the value to a temporary and use the temporary inside the loop. Because loops can nest, a hoisted expression can be part of another hoisted expression. This trick causes the same class of problems mentioned above with respect to exceptions and debuggers. In a MATLAB solution to the wave equation, moving a constant sparse matrix multiply out of the loop led to a 5x speedup.
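
As a source-level picture of hoisting (a sketch in C, not the MATLAB wave-equation code), a loop-invariant expression moves out of the loop into a temporary evaluated once.

    #include <math.h>
    #include <stdio.h>

    /* before: sqrt(k) does not depend on i, yet it is recomputed every trip */
    double sum_before(double *a, int n, double k) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * sqrt(k);
        return s;
    }

    /* after: the invariant expression is hoisted ahead of the loop */
    double sum_after(double *a, int n, double k) {
        double t = sqrt(k);              /* hoisted: evaluated once */
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * t;
        return s;
    }

    int main(void) {
        double a[3] = {1, 2, 3};
        printf("%g %g\n", sum_before(a, 3, 4.0), sum_after(a, 3, 4.0)); /* 12 12 */
        return 0;
    }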

Non-local Optimization -- Register Allocation

Keeping frequently accessed data in a register (as contrasted to keeping it in memory) is a powerful method of increasing execution speed. The problem is, of course, that there are only a limited number of registers. An optimizing compiler may never put a heavily used variable (such as a loop control index) in memory. Such a compiler typically has two stages of code production. The first stage assumes there are an infinite number of registers. Once the code is prepared, it is examined to find what actual assignment of registers is most efficient. Register spill instructions are inserted as necessary to clear registers for use with the expectation that they will be reloaded later.
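
The second stage can be pictured as follows (a sketch in the spirit of linear-scan allocation, not xcom's scheme; real allocators pick spill victims more carefully). Each virtual register from the first stage has a live interval; the pass walks the intervals in order of their start points and hands out the real registers, spilling when none is free.

    #include <stdio.h>

    #define NREAL 2
    static const char *real_name[NREAL] = { "EAX", "ECX" };

    struct interval { int vreg, start, end; };   /* live range of a virtual register */

    int main(void) {
        struct interval live[] = {               /* sorted by start point */
            {0, 0, 9}, {1, 1, 4}, {2, 2, 6}, {3, 5, 8},
        };
        int n = sizeof live / sizeof live[0];
        int busy_until[NREAL] = {0};             /* when each real register frees up */

        for (int i = 0; i < n; i++) {
            int r = -1;
            for (int j = 0; j < NREAL; j++)      /* look for a free real register */
                if (busy_until[j] <= live[i].start) { r = j; break; }
            if (r >= 0) {
                busy_until[r] = live[i].end;
                printf("v%d -> %s\n", live[i].vreg, real_name[r]);
            } else {
                printf("v%d -> spilled to the frame\n", live[i].vreg);
            }
        }
        return 0;
    }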

Non-local Optimization -- Parallel Execution

If the hardware contains multiple execution units, and the compiler can discover independent paths of execution that do not write to any shared memory, and the paths are long enough to repay the overhead, then each path can be launched on a separate thread, joining and proceeding sequentially after the slowest of the bunch is finished. Automatic discovery of such independent paths is, in general, very hard. The fundamental reason is that the original code is usually not written with parallelism in mind; computation is interleaved however the programmer feels is the clearest expression of intent.
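
When independent paths have been found (or written by hand), the launch-and-join pattern looks like this C sketch using POSIX threads (an illustration; each half of the sum writes only to its own slot, so nothing shared is written).

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];
    static double partial[2];            /* one slot per thread, no sharing */

    static void *sum_half(void *arg) {
        int half = *(int *)arg;
        int lo = half * (N / 2), hi = lo + N / 2;
        double s = 0.0;
        for (int i = lo; i < hi; i++)
            s += data[i];
        partial[half] = s;
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        pthread_t t[2];
        int id[2] = {0, 1};
        for (int k = 0; k < 2; k++)       /* launch the two paths */
            pthread_create(&t[k], NULL, sum_half, &id[k]);
        for (int k = 0; k < 2; k++)       /* join: wait for the slowest */
            pthread_join(t[k], NULL);

        printf("total = %g\n", partial[0] + partial[1]);
        return 0;
    }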

Serious Industrial-strength Optimization

A competitive optimizer takes 100+ engineer-years of effort. Here is a brief paper about a code optimizer, built by a team of 20 during the 1990s at Digital. (Note: it was called Gem, which is also the name used in this course for Grammar Execution Machine.)