MATLAB 101 -- A talk for the MIT Computer Science Department

by Bill McKeeman updated November 9, 2005

Contents

Abstract

Professors Daniel Jackson (MIT) and Robert Muller (BC) invited me to introduce MATLAB their research groups. As it turns out, the computer science department is about the only part of MIT that does NOT use MATLAB. I talked from the following notes (which were handed out). MATLAB SP3 was running on a large overhead screen during the talk.

Introduction

MATLAB is a high-level computer-based programming environment. It originally elevated matrix calculations from FORTRAN subroutine calls to assignment syntax. MATLAB has grown over 20 years from academic research into a generalized tool for a wide variety of applications, including vehicle crash simulation, financial prediction, genome analysis, imbedded computer control, aircraft design and so on. More than 200 MathWorks developers are working on the next release. Another 1000 people run the rest of the business, in Natick and worldwide.

There are about a million users. Some MATLAB users do not think of what they are doing as "programming." Some users are interested in quick and easy results. Some users want to build applications that are bullet proof and reusable over decades. I know of 100000 line MATLAB programs and MATLAB programs mining petabytes of data and others running 100x parallel for days at a time. Some universities teach MATLAB for beginning programming. On every customer contact I find new surprises. MATLAB satisfies all these communities.

The topic today is the features of the M language, their rationale, and how performance and quality are kept competitive with more conventional solutions. Much of what I have to say is available in the MathWorks online documentation. Many users never need more than MATLAB Help to get their work done.

I will illustrate the talk with running MATLAB examples. To save screen space, I will call MATLAB function "format" to tighten up the MATLAB output and "clear" to start with a clean workspace.

format compact
clear variables
          [brief interlude to look at MATLAB gui]

Origins

The fundamental structure of the orginal MATLAB was the martix.

B = ones(2,2)
C = [1,2; 2,5]
B =
     1     1
     1     1
C =
     1     2
     2     5

are 2x2 matrices. The output is displayed because there is no terminating ';' on the assignment statements.

For NxN matrices

A = B/C
A =
     3    -1
     3    -1

means multiply B by the inverse of C and store the result in A. MATLAB was originally implemented in FORTRAN according the recipe in Niklaus Wirth's "Algorithms + Data Structures = Programs". Values were double real. Scalars, Vectors and Matrices had the same representation, with shape 1x1, 1xN and MxN. MATLAB builtin functions came from the best packages available. Execution was interpretive. Efficiency was acceptable because heavy computation was carried by the builtins.

Variables

Variables are not declared in M. Each assignment of a computed value to a name causes storage to be allocated; the assigned name has a new type, shape and value. Abandoned storage is reused. Subscripts describe subarrays and storage to a variable is enlarged dynamically if needed. Assignment to part of a matrix does not change the type of the matrix. For example,

V=1:10
V(end:-1:1) = V
V =
     1     2     3     4     5     6     7     8     9    10
V =
    10     9     8     7     6     5     4     3     2     1

reverses the data in vector V.

  V = V(V>3)
V =
    10     9     8     7     6     5     4

selects all elements greater than 3 using a logical subscript and prints them out. Assuming M has not yet been used:

  M(3,3,3) = 0.0
M(:,:,1) =
     0     0     0
     0     0     0
     0     0     0
M(:,:,2) =
     0     0     0
     0     0     0
     0     0     0
M(:,:,3) =
     0     0     0
     0     0     0
     0     0     0

allocates a 3x3x3 matrix filled with value 0.0. One can fill a slice

  M(:,:,1) = magic(3)
M(:,:,1) =
     8     1     6
     3     5     7
     4     9     2
M(:,:,2) =
     0     0     0
     0     0     0
     0     0     0
M(:,:,3) =
     0     0     0
     0     0     0
     0     0     0

with the 3x3 magic square and then invert the slice

  I = M(:,:,1)^-1
I =
    0.1472   -0.1444    0.0639
   -0.0611    0.0222    0.1056
   -0.0194    0.1889   -0.1028

The whole matrix can be accessed as a vector

  M(7) = 13
M(:,:,1) =
     8     1    13
     3     5     7
     4     9     2
M(:,:,2) =
     0     0     0
     0     0     0
     0     0     0
M(:,:,3) =
     0     0     0
     0     0     0
     0     0     0

where the FORTRAN style matrix data layout is assumed. Builtin functions provide the user access to type and shape

  [a,b,c] = size(M), n = numel(M),  d = class(M),  whos M,
a =
     3
b =
     3
c =
     3
n =
    27
d =
double
  Name      Size                           Bytes  Class

  M         3x3x3                            216  double array

Grand total is 27 elements using 216 bytes

Structures are as dynamic as arrays. Assuming s has not yet been used,

  S.b = int8(17)
S = 
    b: 17

creates a struct S and a field b in S then assigns an 8-bit int scalar value to that field. The next assignment to field b gives it a new type, shape and value.

Arithmetic

I count over 1500 functions in the MATLAB help. The usual suspects (< == + - * / ^ max, abs, sin, sqrt...) are available. The "*" operator means matrix multiply. The ".*" operator means element- by-element multiply. Most functions work on all sizes and shapes of matrices as well as scalars. Accuracy satisfies a numerical analyst.

        [Brief interlude to look at MATLAB Help functions]

Control

The usual suspects (if, for, while, break, continue, switch, try, catch, function, return) complete the other half of "algorithms+data." The syntax is naturally nested so there is no need for "begin" or "{".

      [Bring up windfern.m in the MATLAB Editor]

Functions

An M function is defined in a dot-m file. It has the form

function [a,b,...] = f(c,d...)
  _statments..._

where there are nargout outputs, and nargin inputs. There is abbreviated syntax for the special cases of 0 and 1. When called, a function gets a stack frame (called the workspace). Parameters are passed by value into the workspace but a hidden optimization avoids copying unless the function modifies the input data. So

  V = sort(V)
V =
     4     5     6     7     8     9    10

causes only one memory allocation (for the result). One can implement an overloaded version of a function by placing a second function definition in a directory named for the dominant type among the input parameters. For example, sort(A) where A is a cell array of strings is defined in a file .../@cell/sort.m. The search order for the needed function is decided by the MATLAB path.

Helper functions for a MATLAB function are defined at the end of the mfile. Helper functions are only available to each other and the main function.

Nested functions are superficially similar to helper functions. The difference is that they have access to the workspace of the containing function(s) as well as their own private workspace. Nested functions are often used where a C++ programmer would use class methods.

MathWorks supplies a gazillion builtin functions with hardwired interpretation. The use is indistinguishable from user-defined functions except that the definition is out of sight. The MATLAB gui has a Help pulldown which leads to the documentation.

Since the MATLAB command line is also a user inteface (analogous to a unix shell), there is an alternative syntax, command-dual, for function calls. The two calls to list the current directory

  dir *.m
matlab101.m  windfern.m   

and

  dir('*.m')
matlab101.m  windfern.m   

have the same meaning. Command dual arguments are always strings and a command dual is always a complete statement (not an expression).

MATLAB also has lambda calculus form of function definition:

  dist = @(x,y) sqrt(x^2 + y^2);

which, in this case, defines a distance function. The call

  dist(3,4)
ans =
     5

gives 5; The body of such a function is always an expression (not a statement). The variable dist is called a function handle. It can be assigned and passed as a parameter. If the function handle is defined in another function, it carries along the needed workspace(s), even after the defining context has returned.

Eval

The compiler can be called at runtime. For example,

  x = 3; t = 'sqrt(x)'; eval(t)
ans =
    1.7321

The argument may be a string representing a statement list or an expression. There are several forms of eval.

Toolboxes

Toolboxes full of M code, most notably SIMULINK, began to appear as soon as MATLAB was available. There are now about 70 MATLAB-based products sold by MathWorks and many more offered by other individuals and organizations.

MATLAB Evolution

MATLAB and its toolboxes are primarily written in C++, Java and M. The needs of toolbox authors motivated extensions to the M language and its supporting library of functions. Strings and graphics are required to present results. Efficiency sometimes required the ability to drop from M into user-written C and FORTRAN giving rise to mex files. Image processing, robotics, and imbedded applications require integer types. Large applications require program structuring constructs. I18N require UTF and JIS. The need for efficient development gave rise to an integrated programming environment, debugging and performance tools, and a program advisor called mlint. Windows were opened in MATLAB to access Maple, Java and .NET. The need for performance motivated JIT technology. The need for our customers to securely sell their MATLAB solutions to third parties has motivated exporting capabilities. The need for reliability and scalability now drives a company-wide quality assurance programme.

Hundreds of books are based on MATLAB. The codebase has increased from about 2 million lines of code in 1997 to 23 million lines of code in 2005. The process goes on today.

Semantic Model

It is reasonable to ask about the underlying semantic model of MATLAB. MATLAB is what MATLAB does. Twenty years ago it was simple: imperative with dynamic type. Execution was carried out by a p-code style interpreter; modifying the interpreter was the only practical limitation on several generations of inventive MATLAB developers. For a long time none of those developers were compiler or programming language experts. The good news is customers got what they needed. The bad news is that new features do not always fit well with the old, making it hard to tell a "simple MATLAB story."

Nowadays "experts" are slowly sanding off the rough corners left by the ad-hoc growth of MATLAB. The operable limitation is backward compatibility. MathWorks is committed to making old M programs run on new releases, although once in a long while some feature does need to be deprecated and retired. Often old code turns out to have surprising behavior. For instance, the time functions (clock, cputime, tic, toc, date...) have been pushed well beyond their designs. There are not enough bits in double float to represent the granularity with which time can now be measured. Variable speed processors invalidate some assumed properties of the cpu-as-a-clock. Each new hardware or service pack can bring a new surprise to the surface.

The latest extensions to the MATLAB language are in the direction of ML.

Performance

In the beginning, for an assignment

x = a + b;

the p-code pushed pointers to a and b on the run stack and called the "plus" function (written in C) which then rummaged through the headers for a and b, discovered type and shape, carried out the required "+" operation(s), leaving the result in newly allocated storage which the p-code then assigned to x. When a and b were scalars, this was about 1000x slower than the hardware floating add instruction.

One could, and MathWorks did, translate M to C. The first version of mcc did type and shape analysis for a subset of M and therefore made pretty and efficient C. The second version of mcc did the whole language but lost some of the type analysis in the process, and therefore some of the efficiency. The JIT can usually turn "x = a + b" into a couple machine instructions. For many things MATLAB is fast enough that there is no motivation to translate to C.

Parallel to the JIT development, integer types were added to the language. The 8, 16, 32 and 64 bit ints implement saturated arithmetic. That is

   2*uint8(200) == uint8(255)
ans =
     1

This is good for image processing (you can't get whiter than white) and robotics (the arm runs into a stop somewhere) but less good for users who secretly want C semantics (and speed). The int vector operations however are faster than C because the Intel MMX hardware directly implements saturation. The genomics folks are happy enough not allocating 64 bits for A, T, G and C.

In addition to general speed-up mechanisms, there are sometimes cycles to be wrung out of individual functions. The MATLAB sort is faster than C qsort, and works on all the MATLAB types and shapes. Getting sort to this point required an investment of considerable time and effort. And sort is only one of the gazillion builtins.

The bottom line on performance improvement is that MATLAB is 10000 times faster than it once was for many things. A factor of 100 is a gift from the hardware during the 8086 to Pentium 4 evolution.

open('perfcontext.fig')

Measuring Performance

Suppose a large organization has a central file server holding working data and programs. A data-read command crosses the network and is queued on the server. Once the data gets off the rotating machinery, it goes into server cache, back across the network and into local network cache. If the data is executable, it must either be compiled or at least linked prior to starting anything the user cares about. The second fetch of the same data may be, relatively speaking, instantaneous. Or the your first fetch may accidentally be fast, if someone else happened to get your data into the server cache for you.

Getting performance data that is meaningful to a developer requires an experimental setup isolated from external events including email, browsers and even casual mouse movements. My laptop must be kept cool, fully charged, and 100% busy; otherwise it is willing to change its own clock rate 100 times a second. Some computers halt in the idle loop, waiting for an interrupt to resume ticking. A performance test including i/o is almost certainly going to do some waiting.

There are several consequences to the above challenges. Time data is noisy at best. Multiple processors and threads may do things "off the books." Wallclock time is the metric we have chosen to optimize. Here is some data on MATLAB sort over the last ten years.

open('sort.fig');

We run and ignore a few warmup tests, then take the minimum time of the next N tests. The reasoning is that the "real" computation is always done and is sometimes polluted with incremental positive noise. Because of the need for repeatable results, testing and tuning are almost never done with a user-realistic setup. We tell ourselves that making MATLAB fast in a vacuum is all we can contribute; the rest is up to our platform vendors, careful configuration, and wise users.

We can run windfern a few times, printing the time to compute the fern and time to plot the results. Plotting is slower than computation by a considerable factor. All of the computation is JITted.

windfern(50000,5)
rep=1 compute=0.1 secs plot=1 secs
rep=2 compute=0.3 secs plot=3 secs
rep=3 compute=0.2 secs plot=2 secs
rep=4 compute=0.2 secs plot=2 secs
rep=5 compute=0.2 secs plot=2 secs

Quality

It is hard to get 17 million lines of code right on several platforms, depending on operating systems, compilers, run times, Java, .NET, unicode, graphics accelerators, and other technologies that are themselves under serious developmental stress.

A substantial percentage of the programming staff is devoted to building and running test suites. Bugs are tracked from first report to verification. Whenever a bug is found and fixed, a definitive test is added. Where exhaustive tests are infeasible, the space is randomly sampled and compared across platforms and releases. Limits are found and exceeded so that MATLAB does not crumble into a rubble of bits. (How many left parens does it take to break the parser?). All code is reviewed by someone other than the developer before it can be submitted for inclusion in shipping code. Prior to ship, the entire company participates in an all-day bash to explore the dark corners and comb out remaining errors. Customers are invited to report any behavior that seems wrong to them, both in beta testing and after the product is shipped. Other traditional manufacturing quality methods are applied.

Bottom Line

MATLAB is a working programming system. It saves time and effort. A lot of people spend their time keeping it so.