Second Forum on Parallel Computing Curricula
Sunday, June 22, 1997
Newport, RI
We discuss the use of the Message Passing Interface Standard (MPI) in the teaching of undergraduate classes in parallel computing. We briefly introduce the MPI standard and discuss why we have chosen to use it in parallel computing classes. We also discuss the use of MPI in a conventional parallel algorithms class and in a project-oriented class. We include a brief enumeration of some of the drawbacks to using MPI and how these drawbacks are being addressed by the MPI-2 standard.
Until recently instructors of courses in parallel computing necessarily discussed the use and development of parallel software in the context of a vendor's proprietary software or software being developed by a research group, or, even worse, simply left the problems of using and developing parallel software as ``exercises'' for the students. A consequence of this was that whatever software tools a student used and methods a student learned for parallel software development would, very likely, be useless or unavailable when he or she went on to additional courses or went to work in parallel software development. Further, when the hardware environment for classes changed, it was also very likely that the previously used software wouldn't run on the new system. This unfortunate state of affairs exacerbated the already extremely difficult problems of parallel computing education and parallel software engineering.
Now, however, the advent of the carefully designed standards for programming parallel systems, High-Performance Fortran (HPF) [HPF] and MPI [MPI], provides us, as teachers of parallel computing, with the opportunity to illustrate parallel algorithms in a uniform manner, to teach the development of portable parallel software, to use parallel libraries and to develop portable parallel software for our courses. Thus our students will learn a lingua franca for discussing parallel software, and they will learn methods for developing parallel software that will continue to be useful after completing our classes. Further we will be able to amortize our investment in the development of parallel software for classes by continuing to use the software we develop on future hardware.
At the University of San Francisco, we have exploited this opportunity by using the Message Passing Interface standard (MPI) in our parallel computing courses. In this article we will briefly discuss MPI and why we have chosen to use it in our classes. We will also discuss how we have used it in two different types of classes. The first class is a parallel algorithms class. The second class is a software development project in which the students spend a year working on a project specified by a company or government agency. We close with a discussion of some of the drawbacks to the use of MPI and how MPI-2 [MPI-2], the recently announced collection of extensions to MPI, helps address these drawbacks.
MPI is not a
new programming language; rather it is a library of subprograms
that can be called from C and Fortran 77 programs. It was developed
by an open, international forum consisting of representatives from industry,
academia, and government laboratories. Unlike most
proprietary and academic parallel programming systems,
MPI is based almost entirely on widely-used, well-understood
methods for parallel programming. At its heart is
the simple send-receive pair: in order for two
processes to communicate data, the sender calls
the function MPI_Send and the receiver calls
the function MPI_Recv. In the C extensions
to MPI, their syntax is
int MPI_Send(void* message, int count, MPI_Datatype datatype,
int destination, int tag, MPI_Comm comm);
int MPI_Recv(void* message, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Status* status);
MPI was designed to facilitate
the development of Single-Program Multiple Data (SPMD)
programs, although it is perfectly feasible to write
MPI programs that run completely different codes on
different processes.
MPI assumes that an instance of a parallel program consists of a collection of autonomous processes with distinct non-negative integer ranks. The process ranks are used to identify source and destination in communications. The standard specifies a variety of point-to-point and collective communication functions. The point-to-point functions include blocking and non-blocking versions of both buffered and synchronous functions. The collective functions include barrier, broadcast, gather, scatter, and a variety of global reduction operations.
The principal innovations in MPI are communicators and user-defined datatypes. Communicators provide a system-defined method for partitioning ``message-space.'' With user-defined datatypes programs can send a message composed of noncontiguous data without the additional work of packing and unpacking the data. We discuss each of these in turn.
The need for communicators became apparent when developers tried to address the complexity of parallel programming by encapsulating frequently occurring constructs in libraries. For example, most vendors provided libraries of functions for collective communications [nCUBE,Pierce], and a few researchers worked on providing libraries for common mathematical operations such as matrix multiplication [Choi,vandeGeijn]. One difficulty that developers encountered was that they had no ready means of guaranteeing that messages sent by the library were not intercepted by user-defined functions or vice-versa. Essentially the only means for assuring that messages were not incorrectly received was the tag or message-type.
Recollect that a
tag is simply
an integer argument that is passed to a communication function
and that can be used to uniquely identify a
message. For example, in MPI if process A sends a
message to process B, then
in order for B to receive the message, the tag used in
A's call to MPI_Send must be the same as the
tag used in B's called to MPI_Recv. Thus,
if the characteristics of two messages sent by A
to B are identical (i.e., same count and datatype),
then A and B can distinguish between the two by using
different tags.
For example, suppose A is sending two
floats, x and y, to B. Then the processes
can be sure that the values are received correctly, regardless
of the order in which A sends and B receives, provided different
tags are used:
/* Assume system provides some buffering */
if (my_rank == A) {
tag = 0;
MPI_Send(&x, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
. . .
tag = 1;
MPI_Send(&y, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
} else if (my_rank == B) {
tag = 1;
MPI_Recv(&y, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
. . .
tag = 0;
MPI_Recv(&x, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
}
Now if one message from process A to process B is being sent by the library, and another, with identical characteristics, is being sent by the user's code, unless the library developer insists that user programs refrain from using certain tag values, this approach cannot be made to work. Clearly, partitioning the set of possible tags is at best an inconvenience: if one wishes to modify an existing user code so that it can use a library that partitions tags, each message passing function in the entire user code must be checked. If a user wishes to use two libraries in a single program, unless he or she has the source code for the libraries, this method cannot work.
Since it is commonly believed that one solution to the problem of the complexity of parallel programming is the use of libraries, this posed a difficult problem for the MPI Forum. The solution that was ultimately decided on was the communicator. Formally, a communicator is a pair of objects: the first is a group or ordered collection of processes, and the second is a context, which can be viewed as a unique, system-defined tag. Every communication function in MPI takes a communicator argument, and a communication can succeed only if all the processes participating in the communication use the same communicator argument. Thus, a library can either require that its functions be passed a unique library-specific communicator, or its functions can create their own unique communicator. In either case, it is straightforward for the library designer and the user to make certain that their messages are not confused.
For example, suppose now that the user's code is sending a float,
x, from process A to process B, while the library is sending
a float, y, from A to B:
/* Assume system provides some buffering */
void User_function(int my_rank, float* x) {
MPI_Status status;
if (my_rank == A) {
/* MPI_COMM_WORLD is pre-defined in MPI */
MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
} else if (my_rank == B) {
MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
}
. . .
}
void Library_function(float* y) {
MPI_Comm library_comm;
MPI_Status status;
int my_rank;
/* Create a communicator with the same group */
/* as MPI_COMM_WORLD, but a different context */
MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
/* Get process rank in new communicator */
MPI_Comm_rank(library_comm, &my_rank);
if (my_rank == A)
MPI_Send(y, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
} else if (my_rank == B) {
MPI_Recv(y, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
}
. . .
}
int main(int argc, char* argv[]) {
. . .
if (my_rank == A) {
User_function(A, &x);
. . .
Library_function(&y);
} else if (my_rank == B) {
Library_function(&y);
. . .
User_function(B, &x);
}
. . .
}
Prior to the advent of MPI, it was necessary in many parallel systems to send noncontiguous data or data composed of different types in separate messages or to first pack the data into a single message and then unpack the data when it was received. Indeed, this may have been necessary even if the communications subsystem allowed applications to gather data when a message was sent and to scatter data when a message was received. The second main innovation in MPI, user-defined datatypes, allows programmers to exploit this power, and as a consequence, to create messages consisting of logically unified sets of data rather than only physically contiguous blocks of data.
Loosely, an MPI datatype is a sequence of
displacements in memory together with
a collection of basic datatypes (e.g.,
int, float, double, and char).
Thus, an MPI-datatype specifies the layout in
memory of data to be collected into a single
message or data to be distributed from a single
message. For example, suppose we specify
a sparse matrix entry with the following definition.
typedef struct {
double entry;
int row, col;
} mat_entry_t;
MPI provides functions for creating
a variable that stores the layout in memory of
a variable of type
mat_entry_t.
One does this
by first defining an MPI datatype
MPI_Datatype mat_entry_mpi_t;
to be used in communication functions, and then calling
various MPI functions to initialize mat_entry_mpi_t
so that it contains the required layout.
Then, if we define
mat_entry_t x;
we can send x by simply calling
MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag, comm);
and we can receive x with a similar call to
MPI_Recv.
There are currently two standards for programming parallel systems that were developed in open forums: High Performance Fortran (HPF) [HPF] and MPI [MPI]. HPF relies on advanced compiler technology to expedite the development of data-parallel programs. Thus, although it is based on Fortran, HPF is a new language, and hence requires the construction of new compilers. As a consequence each implementation of HPF is, to a great extent, hardware specific, and until recently there were very few complete HPF implementations. Furthermore most of the current implementations are proprietary and quite expensive.
On the other hand, MPI, as we noted earlier, specifies a library of extensions to C and Fortran that can be used to write message passing programs. So an implementation of MPI can make use of existing compilers, and it is possible to develop more-or-less portable MPI libraries. Thus, unlike HPF, it is relatively easy to find an MPI library that will run on existing hardware. Indeed, there are at least three implementations [Gropp,OSC,Bruce] that run on networks of Unix workstations, and there are two implementations [Meyer,Marinho] that run on Windows PC's. Further, all of these implementations can be freely downloaded from the internet, and all but one of them comes with source code.
Perhaps the most important difference between HPF and MPI lies in the fact that HPF has been written for the express purpose of writing data-parallel programs, and, as a consequence it is not well-suited for dealing with irregular data-structures or control-parallel programs. Message passing, on the other hand, is a completely general method for parallel programming. Indeed, the generality and ready availability of MPI have made it one of the most widely used systems for parallel programming.
Ideally, a parallel computing class would introduce students to both HPF and MPI. However, it is a truism that learning to program parallel systems is extremely difficult, and we believe that attempting to learn two essentially different methodologies in one semester or one quarter will, especially for the weaker students, make it virtually impossible to learn either. Thus in view of the difficulties involved in obtaining an HPF compiler and the generality of MPI, we have chosen to use MPI.
The parallel algorithms class at USF is a semester-long class for upper-level undergraduates and beginning graduate students. In it we used a draft version of the MPI text [Pacheco97] as a supplement to a parallel algorithms text [Kumar]. Since our computer science students learn to program in C++, all the code that we present in the lectures uses C rather than Fortran.
The course begins with an overview of parallel computing (chapter 2 in [Pacheco97]) and continues with a brief introduction to message passing and MPI. The next block of lectures forms a transition into a more or less standard parallel algorithms course. We first discuss serial and parallel versions of a very simple computation -- e.g., dot product. In the course of analysing the performance of these algorithms, we develop the concepts of speedup and efficiency. The deterioration of the performance of the parallel algorithm as the number of processes is increased leads naturally into a discussion of Amdahl's Law and scalability. The remainder of the lectures are devoted to the development and analysis of a variety of standard parallel algorithms from linear algebra, some algorithms for searching and sorting, and graph algorithms. See [Pacheco96] for further details.
Much of the detailed introductory material
on MPI is introduced in the lab through the tutorial
chapters (3 - 7) in [Pacheco97].
More advanced special functions from MPI that facilitate
implementation of the algorithms are introduced in the
lectures as
necessary. For example, in the discussion of
dynamic load balancing, we discuss nonblocking
communication and MPI_Iprobe,
which can be used to test for the arrival of
unsolicited messages.
The coursework consists of a midterm, a final exam and four programming assignments. The purpose of the first assignment is simply to acquaint students with the system; in it, they write a ping-pong code, and estimate the bandwidth and latency of the system. The second and third assignments are more substantial. In the second, the students implement Jacobi's method for solving a linear system, and in the third, the students implement parallel bitonic sort. The last assignment is to be a project of the student's choosing. However, most students ask for suggestions, and most implement a version of parallel tree search that uses dynamic load balancing. See [Pacheco96] for further details.
The students develop their programs
on a network of SGI's and an nCUBE 2 with a Sun
workstation host. Both systems run the
mpich [Gropp] implementation of MPI. Although
both the LAM [OSC] and Chimp
[Bruce] implementations are
available on the network of SGI's,
the students are encouraged to use mpich since it is the
only implementation that
allows full access to stdio from all
processes. This seems to outweigh the
benefit of access to better debugging
facilities with the LAM implementation, since it
seems
that most students don't make use
of any debugging facilities beyond printf
and fflush. Furthermore, when they do
try to use debuggers, they frequently get
confused by the details of how to use them.
Except for the first assignment, the students found the programs to be quite difficult. Most of the difficulties, however, had little to do with MPI. We had taught similar classes twice previously, and in both of the previous classes we used the nCUBE library of message-passing functions. In both the MPI-based class and the nCUBE-based classes, the students had little difficulty understanding the use of basic point-to-point and collective communication functions. Rather the most serious difficulties that students had with actually writing code were the differences between C and C++. Virtually every student in the class needed help with parameter passing in C: it seems that since C++ allows reference parameters, students had never learned about passing pointers.
The two aspects of MPI that caused the most difficulty were its prohibition on aliasing of arguments and the use of user-defined MPI datatypes. Roughly, the prohibition on aliasing of arguments in MPI says that a single variable cannot be passed as an argument to two distinct formal parameters of an MPI function. For example, if one wishes to distribute an array of floats stored on process 0 among all the processes, the following function call is illegal in MPI:
float x[MAX];
MPI_Scatter(x, MAX, MPI_FLOAT, x, MAX/num_procs, MPI_FLOAT,
0, MPI_COMM_WORLD;
MPI requires that distinct arguments be passed
to the first and fourth formal parameters.
User-defined datatypes seem to cause difficulty because the rules for type-matching in MPI are fairly involved, and they increase the complexity of message-passing. Getting user-defined datatypes to match adds one more potential problem to the already difficult task of sending messages.
This last point should be stressed: the greatest difficulty the students encountered didn't involve details of syntax; rather their greatest difficulty was with the logic of handling message-passing.
The applied mathematics research laboratory [Pacheco94] at USF is a class in which seniors and beginning graduate students spend a year working on an industry or government sponsored project. In the academic year 1994-1995 IBM and nCUBE sponsored a project in which the students worked on parallelizing the speech recognition program Sphinx [Lee]. Previous projects have involved parallelizing circuit simulation programs and developing parallel libraries for the iterative solution of sparse linear systems.
The first semester was organized as a seminar. It began with lectures to the students on the basics of signal processing, speech recognition, and parallel computing. During the remainder of the semester, the students took turns giving presentations on more specialized aspects of these topics. During the second semester, there were no formal classes. The students were given a broad outline for attacking the project, and they divided into groups to work on various aspects of the project. The entire group met once or twice a week to discuss problems and progress. See [Pacheco94] for details.
The vast majority of the work during the first two-thirds of the second semester involved programming. First, the students worked on understanding the serial code, and then they worked on parallelizing it. During the final third of the second semester, most of the students moved from working on the code to developing a multimedia presentation on the project for the sponsors and the University.
In order to learn about parallel programming, during the first semester the students were assigned two small, individual programming projects. For the first they parallelized the trapezoidal rule and for the second they parallelized a brute-force solution to the travelling salesman problem. Before the first assignment, there was a lecture on those parts of MPI that they would need for the assignment, and for the second assignment they were given a preliminary version of A User's Guide to MPI [Pacheco95]. See [Pacheco94] for further details.
When the students started working on parallelizing
Sphinx during the second semester, their main
resources for information about MPI
were the instructor, the User's Guide and the on-line
MPI man pages.
The students developed the program on an nCUBE 2
running the mpich implementation [Gropp]
of MPI. All coding was done in C.
The code itself involved a parallel
tree search with a static partitioning of the tree.
The students' difficulties with the technical details of programming were similar to those encountered in the parallel algorithms class. They had difficulty with the differences between parameter passing in C and C++, the prohibition on aliasing in MPI, and the use of user-defined MPI datatypes.
The project itself was highly successful. The students obtained nearly linear speedup with 16 or fewer processors, and after seeing the final presentation, the sponsors were so impressed that they said that they would be happy to offer jobs to all of the students.
The two biggest drawbacks to using MPI in a class are the fact that programming must be done in C or Fortran and the fact that the standard fails to specify any I/O capabilities. As any teacher of computer science knows, C and Fortran are not very good languages for learning to program. Further, the fact that C++ is an extension of C leads the unwary to believe that if they know how to program in C++, they also know how to program in C -- a conclusion which we found to be unwarranted in every case.
The fact that MPI failed to specify
I/O means, unfortunately, that any MPI program that
carries out I/O will not be portable.
We alluded to this problem earlier, when
we mentioned that neither the Chimp
[Bruce]
implementation nor the (earlier) LAM
[OSC]
implementation allow every process access
to stdio. Fortunately, this problem
was somewhat mitigated by the fact that
the mpich implementations that
we used do allow each process full access
to stdio.
There are, of course, other objections to using MPI: it makes no provision for dynamic process management, and hence dynamic client-server applications are poorly suited to MPI. It also makes no provision for one-sided communications, and one-sided communications are currently showing great promise in reducing the high cost of communication.
Fortunately, most of these problems are being addressed in MPI-2 [MPI-2]. After the introduction of MPI, the MPI Forum continued its work on developing a portable standard for parallel computing. Its main purpose is to address deficiencies in the current standard. It will include specifications for parallel I/O, one-sided communications, and dynamic process management. It will also include C++ bindings for all of the MPI functions. The final MPI-2 specification is scheduled for completion by June, 1997. So by the fall of 1997, it should be possible to include many of these features in an MPI-based parallel computing class.
In general students seem to like
MPI. In both classes, they made extensive use of
its collective
communication functions: they wanted to use
them, even when it meant that they would have
to learn about complex user-defined datatypes.
They seemed to find
the concept of a communicator quite natural and
the details of using communicators in programs caused
few problems. Those students that experimented with more
esoteric functions, such as nonblocking communication and
MPI_Iprobe,
didn't seem to have much difficulty.
A notable exception to this was the
use of user-defined MPI datatypes. Although all
of the students eventually used them successfully,
it seems that they are peculiarly susceptible to
extremely obscure bugs. There is some software
designed to help with use of user-defined datatypes
[May]. In the future we plan to
make this software
available to the students.
MPI can be readily incorporated into the curriculum of any parallel computing class that uses message-passing programming. Since code, development tools, and methods for developing programs in MPI are all portable, its incorporation should greatly benefit both students and instructors. The fact that there are freely available, high-quality implementations of MPI that run on virtually all platforms makes it very practical to use MPI in parallel computing classes, and when these observations are combined with the fact that MPI-2 addresses the main limitations to MPI, we have no hesitation in strongly endorsing its use in any parallel computing class.
mpich,
a Portable Implementation of MPI.
Online document available as URL
ftp://info.mcs.anl.gov/pub/mpi/userguide.ps.Z. 1997.