Second Forum on Parallel Computing Curricula
Sunday, June 22, 1997
Newport, RI

Using MPI to Teach Parallel Computing


Peter Pacheco
Department of Computer Science
University of San Francisco
San Francisco, CA 94117
415-422-6630
peter@usfca.edu

Abstract

We discuss the use of the Message Passing Interface Standard (MPI) in the teaching of undergraduate classes in parallel computing. We briefly introduce the MPI standard and discuss why we have chosen to use it in parallel computing classes. We also discuss the use of MPI in a conventional parallel algorithms class and in a project-oriented class. We include a brief enumeration of some of the drawbacks to using MPI and how these drawbacks are being addressed by the MPI-2 standard.

Contents

1. Introduction

Until recently instructors of courses in parallel computing necessarily discussed the use and development of parallel software in the context of a vendor's proprietary software or software being developed by a research group, or, even worse, simply left the problems of using and developing parallel software as ``exercises'' for the students. A consequence of this was that whatever software tools a student used and methods a student learned for parallel software development would, very likely, be useless or unavailable when he or she went on to additional courses or went to work in parallel software development. Further, when the hardware environment for classes changed, it was also very likely that the previously used software wouldn't run on the new system. This unfortunate state of affairs exacerbated the already extremely difficult problems of parallel computing education and parallel software engineering.

Now, however, the advent of the carefully designed standards for programming parallel systems, High-Performance Fortran (HPF) [HPF] and MPI [MPI], provides us, as teachers of parallel computing, with the opportunity to illustrate parallel algorithms in a uniform manner, to teach the development of portable parallel software, to use parallel libraries and to develop portable parallel software for our courses. Thus our students will learn a lingua franca for discussing parallel software, and they will learn methods for developing parallel software that will continue to be useful after completing our classes. Further we will be able to amortize our investment in the development of parallel software for classes by continuing to use the software we develop on future hardware.

At the University of San Francisco, we have exploited this opportunity by using the Message Passing Interface standard (MPI) in our parallel computing courses. In this article we will briefly discuss MPI and why we have chosen to use it in our classes. We will also discuss how we have used it in two different types of classes. The first class is a parallel algorithms class. The second class is a software development project in which the students spend a year working on a project specified by a company or government agency. We close with a discussion of some of the drawbacks to the use of MPI and how MPI-2 [MPI-2], the recently announced collection of extensions to MPI, helps address these drawbacks.

2. About MPI

MPI is not a new programming language; rather it is a library of subprograms that can be called from C and Fortran 77 programs. It was developed by an open, international forum consisting of representatives from industry, academia, and government laboratories. Unlike most proprietary and academic parallel programming systems, MPI is based almost entirely on widely-used, well-understood methods for parallel programming. At its heart is the simple send-receive pair: in order for two processes to communicate data, the sender calls the function MPI_Send and the receiver calls the function MPI_Recv. In the C extensions to MPI, their syntax is

    int MPI_Send(void* message, int count, MPI_Datatype datatype,
            int destination, int tag, MPI_Comm comm);
    int MPI_Recv(void* message, int count, MPI_Datatype datatype,
            int source, int tag, MPI_Comm comm, MPI_Status* status);

MPI was designed to facilitate the development of Single-Program Multiple Data (SPMD) programs, although it is perfectly feasible to write MPI programs that run completely different codes on different processes.

MPI assumes that an instance of a parallel program consists of a collection of autonomous processes with distinct non-negative integer ranks. The process ranks are used to identify source and destination in communications. The standard specifies a variety of point-to-point and collective communication functions. The point-to-point functions include blocking and non-blocking versions of both buffered and synchronous functions. The collective functions include barrier, broadcast, gather, scatter, and a variety of global reduction operations.

The principal innovations in MPI are communicators and user-defined datatypes. Communicators provide a system-defined method for partitioning ``message-space.'' With user-defined datatypes programs can send a message composed of noncontiguous data without the additional work of packing and unpacking the data. We discuss each of these in turn.

2.1. Communicators

The need for communicators became apparent when developers tried to address the complexity of parallel programming by encapsulating frequently occurring constructs in libraries. For example, most vendors provided libraries of functions for collective communications [nCUBE,Pierce], and a few researchers worked on providing libraries for common mathematical operations such as matrix multiplication [Choi,vandeGeijn]. One difficulty that developers encountered was that they had no ready means of guaranteeing that messages sent by the library were not intercepted by user-defined functions or vice-versa. Essentially the only means for assuring that messages were not incorrectly received was the tag or message-type.

Recollect that a tag is simply an integer argument that is passed to a communication function and that can be used to uniquely identify a message. For example, in MPI if process A sends a message to process B, then in order for B to receive the message, the tag used in A's call to MPI_Send must be the same as the tag used in B's called to MPI_Recv. Thus, if the characteristics of two messages sent by A to B are identical (i.e., same count and datatype), then A and B can distinguish between the two by using different tags.

For example, suppose A is sending two floats, x and y, to B. Then the processes can be sure that the values are received correctly, regardless of the order in which A sends and B receives, provided different tags are used:

    /* Assume system provides some buffering */
    if (my_rank == A) {
        tag = 0;
        MPI_Send(&x, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
        . . .
        tag = 1;
        MPI_Send(&y, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
    } else if (my_rank == B) {
        tag = 1;
        MPI_Recv(&y, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
        . . . 
        tag = 0;
        MPI_Recv(&x, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
    }

Now if one message from process A to process B is being sent by the library, and another, with identical characteristics, is being sent by the user's code, unless the library developer insists that user programs refrain from using certain tag values, this approach cannot be made to work. Clearly, partitioning the set of possible tags is at best an inconvenience: if one wishes to modify an existing user code so that it can use a library that partitions tags, each message passing function in the entire user code must be checked. If a user wishes to use two libraries in a single program, unless he or she has the source code for the libraries, this method cannot work.

Since it is commonly believed that one solution to the problem of the complexity of parallel programming is the use of libraries, this posed a difficult problem for the MPI Forum. The solution that was ultimately decided on was the communicator. Formally, a communicator is a pair of objects: the first is a group or ordered collection of processes, and the second is a context, which can be viewed as a unique, system-defined tag. Every communication function in MPI takes a communicator argument, and a communication can succeed only if all the processes participating in the communication use the same communicator argument. Thus, a library can either require that its functions be passed a unique library-specific communicator, or its functions can create their own unique communicator. In either case, it is straightforward for the library designer and the user to make certain that their messages are not confused.

For example, suppose now that the user's code is sending a float, x, from process A to process B, while the library is sending a float, y, from A to B:

    /* Assume system provides some buffering */

    void User_function(int my_rank, float* x) {
        MPI_Status status;
        if (my_rank == A) {
            /* MPI_COMM_WORLD is pre-defined in MPI */
            MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
        } else if (my_rank == B) {
            MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
        }
        . . .
    }

    void Library_function(float* y) {
        MPI_Comm library_comm;
        MPI_Status status;
        int my_rank;

        /* Create a communicator with the same group  */
        /* as MPI_COMM_WORLD, but a different context */
        MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
        
        /* Get process rank in new communicator */
        MPI_Comm_rank(library_comm, &my_rank);

        if (my_rank == A)
            MPI_Send(y, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
        } else if (my_rank == B) {
            MPI_Recv(y, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
        }
        . . .
    }

    int main(int argc, char* argv[]) {
        . . .
        if (my_rank == A) {
            User_function(A, &x);
            . . .
            Library_function(&y);
        } else if (my_rank == B) {
            Library_function(&y);
            . . .
            User_function(B, &x);
        }
        . . .
    }

2.2. User-defined datatypes

Prior to the advent of MPI, it was necessary in many parallel systems to send noncontiguous data or data composed of different types in separate messages or to first pack the data into a single message and then unpack the data when it was received. Indeed, this may have been necessary even if the communications subsystem allowed applications to gather data when a message was sent and to scatter data when a message was received. The second main innovation in MPI, user-defined datatypes, allows programmers to exploit this power, and as a consequence, to create messages consisting of logically unified sets of data rather than only physically contiguous blocks of data.

Loosely, an MPI datatype is a sequence of displacements in memory together with a collection of basic datatypes (e.g., int, float, double, and char). Thus, an MPI-datatype specifies the layout in memory of data to be collected into a single message or data to be distributed from a single message. For example, suppose we specify a sparse matrix entry with the following definition.

    typedef struct { 
        double entry;
        int row, col;
    } mat_entry_t;

MPI provides functions for creating a variable that stores the layout in memory of a variable of type mat_entry_t. One does this by first defining an MPI datatype
    MPI_Datatype mat_entry_mpi_t; 

to be used in communication functions, and then calling various MPI functions to initialize mat_entry_mpi_t so that it contains the required layout. Then, if we define
    mat_entry_t x;

we can send x by simply calling
    MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag, comm);

and we can receive x with a similar call to MPI_Recv.

3. Why MPI

There are currently two standards for programming parallel systems that were developed in open forums: High Performance Fortran (HPF) [HPF] and MPI [MPI]. HPF relies on advanced compiler technology to expedite the development of data-parallel programs. Thus, although it is based on Fortran, HPF is a new language, and hence requires the construction of new compilers. As a consequence each implementation of HPF is, to a great extent, hardware specific, and until recently there were very few complete HPF implementations. Furthermore most of the current implementations are proprietary and quite expensive.

On the other hand, MPI, as we noted earlier, specifies a library of extensions to C and Fortran that can be used to write message passing programs. So an implementation of MPI can make use of existing compilers, and it is possible to develop more-or-less portable MPI libraries. Thus, unlike HPF, it is relatively easy to find an MPI library that will run on existing hardware. Indeed, there are at least three implementations [Gropp,OSC,Bruce] that run on networks of Unix workstations, and there are two implementations [Meyer,Marinho] that run on Windows PC's. Further, all of these implementations can be freely downloaded from the internet, and all but one of them comes with source code.

Perhaps the most important difference between HPF and MPI lies in the fact that HPF has been written for the express purpose of writing data-parallel programs, and, as a consequence it is not well-suited for dealing with irregular data-structures or control-parallel programs. Message passing, on the other hand, is a completely general method for parallel programming. Indeed, the generality and ready availability of MPI have made it one of the most widely used systems for parallel programming.

Ideally, a parallel computing class would introduce students to both HPF and MPI. However, it is a truism that learning to program parallel systems is extremely difficult, and we believe that attempting to learn two essentially different methodologies in one semester or one quarter will, especially for the weaker students, make it virtually impossible to learn either. Thus in view of the difficulties involved in obtaining an HPF compiler and the generality of MPI, we have chosen to use MPI.

4. Using MPI in Parallel Algorithms

The parallel algorithms class at USF is a semester-long class for upper-level undergraduates and beginning graduate students. In it we used a draft version of the MPI text [Pacheco97] as a supplement to a parallel algorithms text [Kumar]. Since our computer science students learn to program in C++, all the code that we present in the lectures uses C rather than Fortran.

The course begins with an overview of parallel computing (chapter 2 in [Pacheco97]) and continues with a brief introduction to message passing and MPI. The next block of lectures forms a transition into a more or less standard parallel algorithms course. We first discuss serial and parallel versions of a very simple computation -- e.g., dot product. In the course of analysing the performance of these algorithms, we develop the concepts of speedup and efficiency. The deterioration of the performance of the parallel algorithm as the number of processes is increased leads naturally into a discussion of Amdahl's Law and scalability. The remainder of the lectures are devoted to the development and analysis of a variety of standard parallel algorithms from linear algebra, some algorithms for searching and sorting, and graph algorithms. See [Pacheco96] for further details.

Much of the detailed introductory material on MPI is introduced in the lab through the tutorial chapters (3 - 7) in [Pacheco97]. More advanced special functions from MPI that facilitate implementation of the algorithms are introduced in the lectures as necessary. For example, in the discussion of dynamic load balancing, we discuss nonblocking communication and MPI_Iprobe, which can be used to test for the arrival of unsolicited messages.

4.1. Coursework

The coursework consists of a midterm, a final exam and four programming assignments. The purpose of the first assignment is simply to acquaint students with the system; in it, they write a ping-pong code, and estimate the bandwidth and latency of the system. The second and third assignments are more substantial. In the second, the students implement Jacobi's method for solving a linear system, and in the third, the students implement parallel bitonic sort. The last assignment is to be a project of the student's choosing. However, most students ask for suggestions, and most implement a version of parallel tree search that uses dynamic load balancing. See [Pacheco96] for further details.

The students develop their programs on a network of SGI's and an nCUBE 2 with a Sun workstation host. Both systems run the mpich [Gropp] implementation of MPI. Although both the LAM [OSC] and Chimp [Bruce] implementations are available on the network of SGI's, the students are encouraged to use mpich since it is the only implementation that allows full access to stdio from all processes. This seems to outweigh the benefit of access to better debugging facilities with the LAM implementation, since it seems that most students don't make use of any debugging facilities beyond printf and fflush. Furthermore, when they do try to use debuggers, they frequently get confused by the details of how to use them.

4.2. Results

Except for the first assignment, the students found the programs to be quite difficult. Most of the difficulties, however, had little to do with MPI. We had taught similar classes twice previously, and in both of the previous classes we used the nCUBE library of message-passing functions. In both the MPI-based class and the nCUBE-based classes, the students had little difficulty understanding the use of basic point-to-point and collective communication functions. Rather the most serious difficulties that students had with actually writing code were the differences between C and C++. Virtually every student in the class needed help with parameter passing in C: it seems that since C++ allows reference parameters, students had never learned about passing pointers.

The two aspects of MPI that caused the most difficulty were its prohibition on aliasing of arguments and the use of user-defined MPI datatypes. Roughly, the prohibition on aliasing of arguments in MPI says that a single variable cannot be passed as an argument to two distinct formal parameters of an MPI function. For example, if one wishes to distribute an array of floats stored on process 0 among all the processes, the following function call is illegal in MPI:

    float x[MAX];
    MPI_Scatter(x, MAX, MPI_FLOAT, x, MAX/num_procs, MPI_FLOAT,
        0, MPI_COMM_WORLD;

MPI requires that distinct arguments be passed to the first and fourth formal parameters.

User-defined datatypes seem to cause difficulty because the rules for type-matching in MPI are fairly involved, and they increase the complexity of message-passing. Getting user-defined datatypes to match adds one more potential problem to the already difficult task of sending messages.

This last point should be stressed: the greatest difficulty the students encountered didn't involve details of syntax; rather their greatest difficulty was with the logic of handling message-passing.

5. Using MPI in a Project Class

The applied mathematics research laboratory [Pacheco94] at USF is a class in which seniors and beginning graduate students spend a year working on an industry or government sponsored project. In the academic year 1994-1995 IBM and nCUBE sponsored a project in which the students worked on parallelizing the speech recognition program Sphinx [Lee]. Previous projects have involved parallelizing circuit simulation programs and developing parallel libraries for the iterative solution of sparse linear systems.

The first semester was organized as a seminar. It began with lectures to the students on the basics of signal processing, speech recognition, and parallel computing. During the remainder of the semester, the students took turns giving presentations on more specialized aspects of these topics. During the second semester, there were no formal classes. The students were given a broad outline for attacking the project, and they divided into groups to work on various aspects of the project. The entire group met once or twice a week to discuss problems and progress. See [Pacheco94] for details.

The vast majority of the work during the first two-thirds of the second semester involved programming. First, the students worked on understanding the serial code, and then they worked on parallelizing it. During the final third of the second semester, most of the students moved from working on the code to developing a multimedia presentation on the project for the sponsors and the University.

5.1. Programming and MPI

In order to learn about parallel programming, during the first semester the students were assigned two small, individual programming projects. For the first they parallelized the trapezoidal rule and for the second they parallelized a brute-force solution to the travelling salesman problem. Before the first assignment, there was a lecture on those parts of MPI that they would need for the assignment, and for the second assignment they were given a preliminary version of A User's Guide to MPI [Pacheco95]. See [Pacheco94] for further details.

When the students started working on parallelizing Sphinx during the second semester, their main resources for information about MPI were the instructor, the User's Guide and the on-line MPI man pages. The students developed the program on an nCUBE 2 running the mpich implementation [Gropp] of MPI. All coding was done in C. The code itself involved a parallel tree search with a static partitioning of the tree.

5.2. Results

The students' difficulties with the technical details of programming were similar to those encountered in the parallel algorithms class. They had difficulty with the differences between parameter passing in C and C++, the prohibition on aliasing in MPI, and the use of user-defined MPI datatypes.

The project itself was highly successful. The students obtained nearly linear speedup with 16 or fewer processors, and after seeing the final presentation, the sponsors were so impressed that they said that they would be happy to offer jobs to all of the students.

6. How MPI-2 Addresses Problems with MPI

The two biggest drawbacks to using MPI in a class are the fact that programming must be done in C or Fortran and the fact that the standard fails to specify any I/O capabilities. As any teacher of computer science knows, C and Fortran are not very good languages for learning to program. Further, the fact that C++ is an extension of C leads the unwary to believe that if they know how to program in C++, they also know how to program in C -- a conclusion which we found to be unwarranted in every case.

The fact that MPI failed to specify I/O means, unfortunately, that any MPI program that carries out I/O will not be portable. We alluded to this problem earlier, when we mentioned that neither the Chimp [Bruce] implementation nor the (earlier) LAM [OSC] implementation allow every process access to stdio. Fortunately, this problem was somewhat mitigated by the fact that the mpich implementations that we used do allow each process full access to stdio.

There are, of course, other objections to using MPI: it makes no provision for dynamic process management, and hence dynamic client-server applications are poorly suited to MPI. It also makes no provision for one-sided communications, and one-sided communications are currently showing great promise in reducing the high cost of communication.

Fortunately, most of these problems are being addressed in MPI-2 [MPI-2]. After the introduction of MPI, the MPI Forum continued its work on developing a portable standard for parallel computing. Its main purpose is to address deficiencies in the current standard. It will include specifications for parallel I/O, one-sided communications, and dynamic process management. It will also include C++ bindings for all of the MPI functions. The final MPI-2 specification is scheduled for completion by June, 1997. So by the fall of 1997, it should be possible to include many of these features in an MPI-based parallel computing class.

7. Conclusions

In general students seem to like MPI. In both classes, they made extensive use of its collective communication functions: they wanted to use them, even when it meant that they would have to learn about complex user-defined datatypes. They seemed to find the concept of a communicator quite natural and the details of using communicators in programs caused few problems. Those students that experimented with more esoteric functions, such as nonblocking communication and MPI_Iprobe, didn't seem to have much difficulty. A notable exception to this was the use of user-defined MPI datatypes. Although all of the students eventually used them successfully, it seems that they are peculiarly susceptible to extremely obscure bugs. There is some software designed to help with use of user-defined datatypes [May]. In the future we plan to make this software available to the students.

MPI can be readily incorporated into the curriculum of any parallel computing class that uses message-passing programming. Since code, development tools, and methods for developing programs in MPI are all portable, its incorporation should greatly benefit both students and instructors. The fact that there are freely available, high-quality implementations of MPI that run on virtually all platforms makes it very practical to use MPI in parallel computing classes, and when these observations are combined with the fact that MPI-2 addresses the main limitations to MPI, we have no hesitation in strongly endorsing its use in any parallel computing class.

References

[Bruce]
R. Alasdair Bruce, James G. Mills, and A. Gordon Smith. Chimp/MPI User Guide. Online document available as URL ftp://ftp.epcc.ed.ac.uk/pub/chimp/release/doc/user.ps.Z. 1994.
[Choi]
J. Choi, J. Dongarra, R. Pozo, and D. Walker. LAPACK Working Note 55. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. Online document available as URL http://www.netlib.org/lapack/lawns/lawn55.ps. 1992.
[Gropp]
William Gropp and Ewing Lusk. User's Guide for mpich, a Portable Implementation of MPI. Online document available as URL ftp://info.mcs.anl.gov/pub/mpi/userguide.ps.Z. 1997.
[HPF]
High Performance Fortran Forum. High Performance Fortran Language Specification, version 1.1. Online document available as URL http://www.crpc.rice.edu/HPFF/hpf1/hpf-v11/hpf-report.html. November, 1994.
[Kumar]
Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, 1994.
[Lee]
Kai-Fu Lee. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic Publishers, 1989.
[Marinho]
Jose Marinho. ``W32MPI Home Page.'' Online document available as URL http://dsg.dei.uc.pt/~fafe/w32mpi.
[May]
John May. MPIMap User's Guide. Online document available as URL ftp://coral.llnl.gov/pub/mpimap/MPIMap.tar.Z. December, 1995.
[MPI]
Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, version 1.1. Online document available as URL http://www.mcs.anl.gov/mpi/mpi-report-1.1/mpi-report.html. June 12, 1995.
[MPI-2]
__________. MPI-2: Extensions to the Message-Passing Interface. Online document available as URL http://www.mcs.anl.gov/Projects/mpi/mpi2/mpi2-report/mpi2-report.html. 1997.
[Meyer]
Joerg Meyer and Hesham El-Rewini. WinMPI v0.99b. Online document available as URL ftp://csftp.unomaha.edu/pub/rewini/WinMPI/thesis.ps.Z. July, 1995.
[nCUBE]
nCUBE Corporation. nCUBE 2 Programmers Guide, release 3.2. 1993.
[OSC]
Ohio Supercomputer Center. MPI Primer/Developing with LAM. Online document available as URL ftp://ftp.osc.edu/pub/lam/lam61.doc.ps.Z. 1997.
[Pacheco94]
Peter Pacheco. ``Course Materials from the Applied Math Lab.'' Online document available as URL http://www.usfca.edu/mpi/fpcc/project.tar.Z. 1994-1995.
[Pacheco95]
__________. A User's Guide to MPI. Online document available as URL ftp://math.usfca.edu/pub/MPI/mpi.guide.ps.Z. March, 1995.
[Pacheco96]
__________. ``Course Materials from Parallel Algorithms.'' Online document available as URL http://www.usfca.edu/mpi/fpcc/algorithms.tar.Z. 1996.
[Pacheco97]
__________. Parallel Programming with MPI. Morgan Kaufmann Publishers, 1997. Supplementary materials online at URL http://www.usfca.edu/mpi.
[Pierce]
Paul Pierce. ``The NX/2 Operating System.'' Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, ACM, 1988, pp. 384 - 390.
[vandeGeijn]
Robert van de Geijn and Jerrell Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. University of Texas, Department of Computer Sciences Technical Report TR-95-13, April, 1995.