Second Forum on Parallel Computing Curricula
Sunday, June 22, 1997
Newport, RI
One of the greatest challenges for parallel computing education is to bridge the great divide between parallel software and hardware. Current parallel computing education is marked by unrealistic algorithmic models at one extreme and a complete lack of models at the other. In order to unify a parallel computing curriculum, it would be desirable to focus on a model that can realistically guide the development of both software and hardware. We argue that the BSP model is a promising candidate for this role, and discuss both positive and negative experiences teaching a graduate-level special-topics course based on the BSP model.
Most parallel computing courses take one of two approaches. The first approach is to focus on theoretical aspects and investigate a work-depth or PRAM framework for algorithm analysis. Unfortunately, since these abstract models typically have an unrealistically optimistic view of communication costs, the practical significance of this strategy, at least in the short term, is questionable. The second approach is to focus on efficient algorithms for existing machines. This also has its drawbacks, as instructors and students are likely to have the disconcerting feeling that the resulting algorithms are not fundamental and will have little value once the next parallel architecture comes along.
One of the fundamental goals of parallel computing is to settle on a ``unifying'' model of a parallel machine that provides a pragmatic compromise between technological reality and programmability. Such a model could serve as the foundation for the development of a cohesive parallel computing curriculum. Parallel computing could be incorporated in a well-justified way into standard architecture, algorithm, and language courses.
We argue that the BSP model [Val1 90] is a promising candidate for such a unifying model. More specifically:
A BSP computer consists of a set of processors, each with its own local memory, and an interconnection network that can route packets of some fixed size between processors. The computation is divided into supersteps. In each superstep, a processor can perform operations on local data, send packets, and receive packets. A packet sent in one superstep is delivered to the destination processor at the beginning of the next superstep. Consecutive supersteps are separated by a barrier synchronization of all processors.
The communication time of an algorithm in the BSP model is given by a simple cost function. The three basic parameters that model a parallel machine are: (i) the number of processors p, (ii) the gap g, which reflects network bandwidth on a per-processor basis, and (iii) the latency L, which is the minimum duration of a superstep, and which reflects the latency to send a packet through the network as well as the overhead to perform a barrier synchronization.
Then the execution time for superstep is given as:
where w is the "depth" of the superstep (the largest amount of local computation performed) and h the largest number of packets sent or received by any processor during the superstep. (This routing problem is called an h-relation.) The total execution time for the program is the sum of all the superstep times.
Efficient programming of a BSP machine, therefore, is based on several simple principles. To minimize the execution time, the programmer must (i) minimize the depth of the program, (ii) minimize the maximum number of packets sent or received by any processor in each superstep, and (iii) minimize the total number of supersteps in the program. In practice, these objectives can conflict, and trade-offs must be made. The correct trade-offs can be selected by taking into account the particular g and L parameters of the underlying machine.
A number of other models for general-purpose parallel computing have been proposed in recent years. Maggs, Matheson, and Tarjan provided an interesting survey and analysis [MMT 95]. One model based on asynchronous message passing is the LogP model [CKP+ 93]. LogP models the performance of point-to-point messages with three parameters representing software overhead, network latency, and communication bandwidth. The LogP model has been used as a performance model for Active Messages [VCGS 92] and the Split-C language [CDG+ 93], where it has been applied to the analysis of several algorithms. A quantitative comparison of the BSP and LogP models can be found in [BHP+ 96]. This study concludes that the two models are substantially equivalent in terms of asymptotic analysis, but that BSP may be considered somewhat preferable due to its simpler programming environment.
Other related models are the Postal Model [BK 92], the Atomic Model [LAB 93], and several models for end-point contention (e.g., see [ABK 95]) inspired by the prospect of optical communication in parallel machines. Like BSP and LogP, these models do not refer to the topology of the underlying machine, but assume that the interconnection network behaves essentially like a completely connected network, with the only contention arising at the processor-network interface.
Although the long-term goal is incorporation of parallel computing based on a unifying model throughout the computer science curriculum, early pedagogical research should investigate the use of the unifying model in a much smaller context. The rest of this paper describes the BSP model in more detail and discusses the successes and failures of a graduate-level course taught at the University of Central Florida, entitled ``Special Topics: The BSP Model for Portable & Efficient Parallel Computing.'' This course was taught in Fall 1995 and again in Fall 1996.
The BSP model accommodates a variety of programming approaches. It has been shown at least in theory that BSP machines can efficiently emulate a number of other models of parallel computation, including several variants of the PRAM model. A related approach would be to compile code from a high-level parallel language (such as Nesl [Ble 96]) down to BSP code for efficient execution on realistic parallel machines. Such approaches are categorized as automatic programming. An alternative is to use direct programming: programming BSP machines directly. Direct programming of BSP machines is structured but quite user-friendly. In terms of programming, therefore, BSP incorporates much of the established approaches and directs us towards additional areas of interest.
Using automatic programming, programs written for the PRAM model are efficiently simulated on a BSP computer. In the case of PRAM algorithms, the key to efficient simulations is to write PRAM programs that assume more (virtual) PRAM processors than there are (physical) BSP processors. This is called exploiting parallel slackness. The fact that efficient PRAM simulations are possible is clearly an important argument in support of BSP, considering the level of acceptance already enjoyed by the PRAM model for algorithm development [Vis 92].
Automatic programming has certain practical limitations, however. Most current machines with large numbers of processors do not have sufficient bandwidth. Specifically, efficient simulations of PRAM algorithms require g to have a value close to 1 time step where a time step is defined as the time to do a single local operation. (Intuitively, when g=1 the cost of communication is the same as the cost of a local operation.) Furthermore, taking advantage of parallel slackness requires improved hardware support for fast context switching. Fortunately, there is no reason to believe that these technical limitation can not be surmounted by future architectures.
When using direct programming the programmer encodes an algorithm directly for a BSP machine. Examples of direct BSP algorithms can be found in [GV 94]. Direct programming has the advantage of providing greater efficiency, but puts more of a burden on the programmer, who must now take into account the costs of communication and synchronization. Still, direct BSP programming is fairly straightforward, and existing BSP libraries have very clear semantics. Here we discuss BSPlib, a recently proposed standard library [HMS+ 97]. BSPlib implementations are available through the Oxford BSP Toolset [Hil 96]. Numerous platforms are supported, including SunOS, Solaris, Linux, Cray T3D, SGI Challenge, and IBM SP2. Students can develop BSP code on a workstation, and the code can be ported to a parallel platform, or analyzed using the toolset's profiling tools. The profiling tools can be used to indicate the communication pattern and communication time as well as the overall computation time. Indications are that these tools and others will continue to be supported and refined.
BSPlib uses an SPMD format and
supports two modes of communication: Bulk Synchronous Message
Passing (BSMP) and Direct Remote Memory Access (DRMA).
BSMP uses explicit sending and receiving of messages, where all messages sent
in superstep i are received during superstep i + 1. Below is a
brief example. Each message consists of a tag and and a payload. In this
example tag size is set to the size of an integer. Payload size can vary with
each message.
In this code fragment, process 0 sends a message with an
integer tag (33) and
a payload with two doubles (1.8 and -2.9) to
process 1.
The bsp_send call sends the message. After the barrier
synchronization call (bsp_sync) the message can be received at
the destination process by first accessing the tag (bsp_get_tag)
and then transferring the payload to the destination
(bsp_move).
The status variable used in the bsp_get_tag call
returns the length of the payload.
The
bsp_move call also serves to flush this message from the input
buffer.
/* Code fragment demonstrating BSMP */
int pid;
int tag;
int status;
double payload[2];
...
if (pid == 0) {
tag = 33;
payload[0] = 1.8;
payload[1] = -2.9;
bsp_send(1, &tag, &payload, 2 * sizeof(double));
}
bsp_sync(); /* barrier synchronize */
if (pid == 1) {
bsp_get_tag(&status, &tag);
bsp_move(&payload, status);
}
For DRMA, processes can directly write into the memory of another processor
(that is, messages do not have to be explicitly received at the
destination).
Memory locations available for remote access must first be registered using
bsp_register. Then bsp_put can be used to transfer
data between registered memory locations. (A bsp_get function is also
available). A registered area cannot be used
until the following superstep, and data transfers do not occur until the
end of the superstep.
/* Code fragment demonstrating DRMA */
int pid;
double payload[2];
...
bsp_register(&payload[0]);
bsp_sync(); /* barrier synchronize */
if (pid == 0) {
payload[0] = 1.8;
payload[1] = -2.9;
bsp_put(1, &payload[0], &payload[0], 0, 2 * sizeof(double));
}
bsp_sync(); /* barrier synchronize */
The BSP model also provides unambiguous and practical design goals for architects. The routing of h-relations should be efficient (i.e., g should be small) and barrier synchronization should be efficient (i.e., L should be small). In addition, fast context switching is important for efficient PRAM simulations. All of these topics have been addressed in the literature.
On the topic of h-relations, numerous papers describe theoretical and experimental results for a variety of platforms. Some results for routing h-relations on switching networks (switches independent of processors) include:
For routing h-relations on communication networks (processors are switches), results include:
Other approaches require that the h-relation be routed in time that is optimal to within a factor of 1 + o(1). These are called ``1-optimal.'' Under BSP, the goal would be to design routing algorithms that minimize latency while essentially not sacrificing any bandwidth.
Many of these approaches assume the network can efficiently perform total-exchange routing, the routing problem for which each processor has one packet to send to every other processor. An h-relation would ideally require only h/p total-exchange rounds. Some results assuming random h-relations include:
In summary, though BSP may be viewed as an abstract model, it provides achievable design goals for architecture designers. These include support for efficient h-relation routing, barrier synchronization, and context switching.
The long-term goal of a cohesive parallel computing curriculum is predicated upon the eventual acceptance of some unifying model. BSP is just one contender, albeit one with promise. In Fall 1995 and Fall 1996, a course was taught at the University of Central Florida entitled ``Special Topics: The BSP Model for Portable & Efficient Parallel Computing.'' Though the primary purpose of this course was to introduce students to research on parallel computing models, it also serves as an interesting test-bed to examine the current strengths and weaknesses of BSP as an education tool. We highlight some of these important issues, both positive and negative, in this section.
BSP does not enjoy widespread acceptance at the present time. Furthermore, most of the available materials concerning BSP have not been simplified to the level of a tutorial. It was therefore felt that a special-topics course was appropriate. The course was run as a research discussion group, with students reading papers and leading discussions in turn. Grading was based on class participation and a project. The students were mainly graduate level. Previous experience with parallel computing was helpful but not required, and a certain facility with both algorithms and architectures was necessary.
In retrospect, this format was found to be more appropriate for the advanced students, and less appropriate for new graduate students and for undergraduates, who sometimes had difficulty gleaning the significance of the research papers.
One of the best aspects of the course was the fact that students were encouraged to formulate their own opinions on the extremely important open questions. The motivation for the course - the need for a unifying model - was easily recognized and accepted by the vast majority of students, particularly those with some experience programming actual parallel machines. However, it is clear that arriving at an appropriate unifying model is a difficult task. The reality of the situation is that theoretical and even experimental support for any particular model is not sufficient. Politics, previous experience, personal taste, and marketing inevitably play significant roles.
Throughout the course a variety of parallel models were discussed, including PRAM [FW 78], LogP [CKP+ 93], and DRAM [LM 88]. Although the course was biased toward the BSP model, it is interesting to note that by the end of the course the students left with widely differing opinions. Along with those who considered BSP a reasonable compromise, there were those who firmly supported the PRAM approach and those who believed machine-specific code was the only practical path.
One negative factor resulting from a focus on the the BSP model is that the model is still far from achieving widespread acceptance. The arguments in support of BSP are subtle and in many cases controversial. Inevitably, some existing parallel architectures and programming approaches will not be a good match for BSP (or any other proposed unifying model). Furthermore, at this time we are not aware of any architectures implemented specifically to support BSP computing. In particular, although the h-relation problem has been the subject of numerous theoretical papers, it has not had much influence on network implementations.
Despite these drawbacks, it is important to recognize that even if it is not viewed as an acceptable unifying model, BSP emphasizes principles of parallel computing that are valid for a range of realistic parallel platforms. For example, it promotes balancing and minimizing computation depth and communication, as well as latency hiding by bundling communication.
Certain resources were available as teaching aids. A useful introduction to BSP has been written by Skillicorn, Hill, and McColl [SHM 96]. For programming BSP code, the Fall 1995 course used the Green BSP Library [GLRT 95] while the Fall 1996 course used BSPlib [HMS+ 97]. The BSPlib implementations of Jon Hill [Hil 96] are particularly useful, in that they provide profiling and performance prediction tools.
It is not necessary to have access to a parallel machine to write BSP code and analyze its performance. BSPlib implementations exist for SunOS, Solaris, Linux, SGI, and other operating systems for single-processor platforms. It is possible to measure the depth, w, and h-relation size, h, of each superstep. If the g and L values of some parallel platform are known, the execution time can be predicted. Of course, the code can then be transported to actual parallel machines with BSPlib implementations.
Despite these resources, instructional material concerning BSP is scarce, and indeed this is probably the greatest weakness for any BSP educational program at this moment. There is a need for a textbook-level description of direct-BSP algorithms. Several are briefly described in [Val1 90] (e.g., matrix multiply, broadcast, FFT, sorting), and other algorithms are described in a variety of other papers (e.g., [GV 94]), but these have not been refined to a level appropriate for instructional purposes, particularly for an undergraduate-level course.
Much of the grading was based on student projects. Group projects were acceptable, but in most cases projects were done individually. A wide range of project topics are possible, since BSP relates to algorithms, architectures, and languages. Students were almost always able to choose topics closely related to their research interests.
Examples of project topics include:
In general, those projects that focused on the implementation of a BSP program were implemented on a single-processor workstation, and parameterized to some extent. In a few cases, the students progressed rapidly enough to execute their programs on actual parallel platforms before the end of the semester. Examples of machines that were used include a BBN Butterfly GP1000, a Maspar, and an SGI multiprocessor. Most of the results of such experiments could only be described as preliminary, since detailed knowledge of the underlying machine is generally necessary to fairly describe the experimental procedure. Still, it would appear that incorporating relatively high-quality experiments will become commonplace as BSP libraries standardize and library implementations mature.
The BSP model can potentially serve as the basis for a cohesive parallel computing curriculum. BSP can be viewed as a reasonable compromise between the needs of programmers, architects, and theorists. It allows for a variety of programming approaches and a variety of platform implementations. The BSP cost model provides well-defined and theoretically tractable design goals for both programmers and architects. The model has been shown to be of practical significance in the short term, while providing for an eventual evolution toward efficient simulation of PRAM algorithms.
Some difficulties using BSP as an educational tool at this time include the relative obscurity of the model and the need for a more refined infrastructure. It is hoped that BSPlib can serve as a relatively stable basis for homeworks, labs, and projects. However, there is a clear need for textbook-level descriptions of BSP algorithms that are suitable for instructional purposes. Furthermore, the fact that BSP has had little impact on current architectures hinders its utility in a parallel architecture course except in a theoretical context.
It is conjectured that BSP would also be suitable for undergraduate education. The programming style required for direct BSP programs, based on supersteps, is highly structured. The program design goals, based on balancing and minimizing both computation and communication depth, are intuitive. Although the model assigns some cost to communication, it hides from the user many of the low-level details of the communication network such as topology and contention avoidance.
The author wishes to gratefully acknowledge the anonymous reviewers. Their comments induced substantial revision and improvement.