In our work on out-of-core columnsort, we observed that each pass was actually a software pipeline, in which stages mapped to threads and buffers traverse the pipeline. We found that programming the details of thread creation and destruction, as well as the coordination among threads (via semaphores), was tedious and difficult to debug, however.
To simplify the task of developing asynchronous software pipelines that pass buffers among stages, we developed middleware that we call Asynchronous Buffered Computation Design and Engineering Framework Generator (ABCDEFG), or FG for short. In an FG pipeline, while one stage accesses high-latency data, such as data on disk, other stages can make progress.
Each stage of an FG pipeline is a synchronous function written in C++. A stage accepts a buffer from its predecessor stage, operates on that buffer, and conveys the buffer to its successor stage. Buffers come from a pool of buffers allocated by FG at the start of pipeline execution. FG-added source and sink stages recycle buffers back through the pipeline so that the buffer pool may be much smaller than the total amount of data being processed and so that the program can avoid the overhead inherent in repeated calls to new and dispose. (These C++ operations are particularly expensive in multithreaded programs, such as those produced by FG.)
The programmer specifies the pipeline structure by using basic features of C++. An object represents each stage. An object also represents each thread, but if each stage maps to its own thread, the programmer need not represent any threads; FG does so on its own. The pipeline structure appears in a null-terminated array of stage objects. The programmer also specifies which C++ function each stage runs, how many buffers to create, and the size of each buffer. FG handles the rest. It creates and destroys threads, runs the pipeline, and handles all synchronization between adjacent stages.
We have found that programs written in FG are smaller, simpler, and faster than their counterparts written with static asynchronous scheduling or even with explicit pthreads calls [CD04,DC05b].
Newer releases of FG provide even more capability [DC05a]. We include optimizations to improve performance. Stages may be replicated, either statically by the programmer or dynamically by FG itself. FG also now alters thread priorities to use resources more efficiently; again, this action may be initiated by either the programmer or FG. FG now also suits programs that do not fit a simple, linear pipeline structure. To extend the range of computations that fit into its framework, FG now incorporates fork-join and DAG structures. Not only do these structures allow for more programs to be designed for FG, but they also can enable significant performance improvements over linear pipeline structures.