Prior to the start of this award, we had begun to investigate methods for out-of-core sorting on distributed-memory clusters. During the award period, we made great strides in designing and implementing oblivious algorithms for out-of-core sorting. An oblivious sorting algorithm is one in which the sequence of operations is independent of the values being sorted. The advantage of an oblivious algorithm in out-of-core sorting is that the entire sequence of disk I/O and interprocessor communication operations is known in advance; hence, an implementation can overlap computation, communication, and I/O in order to mitigate the costs of these operations.
We developed several out-of-core sorting algorithms based on the oblivious columnsort algorithm of Leighton [Lei85]. Our first implementation [CCW01] used asynchronous I/O calls and static scheduling to overlap I/O with computation and communication. It sorted in four passes over the data, subject to an upper limit on the number of records being sorted. (Each pass reads each record once and writes each record once.)
Our next implementations [CC02] used the standard pthreads interface to fully redesign the structure of each pass. We constructed each pass as multiple stages of a software pipeline, passing buffers from stage to stage. Each stage was a thread. This structure offered two significant advantages over the first implemention. One was that the structure of the code simplified because each stage operated synchronously rather than asynchronously. The second advantage was that computation, communication, and I/O operations were scheduled dynamically, rather than statically, by the thread scheduler. The resulting code was both simpler and significantly faster than the first implementation. In the same paper, we showed how to reduce the number of passes from four down to three, with no change in the upper limit on problem size.
Subsequent work showed how to increase the upper limit on the number of records to sort. We did so by two different approaches. One was an algorithm-engineering view, in which we took a different interpretation of what a column in columnsort comprises. The second approach was an algorithmic approach, in which we devised several modifications to columnsort to relax the problem-size bound on the algorithm itself, with the relaxed bound carrying over directly to the out-of-core adaptation. This work appears in [Cha04,CC03,CC04,CCH,CHC03].