abali:ibm370:
Bülent Abali, Bruce D. Gavril, Richard W. Hadsell, Linh Lam, and Brion Shimamoto. Many/370: A parallel computer prototype for I/O intensive applications. In Proceedings of the Sixth Annual Distributed-Memory Computer Conference, pages 728-730, 1991.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Describes a parallel IBM/370, where they attach several small 370s to a switch, and several disks to each 370. Not much in the way of striping.

abawajy:scheduling:
J. H. Abawajy. Performance analysis of parallel I/O scheduling approaches on cluster computing systems. In Workshop on Parallel I/O in Cluster Computing and Computational Grids, pages 724-729, Tokyo, May 2003. Carleton University, IEEE Computer Society Press. Organized at the IEEE/ACM International Symposium on Cluster Computing and the Grid 2003.

Abstract: As computation and communication hardware performance continue to rapidly increase, I/O represents a growing fraction of application execution time. This gap between the I/O subsystem and others is expected to increase in future since I/O performance is limited by physical motion. Therefore, it is imperative that novel techniques for improveing I/O performance be developed. Parallel I/O is a promising approach to alleviating this bottleneck. However, very little work exist with respect to scheduling parallel I/O operations explicitly. In this paper, we address the problem of effective management of parallel I/O in cluster computing systems by using appropriate I/O scheduling strategies. We propose two new I/O scheduling algorithms and compare them with two existing scheduling Approaches. The preliminary results show that the proposed policies outperform existing policies substantially.

Keywords: parallel I/O, I/O scheduling algorithms, pario-bib

abello:dimacs:
James Abello and Jeffrey Scott Vitter, editors. External Memory Algorithms and Visualization. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society Press, Providence, RI, 1999.

Keywords: parallel I/O, out-of-core algorithm, pario-bib

Comment: See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.

abello:graph:
James Abello, Adam L. Buchsbaum, and Jeffrey R. Westbrook. A functional approach to external memory graph algorithms. In Proceedings of the 6th Annual European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 332-343, Venice, Italy, August 1998. Springer-Verlag.

Keywords: out-of-core algorithm, graph, pario-bib

abu-safah:speedup:
Walid Abu-Safah, Harlan Husmann, and David Kuck. On Input/Output speed-up in tightly-coupled multiprocessors. IEEE Transactions on Computers, pages 520-530, 1986.

Keywords: parallel I/O, I/O, pario-bib

Comment: Derives formulas for the speedup with and without I/O considered and with parallel software and hardware format conversion. Considering I/O gives a more optimistic view of the speedup of a program assuming that the parallel version can use its I/O bandwidth as effectively as the serial processor. Concludes that, for a given number of processors, increasing the I/O bandwidth is the most effective way to speed up the program (over the format conversion improvements).

acharya:tuning:
Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeffrey K. Hollingsworth, Joel Saltz, and Alan Sussman. Tuning the performance of I/O intensive parallel applications. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 15-27, Philadelphia, May 1996. ACM Press.

Abstract: Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four applications achieve application-level I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and knowledge of future I/O requests. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance.

Keywords: parallel I/O, filesystem workload, parallel application, pario-bib

aggarwal:sorting:
Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, September 1988.

Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transfering $P$ blocks each containing $B$ records in a single time unit; the records in each block must be input from or output to $B$ contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for $P=1$ that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case $B = P = O(1)$.

Keywords: parallel I/O, sorting, pario-bib

Comment: Good comments on typical external sorts, and how big they are. Focuses on parallelism at the disk. They give tight theoretical bounds on the number of I/O's required to do external sorting and other problems (FFTs, matrix transpose, etc.). If $x$ is the number of blocks in the file and $y$ is the number of blocks that fit in memory, then the number of I/Os needed grows as $Θ (x \log x / \log y)$. If parallel transfers of $p$ blocks are allowed, speedup linear in $p$ is obtained.

agrawal:asynch:
Gagan Agrawal, Anurag Acharya, and Joel Saltz. An interprocedural framework for placement of asynchronous I/O operations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 358-365, Philadelphia, PA, May 1996. ACM Press.

Keywords: compiler, I/O, pario-bib

Comment: Not really about parallel applications or parallel I/O, but I think it may be of interest to that community. They propose a compiler framework for a compiler to insert asynchronous I/O operations (start I/O, finish I/O), to satisfy the dependency constraints of the program.

aguilar:graph:
Jose Aguilar. A graph theoretical model for scheduling simultaneous I/O operations on parallel and distributed environments. Parallel Processing Letters, 12(1):113-126, March 2002.

Abstract: The motivation for the research presented here is to develop an approach for scheduling I/O operations in distributed/parallel computer systems. First, a general model for specifying the parallel I/O scheduling problem is developed. The model defines the I/O bandwidth for different parallel/distributed architectures. Then the model is used to establish an algorithm for scheduling I/O operations on these architectures.

Keywords: parallel I/O, scheduling, pario-bib

ali:enhancing:
Zeyad Ali and Qutaibah Malluhi. Enhancing data-intensive applications performance by tuning the distributed storage policies. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'04, volume 3, pages 1515-1522, Las Vegas, NV, June 2004.

Abstract: This paper describes the performance improvements achieved by a data-intensive application by controlling the storage policies and algorithms of a distributed storage system. The Network Storage Manager (NSM) is a mass distributed storage framework with a unique architecture that provides applications with the high-performance features they need. It also provides the standard most commonly used implementation for storage policies. Distributed Terrain Viewer (DTViewer) is an application that utilizes NSM architecture and for efficient and reliable data delivery. Moreover, it exploits NSM controllable architecture by plugging-in its application-specific optimized implementations. DTViewer overrides the default NSM policies that do not understand its sophisticated access patterns, partitioning, and storage layout requirements. Experimental results have show significant improvement achieved when the application-tailored implementation are used. Such speedups are not achievable on storage systems with no application control such as the Parallel Virtual File System PVFS. (44 Refs.)

Keywords: application-specific storage policies, pario-app, DTViewer, access patterns, data layout, pario-bib

allcock:grid:
Bill Allcock, Joe Bester, John Bresnahan, Ann L. Chervenak, Ian Foster, Carl Kesselman, Sam Meder, Veronika Nefedova, Darcy Quesnel, and Steven Tuecke. Data management and transfer in high-performance computational grid environments. Parallel Computing, 28(5):749-771, May 2002.

Abstract: An emerging class of data-intensive applications involve the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data Grids provide essential infrastructure for such applications, much as the Internet provides essential services for applications such as e-mail and the Web. We describe here two services that we believe are fundamental to any Data Grid: reliable, high-speed transport and replica management. Our high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. Our replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. We present the design of both services and also preliminary performance results. Our implementations exploit security and other services provided by the Globus Toolkit.

Keywords: computational grid, data transfer, network, I/O, pario-bib

allen:cactus:
Gabrielle Allen, Tom Goodale, Joan Massó, and Edward Seidel. The cactus computational toolkit and using distributed computing to collide neutron stars. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 57-61, Redondo Beach, CA, August 1999. IEEE Computer Society Press.

Abstract: We are developing a system for collaborative research and development for a distributed group of researchers at different institutions around the world. In a new paradigm for collaborative computational science, the computer code and supporting infrastructure itself becomes the collaborating instrument, just as an accelerator becomes the collaborating tool for large numbers of distributed researchers in particle physics. The design of this "Collaboratory" allows many users, with very different areas of expertise, to work coherently together, on distributed computers around the world. Different supercomputers may be used separately, or for problems exceeding the capacity of any single system, multiple supercomputers may be networked together through high speed gigabit networks. Central to this Collaboratory is a new type of community simulation code, called "Cactus". The scientific driving force behind this project is the simulation of Einstein's equations for studying black holes, gravitational waves, and neutron stars, which has brought together researchers in very different fields from many groups around the world to make advances in the study of relativity and astrophysics. But the system is also being developed to provide scientists and engineers, without expert knowledge of parallel or distributed computing, mesh refinement, and so on, with a simple framework for solving any system of partial differential equations on many parallel computer systems, from traditional supercomputers to networks of workstations.

Keywords: scientific application, grid, input/output, parallel-io, pario-bib

Comment: invited talk. They describe a computational toolkit (CACTUS) that allows developers to construct code modules (thorns) to plug into the core system (cactus flesh). The toolkit includes thorns for solving partial differential equations using MPI, parallel elliptic solvers, thorns for I/O using FlexIO or HDF5, and thorns for checkpointing. The talk showed results from a cactus code demo that ran at SC'98. The demo combined two tightly-connected supercomputers (one in Europe and one in America) using Globus to simulate the collision of two neutron stars.

alvarez:failures:
Guillermo A. Alvarez, Walter A. Burkhard, and Flaviu Cristian. Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 62-72. IEEE Computer Society Press, May 1997.
See also later version alvarez:bfailures.

Keywords: fault tolerance, RAID, disk array, parallel I/O, pario-bib

alvarez:jminerva:
Guillermo A. Alvarez, Elizabeth Borowsky, Susie Go, Theodore H. Romer, Ralph Becker-Szendy, Richard Golding, Arif Merchant, Mirjana Spasojevic, Alistair Veitch, and John Wilkes. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Transactions on Computer Systems, 19(4):483-518, November 2001.

Abstract: Enterprise-scale storage systems, which can contain hundreds of host computers and storage devices and up to tens of thousands of disks and logical volumes, are difficult to design. The volume of choices that need to be made is massive, and many choices have unforeseen interactions. Storage system design is tedious and complicated to do by hand, usually leading to solutions that are grossly over-provisioned, substantially under-performing or, in the worst case, both.To solve the configuration nightmare, we present minerva: a suite of tools for designing storage systems automatically. Minerva uses declarative specifications of application requirements and device capabilities; constraint-based formulations of the various sub-problems; and optimization techniques to explore the search space of possible solutions.This paper also explores and evaluates the design decisions that went into Minerva, using specialized micro- and macro-benchmarks. We show that Minerva can successfully handle a workload with substantial complexity (a decision-support database benchmark). Minerva created a 16-disk design in only a few minutes that achieved the same performance as a 30-disk system manually designed by human experts. Of equal importance, Minerva was able to predict the resulting system's performance before it was built.

Keywords: disk array, storage system, RAID, automatic design, parallel I/O, pario-bib

alverson:tera:
Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Proceedings of the 1990 ACM International Conference on Supercomputing, pages 1-6, 1990.

Keywords: parallel architecture, MIMD, NUMA, pario-bib

Comment: Interesting architecture. 3-d mesh of pipelined packet-switch nodes, e.g., 16x16x16 is 4096 nodes, with 256 procs, 512 memory units, 256 I/O cache units, and 256 I/O processors attached. 2816 remaining nodes are just switching nodes. Each processor is 64-bit custom chip with up to 128 simultaneous threads in execution. It alternates between ready threads, with a deep pipeline. Inter-instruction dependencies explicitly encoded by the compiler, stalling those threads until the appropriate time. Each thread has a complete set of registers! Memory units have 4-bit tags on each word, for full/empty and trap bits. Shared memory across the network: ``The Tera ISP-level architecture is UMA, even though the PMS-level architecture is NUMA. Put another way, the memory looks a single cycle away to the compiler writer.'' - Burton Smith. See also tera:brochure.

anderson:bserverless:
Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe Matthews, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless network file systems. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 24, pages 364-385. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version anderson:serverless.

Keywords: file caching, distributed file system, pario-bib

Comment: Part of jin:io-book; reformatted version of anderson:serverless.

anderson:buttress:
Eric Anderson, Mahesh Kallahalla, Mustafa Uysal, and Ram Swaminathan. Buttress: A toolkit for flexible and high fidelity I/O benchmarking. In Proceedings of the USENIX FAST '04 Conference on File and Storage Technologies, pages 45-58, San Francisco, CA, March 2004. Hewlett-Packard Laboratories, USENIX Association.

Abstract: In benchmarking I/O systems, it is important to generate, accurately, the I/O access pattern that one is intending to generate. However, timing accuracy ( issuing I/Os at the desired time) at high I/O rates is difficult to achieve on stock operating systems. We currently lack tools to easily and accurately generate complex I/O workloads on modern storage systems. As a result, we may be introducing substantial errors in observed system metrics when we benchmark I/O systems using inaccurate tools for replaying traces or for producing synthetic workloads with known inter-arrival times.

In this paper, we demonstrate the need for timing accuracy for I/O benchmarking in the context of replaying I/O traces. We also quantitatively characterize the impact of error in issuing I/Os on measured system parameters. For instance, we show that the error in perceived I/O response times can be as much as +350% or -15% by using naive benchmarking tools that have timing inaccuracies. To address this problem, we present Buttress, a portable and flexible toolkit that can generate I/O workloads with microsecond accuracy at the I/O throughputs of high-end enterprise storage arrays. In particular, Buttress can issue I/O requests within 100µs of the desired issue time even at rates of 10000 I/Os per second (IOPS).

Keywords: benchmarking software, performance analysis, I/O access patterns, I/O workloads, pario-bib

Comment: Looks like a really cool piece of software. Generates I/O workloads by replaying I/O traces.

anderson:raid:
Eric Anderson, Ram Swaminathan, Alistair Veitch, Guillermo A. Alvarez, and John Wilkes. Selecting RAID levels for disk arrays. In Proceedings of the USENIX FAST '02 Conference on File and Storage Technologies, pages 189-202, Monterey, CA, January 2002. USENIX Association.

Abstract: Disk arrays have a myriad of configuration parameters that interact in counter-intuitive ways, and those interactions can have significant impacts on cost, performance, and reliability. Even after values for these parameters have been chosen, there are exponentially-many ways to map data onto the disk arrays' logical units. Meanwhile, the importance of correct choices is increasing: storage systems represent an growing fraction of total system cost, they need to respond more rapidly to changing needs, and there is less and less tolerance for mistakes. We believe that automatic design and configuration of storage systems is the only viable solution to these issues. To that end, we present a comparative study of a range of techniques for programmatically choosing the RAID levels to use in a disk array. Our simplest approaches are modeled on existing, manual rules of thumb: they "tag" data with a RAID level before determining the configuration of the array to which it is assigned. Our best approach simultaneously determines the RAID levels for the data, the array configuration, and the layout of data on that array. It operates as an optimization process with the twin goals of minimizing array cost while ensuring that storage workload performance requirements will be met. This approach produces robust solutions with an average cost/performance 14-17{PCT} better than the best results for the tagging schemes, and up to 150-200{PCT} better than their worst solutions. We believe that this is the first presentation and systematic analysis of a variety of novel, fully-automatic RAID-level selection techniques.

Keywords: file systems, pario-bib

anderson:serverless:
Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless network file systems. ACM Transactions on Computer Systems, 14(1):41-79, February 1996.
See also later version anderson:bserverless.

Keywords: file caching, distributed file system, pario-bib

Comment: See anderson:serverless-sosp.

ap:enwrich:
Apratim Purakayastha, Carla Schlatter Ellis, and David Kotz. ENWRICH: a compute-processor write caching scheme for parallel file systems. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 55-68, Philadelphia, May 1996. ACM Press.
See also earlier version ap:enwrich-tr.

Abstract: Many parallel scientific applications need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance has not been able to keep up with substantial improvements in parallel-I/O hardware because of poor parallel file-system software. Many radical changes, both at the interface level and the implementation level, have recently been proposed. One such proposed interface is collective I/O, which allows parallel jobs to request transfer of large contiguous objects in a single request, thereby preserving useful semantic information that would otherwise be lost if the transfer were expressed as per-processor non-contiguous requests. Kotz has proposed disk-directed I/O as an efficient implementation technique for collective-I/O operations, where the compute processors make a single collective data-transfer request, and the I/O processors thereafter take full control of the actual data transfer, exploiting their detailed knowledge of the disk-layout to attain substantially improved performance.

Recent parallel file-system usage studies show that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In this paper, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design via simulated implementation and show that ENWRICH achieves high performance for various configurations and workloads.

Keywords: parallel file system, parallel I/O, caching, pario-bib, dfk

ap:enwrich-tr:
Apratim Purakayastha, Carla Schlatter Ellis, and David Kotz. ENWRICH: a compute-processor write caching scheme for parallel file systems. Technical Report CS-1995-22, Dept. of Computer Science, Duke University, October 1995.
See also later version ap:enwrich.

Abstract: Many parallel scientific applications need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance has not been able to keep up with substantial improvements in parallel-I/O hardware because of poor parallel file-system software. Many radical changes, both at the interface level and the implementation level, have recently been proposed. One such proposed interface is collective I/O, which allows parallel jobs to request transfer of large contiguous objects in a single request, thereby preserving useful semantic information that would otherwise be lost if the transfer were expressed as per-processor non-contiguous requests. Kotz has proposed disk-directed I/O as an efficient implementation technique for collective-I/O operations, where the compute processors make a single collective data-transfer request, and the I/O processors thereafter take full control of the actual data transfer, exploiting their detailed knowledge of the disk-layout to attain substantially improved performance.

Recent parallel file-system usage studies show that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In this paper, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design via simulated implementation and show that ENWRICH achieves high performance for various configurations and workloads.

Keywords: parallel file system, parallel I/O, caching, pario-bib, dfk

ap:thesis:
Apratim Purakayastha. Characterizing and Optimizing Parallel File Systems. PhD thesis, Dept. of Computer Science, Duke University, Durham, NC, June 1996. Also available as technical report CS-1996-10.

Abstract: High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. In the first part of this dissertation, we attempt to fill this void by measuring a real file-system workload on a production parallel machine, namely the CM-5 at the National Center for Supercomputing Applications. We collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for design of future parallel file systems. Our usage study showed that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In the second part of this dissertation, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. Within its framework, ENWRICH uses a recently proposed high performance implementation of collective I/O operations called disk-directed I/O, but it eliminates a number of limitations of disk-directed I/O. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface, and without the requirement for mapping libraries at the I/O processors. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design of ENWRICH using simulated implementation and extensive experimentation. We show that ENWRICH achieves high performance for various configurations and workloads. We pinpoint the reasons for ENWRICH`s failure to perform well for certain workloads, and suggest possible enhancements. Finally, we discuss the nuances of implementing ENWRICH on a real platform and speculate about possible adaptations of ENWRICH for emerging multiprocessing platforms.

Keywords: parallel I/O, multiprocessor file system, file access patterns, workload characterization, file caching, disk-directed I/O, pario-bib

Comment: See also ap:enwrich, ap:workload, and nieuwejaar:workload

ap:workload:
Apratim Purakayastha, Carla Schlatter Ellis, David Kotz, Nils Nieuwejaar, and Michael Best. Characterizing parallel file-access patterns on a large-scale multiprocessor. In Proceedings of the Ninth International Parallel Processing Symposium, pages 165-172. IEEE Computer Society Press, April 1995.
See also earlier version ap:workload-tr.
See also later version nieuwejaar:workload-tr.

Abstract: High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, here we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.

Keywords: parallel I/O, file access pattern, multiprocessor file system, file system workload, dfk, pario-bib

Comment: See also kotz:workload, nieuwejaar:strided.

ap:workload-tr:
Apratim Purakayastha, Carla Schlatter Ellis, David Kotz, Nils Nieuwejaar, and Michael Best. Characterizing parallel file-access patterns on a large-scale multiprocessor. Technical Report CS-1994-33, Dept. of Computer Science, Duke University, October 1994.
See also later version ap:workload.

Abstract: Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors.

Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines.

The results of our trace analysis lead to recommendations for parallel file system design. First, the file system should support efficient concurrent access to many files, and I/O requests from many jobs under varying load condit ions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests, predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs, appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.

Keywords: parallel I/O, file access pattern, multiprocessor file system, file system workload, dfk, pario-bib

Comment: See also kotz:workload, nieuwejaar:strided.

arendt:genome:
James W. Arendt. Parallel genome sequence comparison using a concurrent file system. Technical Report UIUCDCS-R-91-1674, University of Illinois at Urbana-Champaign, 1991.

Keywords: parallel file system, parallel I/O, Intel iPSC/2, pario-bib

Comment: Studies the performance of Intel CFS. Uses an application that reads in a huge file of records, each a genome sequence, and compares each sequence against a given sequence. Looks at cache performance, message latency, cost of prefetches and directory reads, and throughput. He tries one-disk, one-proc transfer rates. Because of contention with the directory server on one of the two I/O nodes, it was faster to put all of the file on the other I/O node. Striping is good for multiple readers. Best access pattern was interleaved, not segmented or separate files, because it avoided disk seeks. Perhaps the files are stored contiguously? Can get good speedup by reading the sequences in big integral record sizes, from CFS, using a load-balancing scheduled by the host. Contention for directory blocks - through single-node directory server.

arge:GIS:
Lars Arge. External-memory algorithms with applications in GIS. In Marc van Kreveld, Jurg Nievergelt, Thomas Roos, and Peter Widmayer, editors, Algorithmic foundations of geographic information systems, volume 1340 of Lecture Notes in Computer Science, pages 213-254. Springer-Verlag, 1997.

Abstract: The paper presents a survey of the basic paradigms for designing efficient external-memory algorithms and especially for designing external-memory algorithms for computational geometry problems with applications in GIS. As the area of external-memory algorithms is relatively young the paper focuses on fundamental external-memory design techniques more than on algorithms for specific GIS problems. The presentation is survey-like with a more detailed discussion of the most important techniques and algorithms.

Keywords: out-of-core algorithm, geographic information system, GIS, pario-bib

Comment: not parallel? but mentions some parallel disk stuff.

arge:jsegments:
Lars Arge, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory algorithms for processing line segments in geographic information systems. Algorithmica, 1998. To appear.
See also earlier version arge:segments.

Abstract: We present a set of algorithms designed to solve large-scale geometric problems involving collections of line segments in the plane. Geographical information systems (GIS) handle large amounts of spatial data, and at some level the data is often manipulated as collections of line segments. NASA's EOS project is an example of a GIS that is expected to store and manipulate petabytes (thousands of terabytes, or millions of gigabytes) of data! In the design of algorithms for this type of large-scale application, it is essential to consider the problem of minimizing I/O communication, which is the bottleneck.

In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them. To solve these problems, we combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. We also develop a powerful new technique that can be regarded as a practical external memory version of fractional cascading. Except for the batched planar point location problem, no algorithms specifically designed for external memory were previously known for these problems. Our algorithms for triangulation and line segment intersection partially answer previously posed open problems, while the batched planar point location algorithm improves on the previously known solution, which applied only to monotone decompositions. Our algorithm for the red-blue line segment intersection problem is provably optimal.

Keywords: verify, out-of-core algorithm, computational geometry, pario-bib

Comment: Special issue on cartography and geographic information systems.

arge:lower:
Lars Arge and Peter Bro Miltersen. On showing lower bounds for external-memory computational geometry problems. In Abello and Vitter [abello:dimacs], pages 139-160.

Keywords: out-of-core algorithm, computational geometry, pario-bib

Comment: See also the component papers vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey. Not clear to what extent these papers are about *parallel* I/O.

arge:segments:
Lars Arge, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory algorithms for processing line segments in geographic information systems. In Proceedings of the Third European Symposium on Algorithms, volume 979 of Lecture Notes in Computer Science, pages 295-310, Corfu, Greece, September 1995. Springer-Verlag.
See also later version arge:jsegments.

Abstract: In the design of algorithms for large-scale applications it is essential to consider the problem of minimizing I/O communication. Geographical information systems (GIS) are good examples of such large-scale applications as they frequently handle huge amounts of spatial data. In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them.

Keywords: out-of-core algorithm, computational geometry, pario-bib

Comment: Does deal with parallel disks, though not in great detail.

arge:sorting:
Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeffrey Scott Vitter. Sequence sorting in secondary storage. In Proceedings of Compression and Complexity of Sequences, pages 329-346, Salerno, Italy, June 1998. IEEE Computer Society Press.

Abstract: We investigate the I/O complexity of the problem of sorting sequences (or strings of characters) in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is Theta (K log/sub 2/ K+N). By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting sequences is Theta ((K/B)log/sub M/B/(K/B)+(N/B)), but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where the strings are not allowed to be broken into their individual characters, and we show that the I/O complexity of string sorting in this model is Theta ((N/sub 1//B)log/sub M/B/(N/sub 1//B)+K/sub 2/+(N/B)), where N/sub 1/ is the total length of all strings shorter than B and K/sub 2/ is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms outside the comparison model.

Keywords: out-of-core algorithm, sorting algorithm, pario-bib

Comment: This paper is really the same paper as arge:sorting-strings.

arge:sorting-strings:
Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeffrey Scott Vitter. On sorting strings in external memory. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 540-548, El Paso, May 1997. ACM Press.

Abstract: In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is theta(K log K + N). By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is $θ(K/B log_(M/B) (K/B) + N/B)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is $θ(N_1/B log_(M/B) (N_1/B) + K_2 log_(M/B) K_2 + N/B)$, where $N_1$ is the total length of all strings shorter than B and $K_2$ is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.

Keywords: out-of-core algorithm, sorting, parallel I/O, pario-bib

Comment: Not parallel? But mentions some parallel disk stuff.

armen:disk-model:
Chris Armen. Bounds on the separation of two parallel disk models. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 122-127, Philadelphia, May 1996. ACM Press.

Abstract: The single-disk, D-head model of parallel I/O was introduced by Agarwal and Vitter to analyze algorithms for problem instances that are too large to fit in primary memory. Subsequently Vitter and Shriver proposed a more realistic model in which the disk space is partitioned into D disks, with a single head per disk. To date, each problem for which there is a known optimal algorithm for both models has the same asymptotic bounds on both models. Therefore, it has been unknown whether the models are equivalent or whether the single-disk model is strictly more powerful.

In this paper we provide evidence that the single-disk model is strictly more powerful. We prove a lower bound on any general simulation of the single-disk model on the multi-disk model and establish randomized and deterministic upper bounds. Let $N$ be the problem size and let $T$ be the number of parallel I/Os required by a program on the single-disk model. Then any simulation of this program on the multi-disk model will require $Ω\left(T \frac{\log(N/D)}{\log \log(N/D)}\right)$ parallel I/Os. This lower bound holds even if replication is allowed in the multi-disk model. We also show an $O\left(\frac{\log D}{\log \log D}\right)$ randomized upper bound and an $O\left(\log D (\log \log D)^2\right)$ deterministic upper bound. These results exploit an interesting analogy between the disk models and the PRAM and DCM models of parallel computation.

Keywords: parallel I/O, theory, parallel I/O algorithm, pario-bib

arpaci-dusseau:jriver:
H. Arpaci-Dusseau, Remzi. Run-time adaptation in River. ACM Transactions on Computer Systems, 21(1):36-86, February 2003.

Keywords: distributed query processing, dataflow, pario-bib

Comment: River is a dataflow programming environment for database query processing applications. River is specifically designed for clusters of computers with heterogeneous performance characteristics. The goal of the River runtime system is to adapt to "performance faults"-portions of the system that perform poorly by dynamically adjusting the transfer of data through the dataflow graph. River uses two constructs to build applications: a distributed queue that deals with performance faults by consumers, and graduated declustering that deals with performance faults of producers. A distributed queue pushes data through the dataflow graph at a rate proportional to the rate of consumption and adapts to changes in consumption rates. Graduated declustering deals with producer performance faults by reading from replicated producers. Although River is designed specifically for query processing, they briefly discuss how one might adapt scientific applications to work in their framework.

arpaci-dusseau:river:
Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 10-22, Atlanta, GA, May 1999. ACM Press.

Abstract: We introduce River, a data-flow programming environment and I/O substrate for clusters of computers. River is designed to provide maximum performance in the common case- even in the face of non-uniformities in hardware, software, and workload. River is based on two simple design features: a high-performance distributed queue,and a storage redundancy mechanism called graduated declustering.We have implemented a number of data-intensive applications on River, which validate our design with near-ideal performance in a variety of non-uniform performance scenarios.

Keywords: cluster computing, parallel I/O, pario-bib

arunachalam:prefetch:
Meenakshi Arunachalam, Alok Choudhary, and Brad Rullman. A prefetching prototype for the parallel file system on the Paragon. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 321-323, May 1995. Extended Abstract.

Keywords: parallel I/O, prefetching, parallel file system, pario-bib

Comment: A related paper is arunachalam:prefetch2.

arunachalam:prefetch2:
Meenkashi Arunachalam, Alok Choudhary, and Brad Rullman. Implementation and evaluation of prefetching in the Intel Paragon Parallel File System. In Proceedings of the Tenth International Parallel Processing Symposium, pages 554-559, April 1996.

Abstract: The significant difference between the speeds of the I/O system (e.g., disks) and compute processors in parallel systems creates a bottleneck that lowers the performance of an application that does a considerable amount of disk accesses. A major portion of the compute processors' time is wasted on waiting for I/O to complete. This problem can be addressed to a certain extent, if the necessary data can be fetched from the disk before the I/O call to the disk is issued. Fetching data ahead of time, known as prefetching in a multiprocessor environment depends a great deal on the application's access pattern. The subject of this paper is implementation and performance evaluation of a prefetching prototype in a production parallel file system on the Intel Paragon. Specifically, this paper presents a) design and implementation of a prefetching strategy in the parallel file system and b) performance measurements and evaluation of the file system with and without prefetching. The prototype is designed at the operating system level for the PFS. It is implemented in the PFS subsystem of the Intel Paragon Operating System. It is observed that in many cases prefetching provides considerable performance improvements. In some other cases no improvements or some performance degradation is observed due to the overheads incurred in prefetching.

Keywords: parallel I/O, prefetching, multiprocessor file system, pario-bib

Comment: See arunachalam:prefetch.

asami:bself:
Satoshi Asami, Nisha Talagala, and David A. Patterson. Designing a self-maintaining storage system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 30, pages 453-463. IEEE Computer Society Press and Wiley, New York, NY, 2001.
See also earlier version asami:self.

Keywords: parallel I/O, disk array, RAID, pario-bib

Comment: Part of jin:io-book; reformatted version of asami:self.

asami:self:
Satoshi Asami, Nisha Talagala, and David A. Patterson. Designing a self-maintaining storage system. In Proceedings of the Sixteenth IEEE Symposium on Mass Storage Systems, pages 222-233. IEEE Computer Society Press, March 1999.
See also later version asami:bself.

Abstract: This paper shows the suitability of a self-maintaining approach to Tertiary Disk, a large-scale disk array system built from commodity components. Instead of incurring the cost of custom hardware, we attempt to solve various problems by design and software. We have built a cluster of storage nodes connected by switched Ethernet. Each storage node is a PC hosting a few dozen SCSI disks, running the FreeBSD operating system. The system is used as a web-based image server for the Zoom Project in cooperation with the Fine Arts Museums of San Francisco (http://www.thinker.org/). We are designing self-maintenance extension to the OS to run on this cluster to mitigate the system administrator's burden. There are several components required for building self-maintaining system. One is decoupling the time failure from the time of hardware replacement. This implies the system must have some amount of redundancy, and has no single point of failure. Our system is fully redundant, and everything is constructed to avoid a single point of failure. Another is correctly identifying failures and their dependencies. The paper also outlines several approaches to lower the human cost of system administration of such a system and making the system as autonomous as possible.

Keywords: parallel I/O, disk array, RAID, pario-bib

asbury:fortranio:
Raymond K. Asbury and David S. Scott. FORTRAN I/O on the iPSC/2: Is there read after write?. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 129-132, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: parallel I/O, hypercube, Intel iPSC/2, file access pattern, pario-bib

asthana:active:
Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski. An experimental active memory based I/O subsystem. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 73-84. AT&T Bell Labs, April 1994. Also appeared in Computer Architecture News 22(4).
See also later version asthana:active-book.

Keywords: parallel I/O, architecture, pario-bib

Comment: They describe an I/O subsystem based on an ``active memory'' called SWIM (Structured Wafer-based Intelligent Memory). SWIM chips are RAM chips with some built-in processing. The idea is that these tiny processors can manipulate the data in the chip at full speed, without dealing with memory bus or off-chip costs. Further, the chips can work in parallel. They demonstrate how they've used this to build a national phone database server, a high-performance IP router, and a call-screening agent.

asthana:active-book:
Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski. An experimental memory-based I/O subsystem. In Jain et al. [iopads-book], chapter 17, pages 373-390.
See also earlier version asthana:active.

Abstract: We describe an I/O subsystem based on an active memory named SWIM (Structured Wafer-based Intelligent Memory) designed for efficient storage and manipulation of data structures. The key architectural idea in SWIM is to associate some processing logic with each memory chip that allows it to perform data manipulation operations locally and to communicate with a disk or a communication line through a backend port. The processing logic is specially designed to perform operations such as pointer dereferencing, memory indirection, searching and bounds checking efficiently. The I/O subsystem is built using an interconnected ensemble of such memory logic pairs. A complex processing task can now be distributed between a large number of small memory processors each doing a sub-task, while still retaining a common locus of control in the host CPU for higher level administrative and provisioning functions. We argue that active memory based processing enables more powerful, scalable and robust designs for storage and communications subsystems, that can support emerging network services, multimedia workstations and wireless PCS systems. A complete parallel hardware and software system constructed using an array of SWIM elements has been operational for over a year. We present results from application of SWIM to three network functions: a national phone database server, a high performance IP router, and a call screening agent.

Keywords: parallel I/O architecture, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

avalani:channels:
Bhavan Avalani, Alok Choudhary, Ian Foster, and Rakesh Kirshnaiyer. Integrating task and data parallelism using parallel I/O techniques. In Proceedings of the International Workshop on Parallel Processing, Bangalore, India, December 1994.

Keywords: parallel I/O, pario-bib

Comment: They describe using the techniques of delrosario and debenedictis (although without mentioning them) to provide for channels (parallel pipes) between independent data-parallel tasks. The technique really is the same as in debenedictus and delrosario, although they extend it a bit to allow multiple "files" within a channel (why not use multiple channels)? Also, they depend on the program to read and write synchronization variables to control access to the flow of data through the channel. While this may provide good performance in some cases, why not have support for automatic flow control? The system can detect when a portion of the channel is written, and release readers waiting on that portion of the channel (if any). The paper is a bit confusing in its use of the word "file", which seems to be used to mean different things at different points. Also, they seem to use an arbitrary distribution for the "file", which may or may not be the same as one of those used by the two endpoints.

baer:grid-io:
Troy Baer and Pete Wyckoff. A parallel I/O mechanism for distributed systems.. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 63-69, San Diego, CA, September 2004. IEEE Computer Society Press.

Abstract: Access to shared data is critical to the long term success of grids of distributed systems. As more parallel applications are being used on these grids, the need for some kind of parallel I/O facility across distributed systems increases. However, grid middleware has thus far had only limited support for distributed parallel I/O. In this paper, we present an implementation of the MPI-2 I/O interface using the Globus GridFTP client API. MPI is widely used for parallel computing, and its I/O interface maps onto a large variety of storage systems. The limitations of using GridFTP as an MPI-I/O transport mechanism are described, as well as support for parallel access to scientific data formats such as HDF and NetCDF. We compare the performance of GridFTP to that of NFS on the same network using several parallel I/O benchmarks. Our tests indicate that GridFTP can be a workable transport for parallel I/O, particularly for distributed read-only access to shared data sets. (26 refs.)

Keywords: grid I/O, MPI-I/O, grid middleware, gridFTP, pario-bib

bagrodia:sio-character:
Rajive Bagrodia, Andrew Chien, Yarsun Hsu, and Daniel Reed. Input/output: Instrumentation, characterization, modeling and management policy. Technical Report CCSF-41, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.

Keywords: parallel I/O, pario-bib, prefetching, caching, multiprocessor file system, file access pattern

Comment: Basically there are two parts to this paper. First, they will instrument applications, Intel PFS, and IBM Vesta, to trace I/O-related activity. Then they'll use Pablo to analyze and characterize. They plan to trace some events in detail, and the rest with histogram counters. Second, they plan to develop caching and prefetching policies and to analyze those with simulation, analysis, and implementation. They note that IBM and Intel are developing parallel I/O architecture simulators. See also poole:sio-survey, choudhary:sio-language, bershad:sio-os.

bairavasundaram:x-ray:
Lakshmi N. Bairavasundaram, Muthian Sivathanu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. X-RAY: A non-invasive exclusive caching mechanism for RAIDs. In Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 176-187, Munich, Germany, June 2004. IEEE Computer Society Press.

Abstract: RAID storage arrays often possess gigabytes of RAM for caching disk blocks. Currently, most RAID systems use LRU or LRU-like policies to manage these caches. Since these array caches do not recognize the presence of file system buffer caches, they redundantly retain many of the same blocks as those cached by the file system, thereby wasting precious cache space. In this paper, we introduce X-RAY, an exclusive RAID array caching mechanism. X-RAY achieves a high degree of (but not perfect) exclusivity through gray-box methods: by observing which files have been accessed through updates to file system meta-data, X-RAY constructs an approximate image of the contents of the file system cache and uses that information to determine the exclusive set of blocks that should be cached by the array. We use microbenchmarks to demonstrate that X-RAY's prediction of the file system buffer cache contents is highly accurate, and trace-based simulation to show that X-RAY considerably outperforms LRU and performs as well as other more invasive approaches. The main strength of the X-RAY approach is that it is easy to deploy - all performance gains are achieved without changes to the SCSI protocol or the file system above.

Keywords: RAID, x-ray, caching policies, pario-bib

baird:disa:
R. Baird, S. Karamooz, and H. Vazire. Distributed information storage architecture. In Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems, pages 145-155, 1993.

Keywords: parallel I/O, distributed file system, mass storage, pario-bib

Comment: Architecture for distributed information storage. Integrates file systems, databases, etc. Single system image, lots of support for administration. O-O model, with storage device objects, logical device objects, volume objects, and file objects. Methods for each type of object, including administrative methods.

bakker:semantic:
J.A. Bakker. Semantic partitioning as a basis for parallel I/O in database management systems. Parallel Computing, 26(11):1491-1513, October 2000.

Abstract: Modern applications such as `video on demand' require fast reading of complete files, which can be supported well by file striping. Many conventional applications, however, are only interested in some part of the available records. In order to avoid reading attributes irrelevant to such applications, each attribute could be stored in a separate (transposed) file; Aiming at I/O parallelism, byte-oriented striping could be applied to transposed files. However, such a fragmentation ignores the semantics of data. This fragmentation cannot be optimized by a database management system (DBMS) because a DBMS has to perform its tasks on the basis of data semantics. For example, queries must be translated into file operations using a scheme that maps a data model to a file system. However, details about files, such as the striping width, are invisible to a DBMS. Therefore, we propose to store each transposed file related to a composite type on a separate, independent disk drive, which means I/O parallelism tuned to a data model. As we also aim at system reliability and data availability, each transposed file must be duplicated on another drive. Consequently, a DBMS also has to guarantee correctness and completeness of the allocation of transposed files within an array of disk drives. As a solution independent of the underlying data model, we propose an abstract framework consisting of a meta model and a set of rules

Keywords: database, parallel I/O, pario-bib

baldwin:hyperfs:
C. H. Baldwin and W. C. Nestlerode. A large scale file processing application on a hypercube. In Proceedings of the Fifth Annual Distributed-Memory Computer Conference, pages 1400-1404, 1990.

Keywords: multiprocessor file system, file access pattern, parallel I/O, hypercube, pario-bib

Comment: Census-data processing on an nCUBE/10 at USC. Their program uses an interleaved pattern, which is like my lfp or gw with multi-record records (i.e., the application does its own blocking). Shifted to asynchronous I/O to do OBL manually. Better results if they did more computation per I/O (of course).

baptist:fft:
Lauren M. Baptist. Two algorithms for performing multidimensional, multiprocessor, out-of-core FFTs. Technical Report PCS-TR99-350, Dept. of Computer Science, Dartmouth College, Hanover, NH, June 1999.

Abstract: We show two algorithms for computing multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis.

The first method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the data for the 1-dimensional FFTs are contiguous.

The second method is an adaptation of another well-known method for in-core, multidimensional FFTs. This method computes all dimensions simultaneously. It is more difficult to generalize to arbitrary radices and number of dimensions in this method than in the first method. Our present implementation is therefore limited to two dimensions of equal size, that are again integer powers of 2.

We present I/O complexity analyses for both methods as well as empirical results for a DEC 2100 server and an SGI Origin 2000, each of which has a parallel disk system. Our results indicate that the methods are comparable in speed in two-dimensions.

Keywords: parallel I/O, out of core, FFT, parallel algorithm, scientific computing, pario-bib

Comment: Undergraduate Honors Thesis. Advisor: Tom Cormen.

barak:hfs:
Amnon Barak, Bernard A. Galler, and Yaron Farber. A holographic file system for a multicomputer with many disk nodes. Technical Report 88-6, Dept. of Computer Science, Hebrew University of Jerusalem, May 1988.

Keywords: parallel I/O, hashing, reliability, disk mirroring, pario-bib

Comment: Describes a file system for a distributed system that scatters records of each file over many disks using hash functions. The hash function is known by all processors, so no one processor must be up to access the file. Any portion of the file whose disknode is available may be accessed. Shadow nodes are used to take over for nodes that go down, saving the info for later use by the proper node. Intended to easily parallelize read/write accesses and global file operations, and to increase file availability.

barve:bus:
Rakesh Barve, Elizabeth Shriver, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, and Jeffrey Scott Vitter. Modeling and optimizing I/O throughput of multiple disks on a bus (summary). In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 264-265. ACM Press, June 1998.
See also later version barve:bus2.

Keywords: disk model, I/O bus, device model, I/O model, pario-bib

barve:bus2:
Rakesh Barve, Jeffrey Vitter, Elizabeth Shriver, Phillip Gibbons, Bruce Hillyer, and Yossi Matias. Modeling and optimizing I/O throughput of multiple disks on a bus. In Proceedings of the 1999 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 83-92. ACM Press, June 1999.
See also earlier version barve:bus.

Keywords: disk model, I/O bus, device model, I/O model, pario-bib

barve:competitive2:
Rakesh Barve, Mahesh Kallahalla, Peter J. Varman, and Jeffrey Scott Vitter. Competitive parallel disk prefetching and buffer management. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 47-56, San Jose, CA, November 1997. ACM Press.

Abstract: We provide a competitive analysis framework for online prefetching and buffer management algorithms in parallel I/O systems, using a read-once model of block references. This has widespread applicability to key I/O-bound applications such as external merging and concurrent playback of multiple video streams. Two realistic lookahead models, global lookahead and local lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently in existing systems. An important aspect of our work is that we show how to implement both the models of lookahead in practice using the simple techniques of forecasting and flushing.

Given a D-disk parallel I/O system and a globally shared I/O buffer that can hold upto M disk blocks, we derive a lower bound of $Ω(\sqrt{D}$) on the competitive ratio of any deterministic online prefetching algorithm with O(M) lookahead. NOM is shown to match the lower bound using global M-block lookahead. In contrast, using only local lookahead results in an $Ω(D)$ competitive ratio. When the buffer is distributed into D portions of M/D blocks each, the algorithm GREED based on local lookahead is shown to be optimal, and NOM is within a constant factor of optimal. Thus we provide a theoretical basis for the intuition that global lookahead is more valuable for prefetching in the case of a shared buffer configuration whereas it is enough to provide local lookahead in case of the distributed configuration. Finally, we analyze the performance of these algorithms for reference strings generated by a uniformly-random stochastic process and we show that they achieve the minimal expected number of I/Os. These results also give bounds on the worst-case expected performance of algorithms which employ randomization in the data layout.

Keywords: disk prefetching, file caching, parallel I/O, pario-bib

Comment: See also barve:competitive. They propose two methods for scheduling prefetch operations in the situation where the access pattern is largely known in advance, in such a way as to minimize the total number of parallel I/Os. The two methods are quite straightforward, and yet match the optimum lower bound for an on-line algorithm.

barve:jmergesort:
Rakesh D. Barve, Edward F. Grove, and Jeffrey S. Vitter. Simple randomized mergesort on parallel disks. Parallel Computing, 23(4):601-631, June 1997.
See also earlier version barve:mergesort.

Abstract: We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the ``right'' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the ``right'' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs.

The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M, and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.

Keywords: parallel I/O algorithm, sorting, pario-bib

Comment: This paper formerly called barve:mergesort; I discovered that the paper had appeared in SPAA96, so the SPAA96 paper is now called barve:mergesort.

barve:mergesort:
Rakesh D. Barve, Edward F. Grove, and Jeffrey S. Vitter. Simple randomized mergesort on parallel disks. In Proceedings of the Eighth Symposium on Parallel Algorithms and Architectures, pages 109-118, Padua, Italy, June 1996. ACM Press.
See also later version barve:jmergesort.

Keywords: parallel I/O algorithm, sorting, pario-bib

barve:round:
Rakesh Barve, Phillip B. Gibbons, Bruce K. Hillyer, Yossi Matias, Elizabeth Shriver, and Jeffrey Scott Vitter. Round-like behavior in multiple disks on a bus. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 1-9, Atlanta, GA, May 1999. ACM Press.

Abstract: In modern I/O architectures, multiple disk drives are attached to each I/O bus. Under I/O-intensive workloads, the disk latency for a request can be overlapped with the disk latency and data transfers of requests to other disks, potentially resulting in an aggregate I/O throughput at nearly bus bandwidth. This paper reports on a performance impairment that results from a previously unknown form of convoy behavior in disk I/O, which we call rounds. In rounds, independent requests to distinct disks convoy, so that each disk services one request before any disk services its next request. We analyze log files to describe read performance of multiple Seagate Wren-7 disks that share a SCSI bus under a heavy workload, demonstrating the rounds behavior and quantifying its performance impact.

Keywords: disk, I/O bus, parallel I/O, pario-bib

batcher:staran:
K. E. Batcher. STARAN parallel processor system hardware. AFIPS Conference Proceedings, pages 405-410, 1974.

Keywords: parallel architecture, array processor, parallel I/O, SIMD, pario-bib

Comment: This paper is reproduced in Kuhn and Padua's (1981, IEEE) survey ``Tutorial on Parallel Processing.'' The STARAN is an array processor that uses Multi-Dimensional-Access (MDA) memories and permutation networks to access data in bit slices in a variety of ways, with high-speed I/O capabilities. Its router (called the flip network) could permute data among the array processors, or between the array processors and external devices, including disks, video input, and displays.

baylor:methodology:
Sandra Johnson Baylor, Caroline Benveniste, and Leo J. Beolhouwer. A methodology for evaluating parallel I/O performance for massively parallel processors. In Proceedings of the 27th Annual Simulation Symposium, pages 31-40, April 1994.

Keywords: parallel I/O, parallel architecture, simulation, pario-bib

baylor:perfeval:
Sandra Johnson Baylor, Caroline B. Benveniste, and Yarsun Hsu. Performance evaluation of a parallel I/O architecture. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 404-413, Barcelona, July 1995. ACM Press.
See also earlier version baylor:perfeval-tr.

Keywords: performance evaluation, parallel architecture, parallel I/O, pario-bib

Comment: They use a simulator to evaluate the performance of a parallel I/O system. They simulate the network and disks under a synthetic workload, and measure the time it takes for I/O requests to traverse the network, be processed, and return. They also measure the impact of I/O requests on non-I/O messages. Their results are fairly unsurprising.

baylor:perfeval-tr:
Sandra Johnson Baylor, Caroline B. Benveniste, and Yarsun Hsu. Performance evaluation of a parallel I/O architecture. Technical Report RC 20049, IBM T. J. Watson Research Center, May 1995.
See also later version baylor:perfeval.

Keywords: performance evaluation, parallel architecture, parallel I/O, pario-bib

baylor:vulcan-perf:
Sandra Johnson Baylor, Caroline Benveniste, and Yarsun Hsu. Performance evaluation of a massively parallel I/O subsystem. In Proceedings of the IPPS '94 Workshop on Input/Output in Parallel Computer Systems, pages 1-15. IBM Watson Research Center, 1994. Also appeared in Computer Architecture News 22(4).
See also later version baylor:vulcan-perf-book.

Keywords: parallel I/O, parallel architecture, performance analysis, pario-bib

Comment: See polished version baylor:vulcan-perf-book. Simulation of the I/O architecture for the Vulcan MPP at IBM TJW. This is a distributed-memory MIMD system with a bidirectional omega-type interconnection network, and separate compute and I/O nodes. They use a stochastic workload to evaluate the average I/O performance under a few different situations, and then use that average performance, along with a stochastic workload, in a detailed simulation of the interconnection network. (What would be the effect of adding variance to the I/O-node performance?) A key point is that the I/O node will not accept any more requests until a current write request is finished being processed (copied into the write-back cache). If there are many writes, this can backup the network (would a different write-request protocol help?) Not clear how concurrency of reads are modeled. Results show that network saturates for high request rates and small number of I/O nodes. As request rate decreases or number of I/O nodes increases, performance levels off to a reasonable value. Placement of I/O nodes didn't make much difference, nor did extra non-I/O traffic. Given their parameters, and for reasonable loads, 1 I/O node per 4 compute nodes was a reasonable balance, and was scalable.

baylor:vulcan-perf-book:
Sandra Johnson Baylor, Caroline Benveniste, and Yarsun Hsu. Performance evaluation of a massively parallel I/O subsystem. In Jain et al. [iopads-book], chapter 13, pages 293-311.
See also earlier version baylor:vulcan-perf.

Abstract: Presented are the trace-driven simulation results of a study conducted to evaluate the performance of the internal parallel I/O subsystem of the Vulcan massively parallel processor (MPP) architecture. The system sizes evaluated vary from 16 to 512 nodes. The results show that a compute node to I/O node ratio of four is the most cost effective for all system sizes, suggesting high scalability. Also, processor-to-processor communication effects are negligible for small message sizes and the greater the fraction of I/O reads, the better the I/O performance. Worse case I/O node placement is within 13% of more efficient placement strategies. Introducing parallelism into the internal I/O subsystem improves I/O performance significantly.

Keywords: parallel I/O architecture, performance evaluation, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

baylor:workload:
Sandra Johnson Baylor and C. Eric Wu. Parallel I/O workload characteristics using Vesta. In Proceedings of the IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 16-29, April 1995.
See also later version baylor:workload-book.

Abstract: In recent years, the design and performance evaluation of parallel processors has focused on the processor, memory and communication subsystems. As a result, these subsystems have better performance potential than the I/O subsystem. In fact, the I/O subsystem is the bottleneck in many machines. However, there are a number of studies currently underway to improve the design of parallel I/O subsystems. To develop optimal parallel I/O subsystem designs, one must have a thorough understanding of the workload characteristics of parallel I/O and its exploitation of the associated parallel file system. Presented are the results of a study conducted to analyze the parallel I/O workloads of several applications on a parallel processor using the Vesta parallel file system. Traces of the applications are obtained to collect system events, communication events, and parallel I/O events. The traces are then analyzed to determine workload characteristics. The results show I/O request rates on the order of hundreds of requests per second, a large majority of requests are for small amount of data (less than 1500 bytes), a few requests are for large amounts of data (on the order of megabytes), significant file sharing among processes within a job, and strong temporal, traditional spatial, and interprocess spatial locality.

Keywords: parallel I/O, workload characterization, pario-bib

Comment: See polished version baylor:workload-book. They characterize four parallel applications: sort, matrix multiply, seismic migration, and video server, in terms of their I/O activity. They found results that are consistent with kotz:workload, in that they also found lots of small data requests, some large data requests, significant file sharing and interprocess locality. This study found less of the non-contiguous access than did kotz:workload, because of the logical views provided by Vesta. Note on-line postscript does not include figures.

baylor:workload-book:
Sandra Johnson Baylor and C. Eric Wu. Parallel I/O workload characteristics using Vesta. In Jain et al. [iopads-book], chapter 7, pages 167-185.
See also earlier version baylor:workload.

Abstract: To develop optimal parallel I/O subsystems, one must have a thorough understanding of the workload characteristics of parallel I/O and its exploitation of the associated parallel file system. Presented are the results of a study conducted to analyze the parallel I/O workloads of several applications on a parallel processor using the Vesta parallel file system. Traces of the applications are obtained to collect system events, communication events, and parallel I/O events. The traces are then analyzed to determine workload characteristics. The results show I/O request rates on the order of hundreds of requests per second, a large majority of requests are for small amounts of data (less than 1500 bytes), a few requests are for large amounts of data (on the order of megabytes), significant file sharing among processes within a job, and strong temporal, traditional spatial, and interprocess spatial locality.

Keywords: parallel I/O, file access pattern, workload characterization, file system workload, pario-bib

Comment: Part of a whole book on parallel I/O; see iopads-book.

bbn:admin:
BBN Advanced Computers Inc. TC2000 System Administration Guide, revision 3.0 edition, April 1991.

Keywords: BBN, parallel I/O, pario-bib

Comment: Administrative manual for the TC2000 I/O system. Can stripe over partitions in a user-specified set of disks. Large requests automatically split and done in parallel. See also garber:tc2000.

becher:ooc-solver:
Jonathan D. Becher and John F. Porter. Out of core dense solvers for the MasPar parallel computer. Technical Report MP/IP/SP-37.94, MasPar Computer Corporation, 1994.

Keywords: parallel I/O, scientific computing, linear algebra, pario-bib

Comment: They look at out-of-core block and slab solvers for the Maspar. They overlap reading one block with the computation of the previous block. They solve matrices up to 40k x 40k, and obtain 3.14 GFlops even with I/O considered.

bell:physics:
Jean L. Bell. A specialized data management system for parallel execution of particle physics codes. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 277-285, Chicago, IL, 1988. ACM Press.

Keywords: file access pattern, disk prefetch, file system, pario-bib

Comment: A specialized database system for particle physics codes. Valuable for its description of access patterns and subsequent file access requirements. Particle-in-cell codes iterate over timesteps, updating the position of each particle, and then the characteristics of each cell in the grid. Particles may move from cell to cell. Particle update needs itself and nearby gridcell data. The whole dataset is too big for memory, and each timestep must be stored on disk for later analysis anyway. Regular file systems are inadequate: specialized DBMS is more appropriate. Characteristics needed by their application class: multidimensional access (by particle type or by location, i.e., multiple views of the data), coordination between grid and particle data, coordination between processors, coordinated access to meta-data, inverted files, horizontal clustering, large blocking of data, asynchronous I/O, array data, complicated joins, and prefetching according to user-prespecified order. Note that many of these things can be provided by a file system, but that most are hard to come by in typical file systems, if not impossible. Many of these features are generalizable to other applications.

benner:pargraphics:
Robert E. Benner. Parallel graphics algorithms on a 1024-processor hypercube. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, pages 133-140, Monterey, CA, 1989. Golden Gate Enterprises, Los Altos, CA.

Keywords: hypercube, graphics, parallel algorithm, parallel I/O, pario-bib

Comment: About using the nCUBE/10's RT Graphics System. They were frustrated by an unusual mapping from the graphics memory to the display, a shortage of memory on the graphics nodes, and small message buffers on the graphics nodes. They wrote some algorithms for collecting the columns of pixels from the hypercube nodes, and routing them to the appropriate graphics node. They also would have liked a better interconnection network between the graphics nodes, at least for synchronization.

bennett:jovian:
Robert Bennett, Kelvin Bryant, Alan Sussman, Raja Das, and Joel Saltz. Jovian: A framework for optimizing parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, pages 10-20, Mississippi State, MS, October 1994. IEEE Computer Society Press.

Keywords: parallel I/O, pario-bib

Comment: Jovian is a runtime library for use with SPMD codes, eg, HPF. They restrict IO to collective operations, and provide extra processes to 'coalesce' the many requests from multiple CPs into fewer larger requests to the operating system, perhaps optimized for access order. They mention that there is a standardization process underway for specifying data distributions. Also a compact representation for strided access to n-dimensional data structures. Coalescing basically means combining requests to eliminate duplication and to combine adjacent requests. Requests to coalescers are in full blocks, to lower the processing overhead. Nonetheless, their method involves moving requests around twice, and involve several memory-memory copies of the data, so their overhead is high.

berdahl:transport:
Lawrence Berdahl. Parallel transport protocol proposal. Lawrence Livermore National Labs, January 3, 1995. Draft.
See also earlier version berdahl:woodenman.

Keywords: parallel I/O, network, supercomputer system, pario-bib

Comment: An update of berdahl:woodenman, close to the final draft.

berdahl:woodenman:
Lawrence Berdahl. Parallel data exchange. Lawrence Livermore National Labs, January 28, 1994. WoodenMan Proposal.
See also later version berdahl:transport.

Keywords: parallel I/O, network, supercomputer system, pario-bib

Comment: They describe a protocol for making parallel data transfers of arbitrary data sets from one set of data servers to another set of data servers. The goal is to be independent of specific architectures or even types of data servers, and to work on top of existing transport protocols. The data set is described using a gather set for the source and a scatter set for the destination, and using a linear address space as an intermediate representation. All the servers are contacted, they figure out who they need to talk, and exchange port information with them. Each pair exchanges votes on who will control the transfer (ie, who will control the order of the transfer), and on their maximum data rates. This information is used to settle on the control and set of ports to be used. This proposal is not final and is under active development, so it may change.

berrendorf:paragon:
R. Berrendorf, H. Burg, and U. Detert. Performance characteristics of parallel computers: Intel Paragon case study. {IT+TI} Informationstechnik und Technische Informatik, 37(2):37-45, April 1995. (In German).

Keywords: parallel computing, performance evaluation, parallel file system, pario-bib

Comment: In German. They summarize typical performance of the Intel Paragon, including the communication performance and the parallel file-system performance.

berry:nasa:
Michael R. Berry and Tarek A. El-Ghazawi. Parallel input/output characteristics of NASA science applications. In Proceedings of the Tenth International Parallel Processing Symposium, pages 741-747, Honolulu, April 1996. IEEE Computer Society Press.

Keywords: scientific computation, application, parallel I/O, pario-bib

bershad:sio-os:
Brian Bershad, David Black, David DeWitt, Garth Gibson, Kai Li, Larry Peterson, and Marc Snir. Operating system support for high-performance parallel I/O systems. Technical Report CCSF-40, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 1994.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Four major components: networking, memory servers, file system, and persistent object store. Networking part focuses on low-latency support communication within an application, between applications, and between machines (Bershad and Peterson). Memory servers, shared virtual memory, and checkpointing support (Kai Li). File systems support includes benchmarking, transparent informed prefetching (Gibson), a common interface for PFS and Vesta (Snir), and integrating secondary and tertiary storage systems (including the integration of the National Storage Lab's HPSS (see coyne:hpss) into this project in 1995). OSF/1 (Black) will be extended to support parallel file systems, extent-like behavior, and block coalescing. Persistent object store (DeWitt) is radical change to an object-oriented interface, transparent I/O (though extensible and changable with subclassing, presumably), and heterogeneous support via the Object Definition Language standard. Persistent objects may be integrated with the memory servers and shared virtual memory. See also poole:sio-survey, bagrodia:sio-character, choudhary:sio-language.

berson:multimedia:
Steven Berson, Leana Golubchik, and Richard R. Muntz. Fault tolerant design of multimedia servers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 364-375. ACM Press, 1995.

Keywords: fault tolerance, multimedia, video on demand, parallel I/O, pario-bib

best:cmmdio:
Michael L. Best, Adam Greenberg, Craig Stanfill, and Lewis W. Tucker. CMMD I/O: A parallel Unix I/O. In Proceedings of the Seventh International Parallel Processing Symposium, pages 489-495, Newport Beach, CA, 1993. IEEE Computer Society Press.

Keywords: parallel I/O, multiprocessor file system, pario-bib

Comment: Much like Intel CFS, with different I/O modes that determine when the compute nodes synchronize, and the semantics of I/Os written to the file. They found it hard to get good bandwidth for independent I/Os, as opposed to coordinated I/Os; part of this was due to their RAID 3 disk array, but it is more complicated than that. Some performance numbers were given in talk.

bestavros:raid:
Azer Bestavros. IDA-based redundant arrays of inexpensive disks. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 2-9, December 1991.

Keywords: RAID, disk array, reliability, parallel I/O, pario-bib

Comment: Uses the Information Dispersal Algorithm (IDA) to generate $n+m$ blocks from $n$ blocks, to tolerate $m$ disk failures; all of the data from the $n$ blocks is hidden in the $n+m$ blocks. Not with the RAID project.

bester:gass:
Joseph Bester, Ian Foster, Carl Kesselman, Jean Tedesco, and Steven Tuecke. GASS: A data movement and access service for wide area computing systems. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 78-88, Atlanta, GA, May 1999. ACM Press.

Abstract: In wide area computing, programs frequently execute at sites that are distant from their data. Data access mechanisms are required that place limited functionality demands on an application or host system yet permit high-performance implementations. To address these requirements, we propose a data movement and access service called Global Access to Secondary Storage (GASS). This service defines a global name space via Uniform Resource Locators and allows applications to access remote files via standard I/O interfaces. High performance is achieved by incorporating default data movement strategies that are specialized for I/O patterns common in wide area applications and by providing support for programmer management of data movement. GASS forms part of the Globus toolkit, a set of services for high-performance distributed computing. GASS itself makes use of Globus services for security and communication, and other Globus components use GASS services for executable staging and real-time remote monitoring. Application experiences demonstrate that the library has practical utility.

Keywords: wide-area network, parallel I/O, pario-bib

beynon:datacutter:
Michael D. Beynon, Renato Ferreira, Tahsin Kurc, Alan Sussman, and Joel Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In Proceedings of the 2000 Mass Storage Systems Conference, pages 119-133, College Park, MD, March 2000. IEEE Computer Society Press.

Keywords: data grid, filter, pario-bib

bitton:schedule:
Dina Bitton. Arm scheduling in shadowed disks. In Proceedings of IEEE Compcon, pages 132-136, Spring 1989.

Keywords: parallel I/O, disk shadowing, reliability, disk mirroring, disk optimization, pario-bib

Comment: Goes further than bitton:shadow. Uses simulation to verify results from that paper, which were expressions for the expected seek distance of shadowed disks, using shortest-seek-time arm scheduling. Problem is her assumption that arm positions stay independent, in the face of correlating effects like writes, which move all arms to the same place. Simulations match model only barely, and only in some cases. Anyway, shadowed disks can improve performance for workloads more than 60 or 70% reads.

bitton:shadow:
D. Bitton and J. Gray. Disk shadowing. In Proceedings of the 14th International Conference on Very Large Data Bases, pages 331-338, 1988.

Keywords: parallel I/O, disk shadowing, reliability, disk mirroring, disk optimization, pario-bib

Comment: Also TR UIC EECS 88-1 from Univ of Illinois at Chicago. Shadowed disks are mirroring with more than 2 disks. Writes to all disks, reads from one with shortest seek time. Acknowledges but ignores problem posed by lo:disks. Also considers that newer disk technology does not have linear seek time $(a+bx)$ but rather $(a+b\sqrt{x})$. Shows that with either seek distribution the average seek time for workloads with at least 60% reads decreases in the number of disks. See also bitton:schedule.

bjorstad:structure:
P. E. Bj\orstad and J. Cook. Large scale structural analysis on massively parallel computers. In Linear Algebra for Large Scale and Real-Time Applications, pages 3-11. Kluwer Academic Publishers, 1993. ftp from ftp.ii.uib.no in \verb+pub/tech_reports/mpp_sestra.ps.Z+.

Keywords: parallel I/O, file access pattern, pario-bib

Comment: A substantial part of this structural-analysis application was involved in I/O, moving substructures in and out of RAM. The Maspar IO-RAM helped a lot, nearly halving the time required. On the Cray, the SSD had an even bigger impact, perhaps 7-12 times faster. Their main conclusion is that caching helped. Most likely this was due to its double-buffering, since they structured the code to read/compute/write in large ``superblocks''.

blaum:evenodd:
Mario Blaum, Jim Brady, Jehoshua Bruck, Jai Menon, and Alexander Vardy. The EVENODD code and its generalization: An efficient scheme for tolerating multiple disk failures in RAID architectures. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel {I/O}: Technologies and Applications, chapter 14, pages 187-208. IEEE Computer Society Press and Wiley, New York, NY, 2001.

Keywords: disk array, RAID, parallel I/O, pario-bib

Comment: Part of jin:io-book.

bonachea:java-io:
Dan Bonachea, Phillip Dickens, and Rajeev Thakur. High-performance file I/O in Java: Existing approaches and bulk I/O extensions. Concurrency and Computation: Practice and Experience, 13(8-9):713-736, 2001.
See also earlier version bonachea:java-io-tr.

Abstract: There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java-many of which are not obvious at first glance-and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.

Keywords: parallel I/O, Java, file system interface, pario-bib

bonachea:java-io-tr:
Dan Bonachea, Phillip Dickens, and Rajeev Thakur. High-performance file I/O in Java: Existing approaches and bulk I/O extensions. Technical Report ANL/MCS-P840-0800, Mathematics and Computer Science Division, Argonne National Laboratory, August 2000.
See also later version bonachea:java-io.

Abstract: There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java-many of which are not obvious at first glance-and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O API and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.

Keywords: parallel I/O, java, file system interface, pario-bib

boral:bubba:
Haran Boral, William Alexander, Larry Clay, George Copeland, Scott Danforth, Michael Franklin, Brian Hart, Marc Smith, and Patrick Valduriez. Prototyping Bubba, a highly parallel database system. IEEE Transactions on Knowledge and Data Engineering, 2(1), March 1990.

Keywords: parallel I/O, database, disk caching, pario-bib

Comment: More recent than copeland:bubba, and a little more general. This gives few details, and doesn't spend much time on the parallel I/O. Bubba does use parallel independent disks, with a significant effort to place data on the disks, and do the work local to the disks, to balance the load and minimize interprocessor communication. Also they use a single-level store (i.e., memory-mapped files) to improve performance of their I/O system, including page locking that is assisted by the MMU. The OS has hooks for the database manager to give memory-management policy hints.

boral:critique:
H. Boral and D. DeWitt. Database machines: an idea whose time has passed?. In Proceedings of the Second International Workshop on Database Machines, pages 166-187. Springer-Verlag, 1983.

Keywords: file access pattern, parallel I/O, database machine, pario-bib

Comment: Improvements in I/O bandwidth crucial for supporting database machines, otherwise highly parallel DB machines are useless (I/O bound). Two ways to do it: 1) synchronized interleaving by using custom controller and regular disks to read/write same track on all disks, which speeds individual accesses. 2) use very large cache (100-200M) to keep blocks to re-use and to do prefetching. But see dewitt:pardbs.

bordawekar:collective:
Rajesh Bordawekar. Implementation of collective I/O in the Intel Paragon parallel file system: Initial experiences. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 20-27. ACM Press, July 1997.
See also earlier version bordawekar:collective-tr.

Keywords: collective I/O, multiprocessor file system, parallel I/O, pario-bib

Comment: bordawekar:collective was renamed bordawekar:collective-tr, so this could be called bordawekar:collective.

bordawekar:collective-tr:
Rajesh Bordawekar. Implementation and evaluation of collective I/O in the Intel Paragon Parallel File System. Technical Report CACR TR-128, Center of Advanced Computing Research, California Insititute of Technology, November 1996.
See also later version bordawekar:collective.

Abstract: A majority of parallel applications obtain parallelism by partitioning data over multiple processors. Accessing distributed data structures like arrays from files often requires each processor to make a large number of small non-contiguous data requests. This problem can be addressed by replacing small non-contiguous requests by large collective requests. This approach, known as Collective I/O, has been found to work extremely well in practice. In this paper, we describe implementation and evaluation of a collective I/O prototype in a production parallel file system on the Intel Paragon. The prototype is implemented in the PFS subsystem of the Intel Paragon Operating System. We evaluate the collective I/O performance using its comparison with the PFS M_RECORD and M_UNIX I/O modes. It is observed that collective I/O provides significant performance improvement over accesses in M_UNIX mode. However, in many cases, various implementation overheads cause collective I/O to provide lower performance than the M_RECORD I/O mode.

Keywords: parallel I/O, mutliprocessor file system, pario-bib

Comment: This tech report was called bordawekar:collective, then renamed bordawekar:collective-tr, on the appearance of the ICS paper bordawekar:collective.

bordawekar:comm:
Rajesh Bordawekar and Alok Choudhary. Communication strategies for out-of-core programs on distributed memory machines. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 395-403, Barcelona, July 1995. ACM Press.
See also earlier version bordawekar:comm-tr.

Keywords: parallel I/O, inter-processor communication, pario-bib

Comment: bordawekar:comm-tr is nearly identical in content. Also bordawekar:commstrat is a shorter version.

bordawekar:comm-tr:
Rajesh Bordawekar and Alok Choudhary. Communication strategies for out-of-core programs on distributed memory machines. Technical Report SCCS-667, NPAC, Syracuse University, 1994.
See also later version bordawekar:comm.

Abstract: In this paper, we show that communication in the out-of-core distributed memory problems requires both inter-processor communication and file I/O. Given that primary data structures reside in files, even communication requires I/O. Thus, it is important to optimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method, termed as the "out-of-core" communication method, follows a loosely synchronous model. Computation and Communication phases in this case are clearly separated, and communication requires permutation of data in files. The second method, termed as "demand-driven-in-core communication" considers only communication required of each in-core data slab individually. The third method, termed as "producer-driven-in-core communication" goes even one step further and tries to identify the potential (future) use of data while it is in memory. We describe these methods in detail and provide performance results for out-of-core applications; namely, two-dimensional FFT and two-dimensional elliptic solver. Finally, we discuss how "out-of-core" and "in-core" communication methods could be used in virtual memory environments on distributed memory machines.

Keywords: parallel I/O, inter-processor communication, pario-bib

Comment: They compare different ways to do global communications in out-of-core applications, involving file I/O and communication at different times. They also comment briefly on how it would work if it depended on virtual memory at each node.

bordawekar:commstrat:
Rajesh Bordawekar and Alok Choudhary. Communication strategies for out-of-core programs on distributed memory machines. In Proceedings of the 1995 International Conference on High Performance Computing, pages 130-135, New Delhi, India, December 1995.
See also earlier version bordawekar:comm.

Keywords: interprocessor communication, parallel I/O, pario-bib

Comment: Small version of bordawekar:comm.

bordawekar:compcomm:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. Compilation and communication strategies for out-of-core programs on distributed-memory machines. Journal of Parallel and Distributed Computing, 38(2):277-288, November 1996.
See also earlier version bordawekar:compcomm-tr.

Abstract: It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both inter-processor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method called the generalized collective communication method follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method called the receiver-driven in-core communication considers only communication required of each in-core data slab individually. The third method called the owner-driven in-core communication goes even one step further and tries to identify the potential