Parallel-I/O bibliography, ninth edition

February 22, 1997

These entries were newly added between the eighth and ninth editions of the parallel-I/O bibliography.


agrawal:asynch:
Gagan Agrawal, Anurag Acharya, and Joel Saltz. An interprocedural framework for placement of asynchronous I/O operations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 358-365, Philadelphia, PA, May 1996. ACM Press.

Keyword: compiler, I/O, pario-bib

Comment: Not really about parallel applications or parallel I/O, but I think it may be of interest to that community. They propose a framework that lets a compiler insert asynchronous I/O operations (start I/O, finish I/O) while satisfying the dependency constraints of the program.
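
To make the start I/O / finish I/O idea concrete, here is a minimal sketch of the kind of code such a placement could produce. It is not taken from the paper: the names start_read, finish_read, and process are hypothetical, and a thread pool stands in for a real asynchronous I/O facility. The start call is issued as early as the dependencies allow, and the finish call sits just before the first use of the data.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for an asynchronous I/O facility (an assumption,
# not the interface used in the paper).
_pool = ThreadPoolExecutor(max_workers=1)


def start_read(path: str) -> Future:
    """Begin reading the whole file in the background; return a handle."""
    def _read() -> bytes:
        with open(path, "rb") as f:
            return f.read()
    return _pool.submit(_read)


def finish_read(handle: Future) -> bytes:
    """Block until the read started by start_read has completed."""
    return handle.result()


def process(path: str) -> int:
    handle = start_read(path)        # "start I/O": hoisted above independent work
    independent = sum(range(1000))   # computation that does not depend on the data
    data = finish_read(handle)       # "finish I/O": placed just before first use
    return independent + len(data)
```

The independent computation overlaps with the read, which is the benefit the placement is after; the dependency constraint is simply that finish_read must precede any use of data.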

ap:thesis:
Apratim Purakayastha. Characterizing and Optimizing Parallel File Systems. PhD thesis, Dept. of Computer Science, Duke University, Durham, NC, June 1996. Also available as technical report CS-1996-10.

Abstract: High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. In the first part of this dissertation, we attempt to fill this void by measuring a real file-system workload on a production parallel machine, namely the CM-5 at the National Center for Supercomputing Applications. We collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for design of future parallel file systems. Our usage study showed that writes to write-only files are a dominant part of the workload. Therefore, optimizing writes could have a significant impact on overall performance. In the second part of this dissertation, we propose ENWRICH, a compute-processor write-caching scheme for write-only files in parallel file systems. Within its framework, ENWRICH uses a recently proposed high performance implementation of collective I/O operations called disk-directed I/O, but it eliminates a number of limitations of disk-directed I/O. ENWRICH combines low-overhead write caching at the compute processors with high performance disk-directed I/O at the I/O processors to achieve both low latency and high bandwidth. This combination facilitates the use of the powerful disk-directed I/O technique independent of any particular choice of interface, and without the requirement for mapping libraries at the I/O processors. By collecting writes over many files and applications, ENWRICH lets the I/O processors optimize disk I/O over a large pool of requests. We evaluate our design of ENWRICH using simulated implementation and extensive experimentation. We show that ENWRICH achieves high performance for various configurations and workloads. 
We pinpoint the reasons for ENWRICH's failure to perform well for certain workloads, and suggest possible enhancements. Finally, we discuss the nuances of implementing ENWRICH on a real platform and speculate about possible adaptations of ENWRICH for emerging multiprocessing platforms.

Keyword: parallel I/O, multiprocessor file system, file access patterns, workload characterization, file caching, disk-directed I/O, pario-bib

Comment: See also ap:enwrich, ap:workload, and nieuwejaar:workload.

arunachalam:prefetch2:
Meenakshi Arunachalam, Alok Choudhary, and Brad Rullman. Implementation and evaluation of prefetching in the Intel Paragon Parallel File System. In Proceedings of the Tenth International Parallel Processing Symposium, pages 554-559, April 1996.

Abstract: The significant difference between the speeds of the I/O system (e.g., disks) and compute processors in parallel systems creates a bottleneck that lowers the performance of an application that does a considerable amount of disk accesses. A major portion of the compute processors' time is wasted on waiting for I/O to complete. This problem can be addressed to a certain extent, if the necessary data can be fetched from the disk before the I/O call to the disk is issued. Fetching data ahead of time, known as prefetching, in a multiprocessor environment depends a great deal on the application's access pattern. The subject of this paper is implementation and performance evaluation of a prefetching prototype in a production parallel file system on the Intel Paragon. Specifically, this paper presents a) design and implementation of a prefetching strategy in the parallel file system and b) performance measurements and evaluation of the file system with and without prefetching. The prototype is designed at the operating system level for the PFS. It is implemented in the PFS subsystem of the Intel Paragon Operating System. It is observed that in many cases prefetching provides considerable performance improvements. In some other cases no improvements or some performance degradation is observed due to the overheads incurred in prefetching.

Keyword: parallel I/O, prefetching, multiprocessor file system, pario-bib

Comment: See arunachalam:prefetch.

bordawekar:collective:
Rajesh Bordawekar. Implementation and evaluation of collective I/O in the Intel Paragon Parallel File System. Technical Report CACR TR-128, Center for Advanced Computing Research, California Institute of Technology, November 1996.

Abstract: A majority of parallel applications obtain parallelism by partitioning data over multiple processors. Accessing distributed data structures like arrays from files often requires each processor to make a large number of small non-contiguous data requests. This problem can be addressed by replacing small non-contiguous requests by large collective requests. This approach, known as Collective I/O, has been found to work extremely well in practice. In this paper, we describe implementation and evaluation of a collective I/O prototype in a production parallel file system on the Intel Paragon. The prototype is implemented in the PFS subsystem of the Intel Paragon Operating System. We evaluate the collective I/O performance using its comparison with the PFS M_RECORD and M_UNIX I/O modes. It is observed that collective I/O provides significant performance improvement over accesses in M_UNIX mode. However, in many cases, various implementation overheads cause collective I/O to provide lower performance than the M_RECORD I/O mode.

Keyword: parallel I/O, multiprocessor file system, pario-bib

bordawekar:compcomm-tr:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. Compilation and communication strategies for out-of-core programs on distributed memory machines. Technical Report CACR-113, Scalable I/O Initiative, Center for Advanced Computing Research, California Institute of Technology, November 1995.
See also later version bordawekar:compcomm.

Abstract: It is widely acknowledged that improving parallel I/O performance is critical for widespread adoption of high performance computing. In this paper, we show that communication in out-of-core distributed memory problems may require both inter-processor communication and file I/O. Thus, in order to improve I/O performance, it is necessary to minimize the I/O costs associated with a communication step. We present three methods for performing communication in out-of-core distributed memory problems. The first method called the generalized collective communication method follows a loosely synchronous model; computation and communication phases are clearly separated, and communication requires permutation of data in files. The second method called the receiver-driven in-core communication considers only communication required of each in-core data slab individually. The third method called the owner-driven in-core communication goes even one step further and tries to identify the potential future use of data (by the recipients) while it is in the sender's memory. We describe these methods in detail and present a simple heuristic to choose a communication method from among the three methods. We then provide performance results for two out-of-core applications, the two-dimensional FFT code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how the out-of-core and in-core communication methods can be used in virtual memory environments on distributed memory machines.

Comment: See also bordawekar:comm, at ICS'95.

bordawekar:placement-tr:
Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. A framework for integrated communication and I/O placement. Technical Report CACR-117, Scalable I/O Initiative, Center for Advanced Computing Research, California Institute of Technology, February 1996.
See also later version bordawekar:placement.

Abstract: In this paper, we describe a framework for optimizing communication and I/O costs in out-of-core problems. We focus on communication and I/O optimization within a FORALL construct. We show that existing frameworks do not extend directly to out-of-core problems and cannot exploit the FORALL semantics. We present a unified framework for the placement of I/O and communication calls and apply it for optimizing communication for stencil applications. Using the experimental results, we demonstrate that correct placement of I/O and communication calls can completely eliminate extra file I/O from communication and obtain significant performance improvement.

Keyword: parallel I/O, compiler, pario-bib

brezany:architecture:
Peter Brezany, Thomas A. Mueck, and Erich Schikuta. A software architecture for massively parallel input-output. In Third International Workshop PARA'96 (Applied Parallel Computing - Industrial Computation and Optimization), volume 1186 of Lecture Notes in Computer Science, pages 85-96, Lyngby, Denmark, August 1996. Springer-Verlag. Also available as Technical Report of the Inst. f. Angewandte Informatik u. Informationssysteme, University of Vienna, TR 96202.

Abstract: For an increasing number of data intensive scientific applications, parallel I/O concepts are a major performance issue. Tackling this issue, we provide an outline of an input/output system designed for highly efficient, scalable and conveniently usable parallel I/O on distributed memory systems. The main focus of this paper is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of High Performance FORTRAN efforts. Specifically, our design is presented in the context of the Vienna Fortran Compilation System.

Keyword: compiler transformations, runtime support, parallel I/O, prefetching, pario-bib

brezany:compiling:
Peter Brezany, Thomas A. Mueck, and Erich Schikuta. Mass storage support for a parallelizing compilation system. In International Conference Eurosim'96 - HPCN challenges in Telecomp and Telecom: Parallel Simulation of Complex Systems and Large Scale Applications, pages 63-70, Delft, The Netherlands, June 1996. North-Holland, Elsevier Science.

Keyword: parallel I/O, high performance mass storage system, high performance languages, compilation techniques, data administration, pario-bib

brezany:irregular-tr:
P. Brezany and A. Choudhary. Techniques and optimizations for developing irregular out-of-core applications on distributed-memory systems. Technical Report 96-4, Institute for Software Technology and Parallel Systems, University of Vienna, November 1996.

Keyword: parallel I/O, out of core, irregular applications, compiler, pario-bib

cao:tickertaip-tr:
Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The TickerTAIP parallel RAID architecture. Technical Report HPL-92-151, HP Labs, December 1992.
See also later version cao:tickertaip-tr2.

Keyword: parallel I/O, RAID, pario-bib

Comment: A parallelized RAID architecture that distributes the RAID controller operations across several worker nodes. Multiple hosts can connect to different workers, allowing multiple paths into the array. The workers then communicate on their own fast interconnect to accomplish the requests, distributing parity computations across multiple workers. They get much better performance and reliability than plain RAID. They built a prototype and a performance simulator. Two-phase commit was needed for request atomicity, and a request sequencer was needed for serialization. They also found it was good to give the whole request info to all workers and to let them figure out what to do and when. Superseded by cao:tickertaip-tr2 and cao:tickertaip.
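
As a rough illustration of why parity computation can be spread across workers, the sketch below relies only on the fact that XOR parity is associative and commutative: each worker can XOR its own share of the data blocks, and the partial results can then be combined. This is not the TickerTAIP protocol itself (no two-phase commit, no sequencer), and the function names and the round-robin assignment of blocks to workers are assumptions made for the example.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of one or more equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("blocks must have equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def distributed_parity(data_blocks: List[bytes], num_workers: int) -> bytes:
    """Each hypothetical worker XORs its share; partial parities are then combined."""
    shares = [data_blocks[i::num_workers] for i in range(num_workers)]
    partials = [xor_blocks(share) for share in shares if share]
    return xor_blocks(partials)


def reconstruct(surviving_blocks: List[bytes], parity: bytes) -> bytes:
    """Recover a single lost data block from the survivors and the parity block."""
    return xor_blocks(surviving_blocks + [parity])
```

Because the combined result equals the parity a single controller would compute, any one lost block can still be rebuilt from the remaining blocks and the parity.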

carretero:mapping:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. I/O data mapping in ParFiSys: support for high-performance I/O in parallel and distributed systems. In Euro-Par '96, volume 1123 of Lecture Notes in Computer Science, pages 522-526. Springer-Verlag, August 1996.

Abstract: This paper gives an overview of the I/O data mapping mechanisms of ParFiSys. Grouped management and parallelization are presented as relevant features. I/O data mapping mechanisms of ParFiSys, including all levels of the hierarchy, are described in this paper.

Keyword: parallel I/O, multiprocessor file system, pario-bib

carretero:subsystem:
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. A massively parallel and distributed I/O subsystem. Computer Architecture News, 24(3):1-8, June 1996.

Keyword: parallel I/O, I/O architecture, pario-bib

Comment: See carretero:*, rosales:cds, perez:clfs.

chehadeh:oodb:
Y. C. Chehadeh, A. R. Hurson, L. L. Miller, S. Pakzad, and B. N. Jamoussi. Application for parallel disks for efficient handling of object-oriented databases. In Proceedings of the 1993 IEEE Symposium on Parallel and Distributed Processing, pages 184-191. IEEE Computer Society Press, 1993.

Abstract: In today's workstation based environment, applications such as design databases, multimedia databases, and knowledge bases do not fit well into the relational data processing framework. The object-oriented data model has been proposed to model and process such complex databases. Due to the nature of the supported applications, object-oriented database systems need efficient mechanisms for the retrieval of complex objects and the navigation along the semantic links among objects. Object clustering and buffering have been suggested as efficient mechanisms for the retrieval of complex objects. However, to improve the efficiency of the aforementioned operations, one has to look at the recent advances in storage technology. This paper is an attempt to investigate the feasibility of using parallel disks for object-oriented databases. It analyzes the conceptual changes needed to map the clustering and buffering schemes proposed on the new underlying architecture. The simulation and performance evaluation of the proposed leveled-clustering and mapping schemes utilizing parallel I/O disks are presented and analyzed.

Keyword: parallel I/O, disk array, object oriented database, pario-bib

chen:panda-model:
Y. Chen, M. Winslett, S. Kuo, Y. Cho, M. Subramaniam, and K. E. Seamons. Performance modeling for the Panda array I/O library. In Proceedings of Supercomputing '96. ACM Press and IEEE Computer Society Press, November 1996.

Abstract: We present an analytical performance model for Panda, a library for synchronized i/o of large multidimensional arrays on parallel and sequential platforms, and show how the Panda developers use this model to evaluate Panda's parallel i/o performance and guide future Panda development. The model validation shows that system developers can simplify performance analysis, identify potential performance bottlenecks, and study the design trade-offs for Panda on massively parallel platforms more easily than by conducting empirical experiments. More importantly, we show that the outputs of the performance model can be used to help make optimal plans for handling application i/o requests, the first step toward our long-term goal of automatically optimizing i/o request handling in Panda.

Keyword: performance modeling, parallel I/O, pario-bib

Comment: Web and CDROM only.

chen:raid-perf:
S. Chen and D. Towsley. A performance evaluation of RAID architectures. IEEE Transactions on Computers, 45(10):1116, October 1996.

Keyword: verify pages, parallel I/O, RAID, disk array, pario-bib

chiang:graph:
Yi-Jen Chiang, Michael T. Goodrich, Edward F. Grove, Roberto Tamassia, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory graph algorithms (extended abstract). In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA '95), pages 139-149, January 1995.

Abstract: We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems. Our results include:
- Proximate-neighboring. We present a simple method for deriving external-memory lower bounds via reductions from a problem we call the "proximate neighbors" problem. We use this technique to derive non-trivial lower bounds for such problems as list ranking, expression tree evaluation, and connected components.
- PRAM simulation. We give methods for efficiently simulating PRAM computations in external memory, even for some cases in which the PRAM algorithm is not work-optimal. We apply this to derive a number of optimal (and simple) external-memory graph algorithms.
- Time-forward processing. We present a general technique for evaluating circuits (or "circuit-like" computations) in external memory. We also use this in a deterministic list ranking algorithm.
- Deterministic 3-coloring of a cycle. We give several optimal methods for 3-coloring a cycle, which can be used as a subroutine for finding large independent sets for list ranking. Our ideas go beyond a straightforward PRAM simulation, and may be of independent interest.
- External depth-first search. We discuss a method for performing depth first search and solving related problems efficiently in external memory. Our technique can be used in conjunction with ideas due to Ullman and Yannakakis in order to solve graph problems involving closed semi-ring computations even when their assumption that vertices fit in main memory does not hold.

Our techniques apply to a number of problems, including list ranking, which we discuss in detail, finding Euler tours, expression-tree evaluation, centroid decomposition of a tree, least-common ancestors, minimum spanning tree verification, connected and biconnected components, minimum spanning forest, ear decomposition, topological sorting, reachability, graph drawing, and visibility representation.

chiung-san:xdac:
Lee Chiung-San, Parng Tai-Ming, Lee Jew-Chin, Tsai Cheng-Nan, and Farn Kwo-Jean. Performance analysis of the XDAC disk array system. In Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing, pages 620-627. IEEE Computer Society Press, 1994.

Abstract: The paper presents an analytical model of a whole disk array architecture, XDAC, which consists of several major subsystems and features: the two-dimensional array structure; IO-bus with split transaction protocol; and cache for processing multiple I/O requests in parallel. Our modelling approach is based on a subsystem access time per request (SATPR) concept, in which we model for each subsystem the mean access time per disk array request. The model is fed with a given set of representative workload parameters and then used to conduct performance analysis for exploring the impact of fork/join synchronization as well as evaluating some architectural design issues of the XDAC system. Moreover, by comparing the SATPRs of subsystems, we can identify the bottleneck for performance improvements.

Keyword: disk array, performance evaluation, analytical model, parallel I/O, pario-bib

choudhary:sdcr:
Alok Choudhary and David Kotz. Large-scale file systems with the flexibility of databases. ACM Computing Surveys, 28A(4), December 1996. Position paper for the Working Group on Storage I/O for Large-Scale Computing, ACM Workshop on Strategic Directions in Computing Research. Available on-line only, at http://www.acm.org/surveys/1996/ChoudharyFile/.

Keyword: file system, database, parallel I/O, pario-bib

Comment: A position paper for the Strategic Directions in Computing Research workshop at MIT in June 1996.

chung-sheng:arrays:
Li Chung-Sheng, Chen Ming-Syan, P. S. Yu, and Hsiao Hui-I. Combining replication and parity approaches for fault-tolerant disk arrays. In Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing, pages 360-367. IEEE Computer Society Press, 1994.

Abstract: We explore the method of combining the replication and parity approaches to tolerate multiple disk failures in a disk array. In addition to the conventional mirrored and chained declustering methods, a method based on the hybrid of mirrored-and-chained declustering is explored. A performance study that explores the effect of combining replication and parity approaches is conducted. It is experimentally shown that the proposed approach can lead to the most cost-effective solution if the objective is to sustain the same load as before the failures.

Keyword: fault tolerance, disk array, replication, declustering, parallel I/O, pario-bib

Comment: Consider hybrid chained and mirrored declustering.

corbett:jvesta:
Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, August 1996.
See also earlier version corbett:vesta.

Keyword: multiprocessor file system, Vesta, parallel I/O, pario-bib

Comment: See also corbett:pfs, corbett:vesta*, feitelson:pario. This is the ultimate Vesta reference. There seem to be only a few small things that are completely new over what's been published elsewhere, although this presentation is much more complete and polished.

corbett:sio-api1.0:
Peter F. Corbett, Jean-Pierre Prost, Chris Demetriou, Garth Gibson, Erik Riedel, Jim Zelenka, Yuqun Chen, Ed Felten, Kai Li, John Hartman, Larry Peterson, Brian Bershad, Alec Wolman, and Ruth Aydt. Proposal for a common parallel file system programming interface. WWW http://www.cs.arizona.edu/sio/api1.0.ps, September 1996. Version 1.0.

Keyword: parallel I/O, multiprocessor file system interface, pario-bib

Comment: Specs of the proposed SIO low-level interface for parallel file systems. Key features: linear file model, scatter-gather read and write calls (list of strided segments), asynch versions of all calls, extensive hint system. Naming structure is unspecified; no directories specified. Permissions left out. Some control over client caching and over disk layout. Each file has a (small) 'label', which is just a little space for application-controlled meta data. Optional extensions: collective read and write calls, fast copy.
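
To show what a "list of strided segments" means for a scatter-gather read over a linear file, here is a minimal sketch. The StridedSegment fields and the expand and gather_read functions are hypothetical names chosen for this example; they are not the call signatures of the SIO proposal, and an in-memory byte string stands in for the file.

```python
from typing import List, NamedTuple, Tuple


class StridedSegment(NamedTuple):
    """Hypothetical descriptor: `count` pieces of `length` bytes, `stride` bytes apart."""
    offset: int  # byte offset of the first piece in the file
    length: int  # bytes in each contiguous piece
    stride: int  # distance between the starts of consecutive pieces
    count: int   # number of pieces


def expand(segments: List[StridedSegment]) -> List[Tuple[int, int]]:
    """Flatten strided segments into (offset, length) extents, in request order."""
    extents = []
    for seg in segments:
        for i in range(seg.count):
            extents.append((seg.offset + i * seg.stride, seg.length))
    return extents


def gather_read(file_bytes: bytes, segments: List[StridedSegment]) -> bytes:
    """Collect all requested pieces into one contiguous buffer."""
    return b"".join(file_bytes[off:off + n] for off, n in expand(segments))
```

For example, one segment with offset 0, length 2, stride 4, and count 3 describes the byte ranges [0,2), [4,6), and [8,10), so a single call can fetch a column-like slice that would otherwise take three separate reads.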

cormen:early-vic:
Thomas H. Cormen and Melissa Hirschl. Early experiences in evaluating the parallel disk model with the ViC* implementation. Technical Report PCS-TR96-293, Dept. of Computer Science, Dartmouth College, August 1996. To appear in Parallel Computing.

Abstract: Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total time to perform an out-of-core computation. This paper analyzes timing results on a uniprocessor with several disks for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM.

The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, two (problem size and memory size) are good indicators of I/O time and running time, but the other two (block size and number of disks) are not. Third, because PDM algorithms tend not to be I/O bound, asynchronous I/O effectively hides I/O times.

The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several parallel file systems and target machines.

Keyword: parallel I/O, parallel I/O algorithm, compiler, pario-bib

cormen:fft:
Thomas H. Cormen and David M. Nicol. Performing out-of-core FFTs on parallel disk systems. Parallel Computing, 1997. To appear; currently available as Dartmouth Technical Report PCS-TR96-294.
See also earlier version cormen:fft-tr.

Keyword: verify month number volume and pages, parallel I/O, out of core, scientific computing, FFT, pario-bib

cormen:fft-tr:
Thomas H. Cormen and David M. Nicol. Performing out-of-core FFTs on parallel disk systems. Technical Report PCS-TR96-294, Dept. of Computer Science, Dartmouth College, 1996.
See also later version cormen:fft.

Abstract: The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when they run entirely in memory.

Keyword: parallel I/O, out of core, scientific computing, FFT, pario-bib

cormen:fft2-tr:
Thomas H. Cormen, Jake Wegmann, and David M. Nicol. Multiprocessor out-of-core FFTs with distributed memory and parallel disks. Technical Report PCS-TR97-303, Dept. of Computer Science, Dartmouth College, 1997. Submitted to SPAA'97.

Abstract: This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of "mini-butterfly" computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86% of the time of those that do. Moreover, the faster methods are much easier to implement.

Keyword: parallel I/O, out of core, scientific computing, FFT, pario-bib

Comment: Extends the work of cormen:fft.

cortes:paca-tr:
Toni Cortes, Sergi Girona, and Jesús Labarta. PACA: A cooperative file system cache for parallel machines. Technical Report 96-07, UPC-CEPBA, 1996.
See also later version cortes:paca.

Keyword: file caching, multiprocessor file system, cooperative caching, parallel I/O, pario-bib

Comment: See cortes:paca.

cortes:pafs:
Toni Cortes, Sergi Girona, and Jesús Labarta. Avoiding the cache-coherence problem in a parallel/distributed file system. In Proceedings of the High-Performance Computing and Networking, April 1997.

Abstract: In this paper we describe PAFS, a new parallel/distributed file system. Within the whole file system, special interest is placed on the caching mechanism. We present a cooperative cache that has the advantages of cooperation and avoids the problems derived from the coherence mechanisms. Furthermore, this has been achieved with a reasonable gain in performance. In order to show the obtained performance, we present a comparison between PAFS and xFS (a file system that also implements a cooperative cache).

Keyword: verify pages, file caching, multiprocessor file system, cooperative caching, cache coherence, parallel I/O, pario-bib

Comment: Contact toni@ac.upc.es.

cypher:jrequire:
Robert Cypher, Alex Ho, Smaragda Konstantinidou, and Paul Messina. A quantitative study of parallel scientific applications with explicit communication. Journal of Supercomputing, 10(1):5-24, March 1996.
See also earlier version cypher:require.

Keyword: workload characterization, scientific computing, parallel programming, message passing, pario-bib

Comment: Some mention of I/O.

demmel:eosdis:
James Demmel, Melody Y. Ivory, and Sharon L. Smith. Modeling and identifying bottlenecks in EOSDIS. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 300-308. IEEE Computer Society Press, October 1996.

Abstract: Many parallel application areas that exploit massive parallelism, such as climate modeling, require massive storage systems for the archival and retrieval of data sets. As such, advances in massively parallel computation must be coupled with advances in mass storage technology in order to satisfy I/O constraints of these applications. We demonstrate the effects of such I/O-computation disparity for a representative distributed information system, NASA's Earth Observing System Distributed Information System (EOSDIS). We use performance modeling to identify bottlenecks in EOSDIS for two representative user scenarios from climate change research.

Keyword: climate modeling, performance modeling, parallel I/O, pario-bib

fineberg:pmpio:
Samuel A. Fineberg, Parkson Wong, Bill Nitzberg, and Chris Kuszmaul. PMPIO - A portable implementation of MPI-IO. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 188-195. IEEE Computer Society Press, October 1996.

Abstract: MPI-IO provides a demonstrably efficient portable parallel Input/Output interface, compatible with the MPI standard. PMPIO is a "reference implementation" of MPI-IO, developed at NASA Ames Research Center. To date, PMPIO has been ported to the IBM SP-2, SGI and Sun shared memory workstations, the Intel Paragon, and the Cray J90. Preliminary results using the PMPIO implementation of MPI-IO show an improvement of as much as a factor of 20 on the NAS BTIO benchmark compared to a Fortran based implementation. We show comparative results on the SP-2, Paragon, and SGI architectures.

Keyword: parallel I/O, pario-bib

gibson:nasd-tr:
Garth A. Gibson, David P. Nagle, Khalil Amiri, Fay W. Chang, Eugene Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, and David Rochberg. A case for network-attached secure disks. Technical Report CMU-CS-96-142, Carnegie Mellon University, June 1996.

Keyword: parallel I/O, network-attached storage, distributed file systems, pario-bib

Comment: See http://www.cs.cmu.edu/Groups/NASD/ARPA96/server.html

golubchik:striping:
Leana Golubchik, Richard R. Muntz, and Richard W. Watson. Analysis of striping techniques in robotic storage libraries. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 225-238. IEEE Computer Society Press, September 1995.

Abstract: In recent years advances in computational speed have been the main focus of research and development in high performance computing. In comparison, the improvement in I/O performance has been modest. Faster processing speeds have created a need for faster I/O as well as for the storage and retrieval of vast amounts of data. The technology needed to develop these mass storage systems exists today. Robotic storage libraries are vital components of such systems. However, they normally exhibit high latency and long transmission times. We analyze the performance of robotic storage libraries and study striping as a technique for improving response time. Although striping has been extensively studied in the context of disk arrays, the architectural differences between robotic storage libraries and arrays of disks suggest that a separate study of striping techniques in such libraries would be beneficial.

Keyword: mass storage, parallel I/O, pario-bib

grossman:library:
R. Grossman, X. Qin, W. Xu, H. Hulen, and T. Tyler. An architecture for a scalable high-performance digital library. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 89-98. IEEE Computer Society Press, September 1995.

Abstract: Requirements for a high-performance, scalable digital library of multimedia data are presented together with a layered architecture for a system that addresses the requirements. The approach is to view digital data as persistent collections of complex objects and to use lightweight object management to manage this data. To scale as the amount of data increases, the object management component is layered over a storage management component. The storage management component supports hierarchical storage, third-party data transfer and parallel input-output. Several issues that arise from the interface between the storage management and object management components are discussed. The authors have developed a prototype of a digital library using this design. Two key components of the prototype are AIM Net and HPSS. AIM Net is a persistent object manager and is a product of Oak Park Research. HPSS is the High Performance Storage System, developed by a collaboration including IBM Government Systems and several national labs.

Keyword: mass storage, parallel I/O, pario-bib

johnson:scx:
Steve Johnson and Steve Scott. A supercomputer system interconnect and scalable IOS. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 357-367. IEEE Computer Society Press, September 1995.

Abstract: The evolution of system architectures and system configurations has created the need for a new supercomputer system interconnect. Attributes required of the new interconnect include commonality among system and subsystem types, scalability, low latency, high bandwidth, a high level of resiliency, and flexibility. Cray Research Inc. is developing a new system channel to meet these interconnect requirements in future systems. The channel has a ring-based architecture, but can also function as a point-to-point link. It integrates control and data on a single, physical path while providing low latency and variance for control messages. Extensive features for client isolation, diagnostic capabilities, and fault tolerance have been incorporated into the design. The attributes and features of this channel are discussed along with implementation and protocol specifics.

Keyword: mass storage, I/O architecture, I/O interconnect, supercomputer, parallel I/O, pario-bib

Comment: About the Cray Research SCX channel, capable of 1200 MB/s peak and 900 MB/s delivered throughput.

jones:mpi-io:
Terry Jones, Richard Mark, Jeanne Martin, John May, Elsie Pierce, and Linda Stanberry. An MPI-IO interface to HPSS. In Proceedings of the Fifth NASA Goddard conference on Mass Storage Systems, pages I:37-50, September 1996.

Keyword: mass storage, parallel I/O, multiprocessor file system interface, pario-bib

kandemir:io-optimize:
Mahmut Kandemir, Alok Choudhary, and Rajesh Bordawekar. I/O optimizations for compiling out-of-core programs on distributed-memory machines. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing. Society for Industrial and Applied Mathematics, March 1997. To appear. Extended Abstract.

Abstract: Since many large-scale computational problems deal with large quantities of data, optimizing the performance of I/O subsystems of massively parallel machines is an important challenge for system designers. We describe data access reorganization strategies for efficient compilation of out-of-core data-parallel programs on distributed memory machines. Our analytical approach and experimental results indicate that the optimizations introduced in this paper can reduce the amount of time spent in I/O by as much as an order of magnitude on both uniprocessors and multicomputers.

Keyword: verify pages, parallel I/O, compiler, out-of-core, pario-bib

kandemir:optimize:
Mahmut Kandemir, Alok Choudhary, J. Ramanujam, and Rajesh Bordawekar. Optimizing out-of-core computations in uniprocessors. In Proceedings of the Workshop on Interaction between Compilers and Computer Architectures, pages 1-10, February 1997.

Abstract: Programs accessing disk-resident arrays generally perform poorly due to an excessive number of I/O calls and insufficient help from compilers. To alleviate this problem, we propose a series of compiler optimizations. Both the analytical approach we use and the experimental results provide strong evidence that our method is very effective on uniprocessors for out-of-core nests whose data sizes far exceed the size of available memory.

Keyword: verify publisher, parallel I/O, compiler, out-of-core, pario-bib

kandemir:reorganize:
Mahmut Kandemir, Rajesh Bordawekar, and Alok Choudhary. Data access reorganizations in compiling out-of-core data parallel programs on distributed memory machines. In Proceedings of the Eleventh International Parallel Processing Symposium, April 1997.

Abstract: This paper describes optimization techniques for translating out-of-core programs written in a data parallel language to message passing node programs with explicit parallel I/O. We demonstrate that straightforward extension of in-core compilation techniques does not work well for out-of-core programs. We then describe how the compiler can optimize the code by (1) determining appropriate file layouts for out-of-core arrays, (2) permuting the loops in the nest(s) to allow efficient file access, and (3) partitioning the available node memory among references based on I/O cost estimation. Our experimental results indicate that these optimizations can reduce the amount of time spent in I/O by as much as an order of magnitude.

Keyword: verify pages, compiler, data-parallel, out-of-core, parallel I/O, pario-bib
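
Sketch: the kandemir:reorganize abstract above argues that file layout and loop order must agree for efficient file access. The Python fragment below is a toy cost model of that point only, not the authors' compiler algorithm. The function names, the row-major layout, and the rule that one read call fetches one contiguous run of at most `buffer_elems` elements are all assumptions made for illustration.

```python
def row_major_offsets(n, loop_order):
    """File offsets touched when an n-by-n array stored row-major
    is traversed with loop order "ij" (i outer) or "ji" (j outer)."""
    if loop_order == "ij":
        return [i * n + j for i in range(n) for j in range(n)]
    return [i * n + j for j in range(n) for i in range(n)]


def io_calls(offsets, buffer_elems):
    """Count read calls, assuming each call fetches one contiguous
    run of at most buffer_elems elements (hypothetical cost model)."""
    calls = 0
    run = 0
    prev = None
    for off in offsets:
        if prev is not None and off == prev + 1 and run < buffer_elems:
            run += 1          # extends the current contiguous run
        else:
            calls += 1        # a new call is needed
            run = 1
        prev = off
    return calls


n = 8
matching = io_calls(row_major_offsets(n, "ij"), buffer_elems=n)
mismatched = io_calls(row_major_offsets(n, "ji"), buffer_elems=n)
```

Under this model the loop order that follows the layout needs one call per row, while the permuted order needs one call per element, which is the kind of gap that loop permutation or a changed file layout is meant to close.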

kandemir:tiling:
Mahmut Kandemir, Rajesh Bordawekar, Alok Choudhary, and J. Ramanujam. A unified tiling approach for out-of-core computations. In Sixth Workshop on Compilers for Parallel Computers, pages 323-334, Aachen, Germany, December 1996. Forschungszentrum Jülich GmbH. Also available as Caltech Technical Report CACR 130.

Abstract: This paper describes a framework by which an out-of-core stencil program written in a data-parallel language can be translated into node programs in a distributed-memory message-passing machine with explicit I/O and communication. We focus on a technique called Data Space Tiling to group data elements into slabs that can fit into memories of processors. Methods to choose legal tile shapes under several constraints and deadlock-free scheduling of tiles are investigated. Our approach is unified in the sense that it can be applied to both FORALL loops and the loops that involve flow-dependences.

Keyword: parallel I/O, compiler, out-of-core, pario-bib

kimbrel:prefetch:
Tracy Kimbrel, Pei Cao, Edward Felten, Anna Karlin, and Kai Li. Integrating parallel prefetching and caching. In Proceedings of the 1996 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 262-263, Philadelphia, PA, May 1996. ACM Press. Poster paper.

Keyword: disk prefetching, parallel I/O, pario-bib

Comment: They do a theoretical analysis of prefetching and caching in uniprocessor, single- and multi-disk situations, given that they know the complete access sequence; their measure is not hit rate but rather overall execution time. They found some algorithms that are close to optimal.

kimbrel:prefetch-trace:
Tracy Kimbrel, Andrew Tomkins, R. Hugo Patterson, Brian Bershad, Pei Cao, Edward Felten, Garth Gibson, Anna R. Karlin, and Kai Li. A trace-driven comparison of algorithms for parallel prefetching and caching. In Proceedings of the 1996 Symposium on Operating Systems Design and Implementation, pages 19-34. USENIX Association, October 1996.

Abstract: High-performance I/O systems depend on prefetching and caching in order to deliver good performance to applications. These two techniques have generally been considered in isolation, even though there are significant interactions between them; a block prefetched too early reduces the effectiveness of the cache, while a block cached too long reduces the effectiveness of prefetching. In this paper we study the effects of several combined prefetching and caching strategies for systems with multiple disks. Using disk-accurate trace-driven simulation, we explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints. Some of the strategies have been published with theoretical performance bounds, and some are components of systems that have been built. One is a new algorithm that combines the desirable characteristics of the others. We find that when performance is limited by I/O stalls, aggressive prefetching helps to alleviate the problem; that more conservative prefetching is appropriate when significant I/O stalls are not present; and that a single, simple strategy is capable of doing both.

Keyword: parallel I/O, tracing, prefetch, trace-driven simulation, pario-bib

kobler:eosdis:
Ben Kobler, John Berbert, Parris Caulk, and P. C. Hariharan. Architecture and design of storage and data management for the NASA Earth Observing System Data and Information System (EOSDIS). In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 65-76. IEEE Computer Society Press, September 1995.

Abstract: Mission to Planet Earth (MTPE) is a long-term NASA research mission to study the processes leading to global climate change. The EOS Data and Information System (EOSDIS) is the component within MTPE that will provide the Earth science community with easy, affordable, and reliable access to Earth science data. EOSDIS is a distributed system, with major facilities at eight Distributed Active Archive Centers (DAACs) located throughout the United States. At the DAACs the Science Data Processing Segment (SDPS) will receive, process, archive, and manage all data. It is estimated that several hundred gigaflops of processing power will be required to process and archive the several terabytes of new data that will be generated and distributed daily. Thousands of science users and perhaps several hundred thousand nonscience users will access the system.

Keyword: mass storage, I/O architecture, parallel I/O, pario-bib

kotz:app-pario:
David Kotz. Applications of parallel I/O. Technical Report PCS-TR96-297, Dept. of Computer Science, Dartmouth College, October 1996. Release 1.

Abstract: Scientific applications are increasingly being implemented on massively parallel supercomputers. Many of these applications have intense I/O demands, as well as massive computational requirements. This paper is essentially an annotated bibliography of papers and other sources of information about scientific applications using parallel I/O. It will be updated periodically.

Keyword: parallel I/O application, file access patterns, pario-bib

kotz:flexibility2:
David Kotz and Nils Nieuwejaar. Flexibility and performance of parallel file systems. In Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC), volume 1127 of Lecture Notes in Computer Science, pages 1-11. Springer-Verlag, September 1996.
See also earlier version kotz:flexibility.

Abstract: As we gain experience with parallel file systems, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk-management strategy. Furthermore, the proliferation of file-system interfaces and abstractions makes applications difficult to port.

We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (APIs).

We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.

Keyword: parallel I/O, multiprocessor file system, dfk, pario-bib

Comment: Nearly identical to kotz:flexibility. The only changes are the format, a shorter abstract, and updates to Section 7 and the references.

kotz:tuning:
David Kotz. Tuning STARFISH. Technical Report PCS-TR96-296, Dept. of Computer Science, Dartmouth College, October 1996.

Abstract: STARFISH is a parallel file-system simulator we built for our research into the concept of disk-directed I/O. In this report, we detail steps taken to tune the file systems supported by STARFISH, which include a traditional parallel file system (with caching) and a disk-directed I/O system. In particular, we now support two-phase I/O, use smarter disk scheduling, increased the maximum number of outstanding requests that a compute processor may make to each disk, and added gather/scatter block transfer. We also present results of the experiments driving the tuning effort.

Keyword: parallel I/O, multiprocessor file system, pario-bib

Comment: Reports on some new changes to the STARFISH simulator that implements traditional caching and disk-directed I/O. This is meant mainly as a companion to kotz:jdiskdir. See also kotz:jdiskdir, kotz:diskdir, kotz:expand.

kwong:distribution:
Peter Kwong and Shikharesh Majumdar. Study of data distribution strategies for parallel I/O management. In Proceedings of the Third International Conference of the Austrian Center for Parallel Computation (ACPC), volume 1127 of Lecture Notes in Computer Science, pages 12-23. Springer-Verlag, September 1996.

Abstract: Recent studies have demonstrated that a significant number of I/O operations are performed by a number of classes of different parallel applications. Appropriate I/O management strategies are required, however, for harnessing the power of parallel I/O. This paper focuses on two I/O management issues that affect system performance in multiprogrammed parallel environments. Characterization of I/O behavior of parallel applications in terms of four different models is discussed first, followed by an investigation of the performance of a number of different data distribution strategies. Using computer simulations, this research shows that I/O characteristics of applications and data distribution have an important effect on system performance. Applications that can simultaneously do computation and I/O, plus strategies that can incorporate centralized I/O management, are found to be beneficial for a multiprogrammed parallel environment.

Keyword: parallel I/O, pario-bib

Comment: See majumdar:management.

large-scale-memories:
Special issue on large-scale memories. Algorithmica, 1994.

lawlor:parity:
F. D. Lawlor. Efficient mass storage parity recovery mechanism. IBM Technical Disclosure Bulletin, 24(2):986-987, July 1981.

Keyword: parallel I/O, disk array, RAID, pario-bib

Comment: An early paper, perhaps the earliest, that describes the techniques that later became RAID. Lawlor notes how to use parity to recover data lost due to disk crash, as in RAID3, addresses the read-before-write problem by caching the old data block as well as the new data block, and shows how two-dimensional parity can protect against two or more failures.
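
Sketch: the parity recovery and read-before-write update mentioned in the comment above can be illustrated with byte-wise XOR. This Python fragment is a generic single-parity example, not Lawlor's mechanism as published. The function names and the four-byte blocks are assumptions made for illustration, and the two-dimensional parity scheme is not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity block."""
    return xor_blocks(surviving_blocks + [parity])


def update_parity(old_parity, old_block, new_block):
    """New parity after overwriting one block. The old block must be known,
    which is why caching it avoids an extra read before the write."""
    return xor_blocks([old_parity, old_block, new_block])


data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Pretend the disk holding data[1] crashed.
rebuilt = recover([data[0], data[2]], parity)

# Overwrite data[0] and update parity without rereading the other blocks.
new_block = b"\xff\x00\xff\x00"
new_parity = update_parity(parity, data[0], new_block)
```

A single parity block of this kind tolerates one lost block per stripe; surviving two or more failures needs additional redundancy, such as the two-dimensional arrangement the comment refers to.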

lee:logical-disks:
Jang Sun Lee, Jungmin Kim, P. Bruce Berra, and Sanjay Ranka. Logical disks: User-controllable I/O for scientific applications. In Proceedings of the 1996 IEEE Symposium on Parallel and Distributed Processing, pages 340-347. IEEE Computer Society Press, October 1996.

Abstract: In this paper we propose user-controllable I/O operations and explore the effects of them with some synthetic access patterns. The operations allow users to determine a file structure matching the access patterns, control the layout and distribution of data blocks on physical disks, and present various access patterns with a minimum number of I/O operations. The operations do not use a file pointer to access data as in typical file systems, which eliminates the overhead of managing the offset of the file, making it easy to share data and reducing the number of I/O operations.

Keyword: logical disks, parallel I/O, pario-bib

lee:petal:
Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, MA, October 1996.

Keyword: parallel I/O, distributed file system, declustering, reliability, pario-bib

Comment: They are trying to build a file server that is easier to manage than most of today's distributed file systems, because disks are cheap but management is expensive. They describe a distributed file server that spreads blocks of all files across many disks and many servers. They use chained declustering so that they can survive loss of server or disk. They dynamically balance load. They dynamically reconfigure when new virtual disks are created or new physical disks are added. They've built it all and are now going to look at possible file systems that can take advantage of the features of Petal.
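
Sketch: chained declustering, as mentioned in the comment above, keeps two copies of each block on neighboring servers. The Python fragment below uses the textbook placement rule (primary on server b mod n, backup on the next server in the chain). That rule, the function names, and the failure set are assumptions made for illustration, not Petal's actual layout code.

```python
def placement(block, num_servers):
    """Return (primary, backup) server indices for a block.
    Assumed rule: primary = block mod n, backup = next server in the chain."""
    primary = block % num_servers
    backup = (primary + 1) % num_servers
    return primary, backup


def server_for_read(block, num_servers, failed):
    """Pick a live server holding the block, preferring the primary copy."""
    primary, backup = placement(block, num_servers)
    if primary not in failed:
        return primary
    if backup not in failed:
        return backup
    raise RuntimeError("both copies of the block are unavailable")


# With 4 servers, block 5 lives on server 1 with a backup on server 2.
normal = server_for_read(5, 4, failed=set())
degraded = server_for_read(5, 4, failed={1})
```

Because each server's backup copies sit on its neighbor, any single server can fail without data loss; two adjacent failures in the chain would make some blocks unavailable in this simple model.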

lee:raidmodel:
Edward K. Lee and Randy H. Katz. An analytic performance model of disk arrays. In Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 98-109, 1993.

Keyword: disk array, parallel I/O, RAID, analytic model, pario-bib

lee:userio:
Jang Sun Lee, Sang-Gue Oh, P. Bruce Berra, and Sanjay Ranka. User-controllable I/O for parallel computers. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), pages 442-453, August 1996.

Abstract: This paper presents the design of UPIO, a software for user-controllable parallel input and output. UPIO is designed to maximize I/O performance for scientific applications on MIMD multicomputers. The most important features of UPIO are: It supports a domain-specific file model and a variety of application interfaces to present numerous access patterns. UPIO provides user-controllable I/O operations that allow users to control data access, file structure, and data distribution. The domain-specific file model and user controllability give low I/O overhead and allow programmers to exploit the aggregate bandwidth of parallel disks.

Keyword: parallel I/O, pario-bib

Comment: They describe an interface that seems to allow easier access for programmers that want to map matrices onto parallel files. The concepts are not well explained, so it's hard to really understand what is new and different. They make no explicit comparison with other advanced interfaces like that in Vesta or Galley. No performance results.

li:recursive-tr:
Zhiyong Li, John H. Reif, and Sandeep K. S. Gupta. Synthesizing efficient out-of-core programs for block recursive algorithms using block-cyclic data distributions. Technical Report 96-04, Dept. of Computer Science, Duke University, March 1996.
See also later version li:recursive.

Abstract: In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory model in which data can be distributed using various cyclic(B) distributions in contrast to the normally used physical track distribution cyclic(B_d), where B_d is the physical disk block size.

We first introduce tensor bases to capture the semantics of block-cyclic data distributions of out-of-core data and also data access patterns to out-of-core data. We then present program generation techniques for tensor products and matrix transposition. We accurately represent the number of parallel I/O operations required for the synthesized programs for tensor products and matrix transposition as a function of tensor bases and data distributions. We introduce an algorithm to determine the data distribution which optimizes the performance of the synthesized programs. Further, we formalize the procedure of synthesizing efficient out-of-core programs for tensor product formulas with various block-cyclic distributions as a dynamic programming problem.

We demonstrate the effectiveness of our approach through several examples. We show that the choice of an appropriate data distribution can reduce the number of passes to access out-of-core data by as much as a factor of eight for a tensor product, and the dynamic programming approach can substantially reduce the number of passes to access out-of-core data for the overall tensor product formulas.

Keyword: parallel I/O, out-of-core algorithm, pario-bib

ligon:pfs:
W. B. Ligon and R. B. Ross. Implementation and performance of a parallel file system for high performance distributed applications. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 471-480. IEEE Computer Society Press, August 1996.

Abstract: Dedicated cluster parallel computers (DCPCs) are emerging as low-cost high performance environments for many important applications in science and engineering. A significant class of applications that perform well on a DCPC are coarse-grain applications that involve large amounts of file I/O. Current research in parallel file systems for distributed systems is providing a mechanism for adapting these applications to the DCPC environment. We present the Parallel Virtual File System (PVFS), a system that provides disk striping across multiple nodes in a distributed parallel computer and file partitioning among tasks in a parallel program. PVFS is unique among similar systems in that it uses a stream-based approach that represents each file access with a single set of request parameters and decouples the number of network messages from details of the file striping and partitioning. PVFS also provides support for efficient collective file accesses and allows overlapping file partitions. We present results of early performance experiments that show PVFS achieves excellent speedups in accessing moderately sized file segments.

Keyword: parallel I/O, cluster computing, parallel file system, pario-bib
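
Sketch: the ligon:pfs abstract above describes disk striping across multiple nodes. The Python fragment below shows generic round-robin striping, mapping a logical file offset to a node and an offset within that node's local file. It is not PVFS's actual request format or layout code; the function name and the parameters `stripe_size` and `num_nodes` are assumptions made for illustration.

```python
def locate(offset, stripe_size, num_nodes):
    """Map a logical file offset to (node, local_offset) under
    round-robin striping with fixed-size stripe units."""
    stripe = offset // stripe_size           # which stripe unit holds the byte
    node = stripe % num_nodes                # units are dealt out round-robin
    local = (stripe // num_nodes) * stripe_size + offset % stripe_size
    return node, local


# 64-byte stripe units over 4 nodes.
first = locate(0, 64, 4)
second_unit = locate(64, 64, 4)
wrapped = locate(300, 64, 4)
```

With this mapping a large sequential access touches every node in turn, which is what lets the aggregate bandwidth of the nodes be used by a single file access.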

madhyasta:adaptive:
Tara M. Madhyastha, Christopher L. Elford, and Daniel A. Reed. Optimizing input/output using adaptive file system policies. In Proceedings of the Fifth NASA Goddard conference on Mass Storage Systems, pages II:493-514, September 1996.

Keyword: multiprocessor file system, prefetching, caching, parallel I/O, multiprocessor file system interface, pario-bib

madhyastha:adaptive:
Tara M. Madhyastha and Daniel A. Reed. Intelligent, adaptive file system policy selection. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 172-179. IEEE Computer Society Press, October 1996.

Abstract: Traditionally, maximizing input/output performance has required tailoring application input/output patterns to the idiosyncrasies of specific input/output systems. The authors show that one can achieve high application input/output performance via a low overhead input/output system that automatically recognizes file access patterns and adaptively modifies system policies to match application requirements. This approach reduces the application developer's input/output optimization effort by isolating input/output optimization decisions within a retargetable file system infrastructure. To validate these claims, they have built a lightweight file system policy testbed that uses a trained learning mechanism to recognize access patterns. The file system then uses these access pattern classifications to select appropriate caching strategies, dynamically adapting file system policies to changing input/output demands throughout application execution. The experimental data show dramatic speedups on both benchmarks and input/output intensive scientific applications.

Keyword: parallel I/O, pario-bib

majumdar:characterize:
S. Majumdar and Yiu Ming Leung. Characterization of applications with I/O for processor scheduling in multiprogrammed parallel systems. In Proceedings of the 1994 IEEE Symposium on Parallel and Distributed Processing, pages 298-307. IEEE Computer Society Press, 1994.

Abstract: Most studies of processor scheduling in multiprogrammed parallel systems have ignored the I/O performed by applications. Recent studies have demonstrated that significant I/O operations are performed by a number of different classes of parallel applications. This paper focuses on some basic issues that underlie scheduling in multiprogrammed parallel environments running applications with I/O. Characterization of the I/O behavior of parallel applications is discussed first. Based on simulation models this research investigates the influence of these I/O characteristics on processor scheduling.

Keyword: workload characterization, scheduling, parallel I/O, pario-bib

malluhi:pss:
Qutaibah Malluhi and William E. Johnston. Approaches for a reliable high-performance distributed-parallel storage system. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 500-509. IEEE Computer Society Press, August 1996.

Abstract: The paper studies different schemes to enhance the reliability, availability and security of a high performance distributed storage system. We have previously designed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high speed wide area network to achieve scalability and high data throughput. The general approach of the paper employs erasure error correcting codes to add data redundancy that can be used to retrieve missing information caused by hardware, software, or human faults. The paper suggests techniques for reducing the communication and computation overhead incurred while retrieving missing data blocks from redundant information. These techniques include clustering, multidimensional coding, and the full two dimensional parity scheme.

Keyword: parallel I/O, pario-bib

matthews:hippi:
Kevin C. Matthews. Experiences implementing a shared file system on a HIPPI disk array. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 77-88. IEEE Computer Society Press, September 1995.

Abstract: Shared file systems which use a physically shared mass storage device have existed for many years, although not on UNIX based operating systems. This paper describes a shared file system (SFS) that was implemented first as a special project on the Cray Research Inc. (CRI) UNICOS operating system. A more general product was then built on top of this project using a HIPPI disk array for the shared mass storage. The design of SFS is outlined, as well as some performance experiences with the product. We describe how SFS interacts with the OSF distributed file service (DFS) and with the CRI data migration facility (DMF). We also describe possible development directions for the SFS product.

Keyword: mass storage, distributed file system, parallel I/O, pario-bib

matthijs:framework:
F. Matthijs, Y. Berbers, and P. Verbaeten. A flexible I/O framework for parallel and distributed systems. In Proceedings of the Fifth International Workshop on Object Orientation in Operating Systems, pages 187-190. IEEE Computer Society Press, 1995.

Abstract: We propose a framework for I/O in parallel and distributed systems. The framework is highly customizable and extendible, and enables programmers to offer high level objects in their applications, without requiring them to struggle with the low level and sometimes complex details of high performance distributed I/O. Also, the framework exploits application specific information to improve I/O performance by allowing specialized programmers to customize the framework. Internally, we use indirection and granularity control to support migration, dynamic load balancing, fault tolerance, etc. for objects of the I/O system, including those representing application data.

Keyword: input-output programs, object-oriented, parallel systems, I/O performance, migration, dynamic load balancing, fault tolerance, parallel I/O, pario-bib

menasce:mass:
Daniel Menascé, Odysseas Ioannis Pentakalos, and Yelena Yesha. An analytic model of hierarchical mass storage systems with network-attached storage devices. In Proceedings of the 1996 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 180-189, Philadelphia, PA, May 1996. ACM Press.

Keyword: network attached peripherals, analytic model, mass storage, parallel I/O, pario-bib

moore:ddio:
Jason A. Moore and Michael J. Quinn. Enhancing disk-directed I/O for fine-grained redistribution of file data. Parallel Computing, 1997. To appear.

Keyword: verify publication date and pages, parallel I/O, multiprocessor file system, interprocessor communication, pario-bib

Comment: They propose several enhancements to disk-directed I/O (see kotz:diskdir) that aim to improve performance on fine-grained distributions, that is, where each block from the disk is broken into small pieces that are scattered among the compute processors. One enhancement combines multiple pieces, possibly from separate disk blocks, into a single message. Another is to use two-phase I/O (see delrosario:two-phase), but to use disk-directed I/O to read data from the disks into CP memories, efficiently, then permute. This latter technique is probably faster than normal two-phase I/O that uses a traditional file system, not disk-directed I/O, for the read.

moore:stream-tr:
Jason A. Moore, Philip J. Hatcher, and Michael J. Quinn. Stream*: Fast, flexible, data-parallel I/O. Technical Report 94-80-13, Oregon State University, 1994. Updated September 1995.
See also later version moore:stream.

Abstract: Although hardware supporting parallel file I/O has improved greatly since the introduction of first-generation parallel computers, the programming interface has not. Each vendor provides a different logical view of parallel files as well as nonportable operations for manipulating files. Neither do parallel languages provide standards for performing I/O. In this paper, we describe a view of parallel files for data-parallel languages, dubbed Stream*, in which each virtual processor writes to and reads from its own stream. In this scheme each virtual processor's I/O operations have the same familiar, unambiguous meaning as in a sequential C program. We demonstrate how I/O operations in Stream* can run as fast as those of vendor-specific parallel file systems on the operations most often encountered in data-parallel programs. We show how this system supports general virtual processor operations for debugging and elemental functions. Finally, we present empirical results from a prototype Stream* system running on a Meiko CS-2 multicomputer.

Keyword: data parallel, parallel I/O, pario-bib

Comment: See moore:stream; nearly identical. See also moore:detection. This paper describes the Stream* idea slightly earlier than moore:detection does, but reading moore:detection alone gives a nearly complete picture.

more:mtio:
Sachin More, Alok Choudhary, Ian Foster, and Ming Q. Xu. MTIO: A multi-threaded parallel I/O system. In Proceedings of the Eleventh International Parallel Processing Symposium, April 1997.

Abstract: This paper presents the design and evaluation of a multi-threaded runtime library for parallel I/O. We extend the multi-threading concept to separate the compute and I/O tasks in two separate threads of control. Multi-threading in our design permits a) asynchronous I/O even if the underlying file system does not support asynchronous I/O; b) copy avoidance from the I/O thread to the compute thread by sharing address space; and c) a capability to perform collective I/O asynchronously without blocking the compute threads. Further, this paper presents techniques for collective I/O which maximize load balance and concurrency while reducing communication overhead in an integrated fashion. Performance results on IBM SP2 for various data distributions and access patterns are presented. The results show that there is a tradeoff between the amount of concurrency in I/O and the buffer size designated for I/O; and there is an optimal buffer size beyond which benefits of larger requests diminish due to large communication overheads.

Keyword: verify pages, threads, parallel I/O, pario-bib
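
Points (a) and (b) of the abstract can be pictured with a small sketch. It is not MTIO itself: the class and method names are hypothetical, and it uses plain Python threads and files. A dedicated I/O thread issues ordinary blocking writes, so the compute thread gets asynchronous behavior even though the file interface is synchronous, and the buffer is handed over by reference within the shared address space rather than copied.

```python
import os
import queue
import tempfile
import threading


class IOThread:
    """Hypothetical helper: one background thread performing blocking writes."""

    def __init__(self):
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:              # shutdown sentinel
                break
            path, offset, data, done = item
            try:
                # Ordinary blocking I/O; only this thread waits for it.
                with open(path, "r+b") as f:
                    f.seek(offset)
                    f.write(data)
            finally:
                done.set()                # always wake any waiter

    def write_async(self, path, offset, data):
        """Queue a write and return at once; data is passed by reference."""
        done = threading.Event()
        self._requests.put((path, offset, data, done))
        return done

    def close(self):
        self._requests.put(None)
        self._worker.join()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    io_thread = IOThread()
    pending = io_thread.write_async(path, 0, b"checkpoint")
    total = sum(range(1000))              # computation overlaps the write
    pending.wait()                        # block only when the result is needed
    io_thread.close()
    with open(path, "rb") as f:
        print(f.read(), total)
    os.remove(path)
```

A collective version, as in point (c), would need the I/O threads on different processors to coordinate with one another; that part is not sketched here.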

mowry:prefetch:
Todd C. Mowry, Angela K. Demke, and Orran Krieger. Automatic compiler-inserted I/O prefetching for out-of-core applications. In Proceedings of the 1996 Symposium on Operating Systems Design and Implementation, pages 3-17. USENIX Association, October 1996.

Abstract: Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve "out-of-core" problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and release hints for managing I/O, and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively move prefetches back ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We have implemented our scheme using the SUIF compiler and the Hurricane operating system. Our experimental results demonstrate that our fully-automatic scheme effectively hides the I/O latency in out-of-core versions of the entire NAS Parallel benchmark suite, thus resulting in speedups of roughly twofold for five of the eight applications, with two applications speeding up by threefold or more.

Keyword: compiler, prefetch, parallel I/O, pario-bib

Comment: Best Paper Award

moyer:jcharacterize:
Steven A. Moyer and V.S. Sunderam. Characterizing concurrency control performance for the PIOUS parallel file system. Journal of Parallel and Distributed Computing, 38(1):81-91, October 1996.
See also earlier version moyer:characterize.

Keyword: parallel I/O, multiprocessor file system, pario-bib

mueck:multikey:
T. A. Mueck and J. Witzmann. Multikey index support for tuple sets on parallel mass storage systems. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 136-145, September 1995.

Abstract: The development and evaluation of a tuple set manager (TSM) based on multikey index data structures is a main part of the PARABASE project at the University of Vienna. The TSM provides access to parallel mass storage systems using tuple sets instead of conventional files as the central data structure for application programs. A proof-of-concept prototype TSM is already implemented and operational on an iPSC/2. It supports tuple insert and delete operations as well as exact match, partial match, and range queries at system call level. Available results are from this prototype on the one hand and from various performance evaluation figures. The evaluation results demonstrate the performance gain achieved by the implementation of the tuple set management concept on a parallel mass storage system.

Keyword: parallel database, mass storage, parallel I/O, pario-bib

myllymaki:buffering:
Jussi Myllymaki and Miron Livny. Efficient buffering for concurrent disk and tape I/O. Performance Evaluation: An International Journal, 27/28:453-471, 1996. Performance '96.

Keyword: buffering, file caching, tertiary storage, tape robot, file migration, parallel I/O, pario-bib

Comment: Ways to use secondary and tertiary storage in parallel, and buffering mechanisms for applications with concurrent I/O requirements.

nakajo:jump1:
Hironori Nakajo. A simulation-based evaluation of a disk I/O subsystem for a massively parallel computer: JUMP-1. In Proceedings of the Sixteenth International Conference on Distributed Computing Systems, pages 562-569. IEEE Computer Society Press, May 1996.

Abstract: JUMP-1 is a distributed shared-memory massively parallel computer composed of multiple clusters connected by an interconnection network called RDT (Recursive Diagonal Torus). Each cluster in JUMP-1 consists of 4 element processors, secondary cache memories, and 2 MBPs (Memory Based Processors) for high-speed synchronization and communication among clusters. The I/O subsystem is connected to a cluster via a high-speed serial link called STAFF-Link. The I/O buffer memory is mapped onto the JUMP-1 global shared memory so that each I/O access can be performed as a memory access. In this paper we describe an evaluation of the fundamental performance of the disk I/O subsystem using event-driven simulation, and its estimated performance with a Video On Demand (VOD) application.

Keyword: parallel I/O, I/O architecture, pario-bib

natarajan:clusterio:
Chita Natarajan and Ravishankar K. Iyer. Measurement and simulation based performance analysis of parallel I/O in a high-performance cluster system. In Proceedings of the 1996 IEEE Symposium on Parallel and Distributed Processing, pages 332-339. IEEE Computer Society Press, October 1996.

Abstract: This paper presents a measurement and simulation based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The measurements were used to characterize the performance bottlenecks and the throughput limits at the compute and I/O nodes, and to provide realistic input parameters to PioSim, a simulation environment we have developed to investigate parallel I/O performance issues in cluster systems. PioSim was used to obtain a detailed characterization of parallel I/O performance, in the high performance cluster system, for different regular access patterns and different system configurations. This paper also explores the use of local disks at the compute nodes for parallel I/O, and finds that the local disk architecture outperforms the traditional parallel I/O over remote I/O node disks architecture, even when as much as 68-75% of the requests from each compute node go to remote disks.

Keyword: performance analysis, parallel I/O, pario-bib

nieplocha:arrays:
Jarek Nieplocha and Ian Foster. Disk resident arrays: An array-oriented I/O library for out-of-core computations. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 196-204. IEEE Computer Society Press, October 1996.

Abstract: In out-of-core computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However the tools used to manage this storage are typically quite different from those used to manage access to local and remote memory. This disparity complicates implementation of out-of-core algorithms and hinders portability. We describe a programming model that addresses this problem. This model allows parallel programs to use essentially the same mechanisms to manage the movement of data between any two adjacent levels in a hierarchical memory system. We take as our starting point the Global Arrays shared-memory model and library, which support a variety of operations on distributed arrays, including transfer between local and remote memories. We show how this model can be extended to support explicit transfer between global memory and secondary storage, and we define a Disk Resident Arrays Library that supports such transfers. We illustrate the utility of the resulting model with two applications, an out-of-core matrix multiplication and a large computational chemistry program. We also describe implementation techniques on several parallel computers and present experimental results that demonstrate that the Disk Resident Arrays model can be implemented very efficiently on parallel computers.

Keyword: parallel I/O, pario-bib

nieuwejaar:jgalley:
Nils Nieuwejaar and David Kotz. The Galley parallel file system. Parallel Computing, 1997. To appear.
See also earlier version nieuwejaar:jgalley-tr.

Abstract: Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

Keyword: verify month and pages, parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk

Comment: A revised version of nieuwejaar:jgalley-tr, which is a combination of nieuwejaar:galley and nieuwejaar:galley-perf.

nieuwejaar:jgalley-tr:
Nils Nieuwejaar and David Kotz. The Galley parallel file system. Technical Report PCS-TR96-286, Dept. of Computer Science, Dartmouth College, May 1996. To appear in Parallel Computing.
See also earlier version nieuwejaar:galley.
See also later version nieuwejaar:jgalley.

Abstract: Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

Keyword: parallel file system, parallel I/O, multiprocessor file system interface, pario-bib, dfk

nieuwejaar:thesis:
Nils A. Nieuwejaar. Galley: A New Parallel File System for Parallel Applications. PhD thesis, Dept. of Computer Science, Dartmouth College, November 1996. Available as Dartmouth Technical Report PCS-TR96-300.

Abstract: Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Most multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access those multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated application and library programmers to use knowledge about their I/O to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. In this work we examine current multiprocessor file systems, as well as how those file systems are used by scientific applications. Contrary to the expectations of the designers of current parallel file systems, the workloads on those systems are dominated by requests to read and write small pieces of data. Furthermore, rather than being accessed sequentially and contiguously, as in uniprocessor and supercomputer workloads, files in multiprocessor file systems are accessed in regular, structured, but non-contiguous patterns. Based on our observations of multiprocessor workloads, we have designed Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. In this work, we introduce Galley and discuss its design and implementation. We describe Galley's new three-dimensional file structure and discuss how that structure can be used by parallel applications to achieve higher performance. We introduce several new data-access interfaces, which allow applications to explicitly describe the regular access patterns we found to be common in parallel file system workloads. We show how these new interfaces allow parallel applications to achieve tremendous increases in I/O performance. Finally, we discuss how Galley's new file structure and data-access interfaces can be useful in practice.

Keyword: parallel I/O, multiprocessor file system, file system workload characterization, file access patterns, file system interface, pario-bib

nieuwejaar:workload:
Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael Best. File-access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, October 1996.
See also earlier version nieuwejaar:workload-tr.

Abstract: Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems.

The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide multiprocessor file-system design.

Keyword: parallel I/O, file system workload, workload characterization, file access pattern, multiprocessor file system, dfk, pario-bib

Comment: See also kotz:workload, nieuwejaar:strided, ap:workload.

nodine:deterministic:
M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures, pages 120-129, Velen, Germany, 1993.

Abstract: We present an elegant deterministic load balancing strategy for distribution sort that is applicable to a wide variety of parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm for external sorting with multiple disks and parallel processors. In each input/output (I/O) operation, each of the $D \geq 1$ disks can simultaneously transfer a block of $B$ contiguous records. Our two measures of performance are the number of I/Os and the amount of work done by the CPU(s); our algorithm is simultaneously optimal for both measures. We also show how to sort deterministically in parallel memory hierarchies. When the processors are interconnected by any sort of a PRAM, our algorithms are optimal for all parallel memory hierarchies; when the interconnection network is a hypercube, our algorithms are either optimal or best-known.

Comment: Short version of nodine:sort2 and nodine:sortdisk.
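
For orientation, the I/O bound usually quoted for sorting in the parallel disk model, which is the sense in which "optimal" is normally meant for the number of I/Os, is shown below. The symbols $N$ (number of records to sort) and $M$ (number of records that fit in internal memory) are assumptions borrowed from the standard statement of that model; they do not appear in the abstract above.

```latex
\[
  \Theta\!\left( \frac{N}{DB} \cdot \frac{\log (N/B)}{\log (M/B)} \right)
  \quad \text{I/O operations}
\]
```

With $D = 1$ this reduces to the familiar single-disk external-sorting bound.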

nurmi:atm:
Marc A. Nurmi, William E. Bejcek, Rod N. Gregoire, K. C. Liu, and Mark D. Pohl. Automatic management of CPU and I/O bottlenecks in distributed applications on ATM networks. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 481-489. IEEE Computer Society Press, August 1996.

Abstract: Existing parallel programming environments for networks of workstations improve the performance of computationally intensive applications by using message passing or virtual shared memory to alleviate CPU bottlenecks. This paper describes an approach based on message passing that addresses both CPU and I/O bottlenecks for a specific class of distributed applications on ATM networks. ATM provides the bandwidth required to utilize multiple I/O channels in parallel. This paper also describes an environment based on distributed process management and centralized application management that implements the approach. The environment adds processes to a running application when necessary to alleviate CPU and I/O bottlenecks while managing process connections in a manner that is transparent to the application.

Keyword: parallel I/O, ATM, parallel networking, pario-bib

ober:seismic:
Curtis Ober, Ron Oldfield, John VanDyke, and David Womble. Seismic imaging on massively parallel computers. Technical Report SAND96-1112, Sandia National Laboratories, April 1996.

Abstract: Fast, accurate imaging of complex, oil-bearing geologies, such as overthrusts and salt domes, is the key to reducing the costs of domestic oil and gas exploration. Geophysicists say that the known oil reserves in the Gulf of Mexico could be significantly increased if accurate seismic imaging beneath salt domes was possible. A range of techniques exist for imaging these regions, but the highly accurate techniques involve the solution of the wave equation and are characterized by large data sets and large computational demands. Massively parallel computers can provide the computational power for these highly accurate imaging techniques.

A brief introduction to seismic processing will be presented, and the implementation of a seismic-imaging code for distributed memory computers will be discussed. The portable code, Salvo, performs a wave-equation-based, 3-D, prestack, depth imaging and currently runs on the Intel Paragon, the Cray T3D and SGI Challenge series. It uses MPI for portability, and has sustained 22 Mflops/sec/proc (compiled FORTRAN) on the Intel Paragon.

Keyword: multiprocessor application, scientific computing, seismic data processing, parallel I/O, pario-bib

Comment: 2 pages about their I/O scheme, mostly regarding a calculation of the optimal balance between compute nodes and I/O nodes to achieve perfect overlap.

park:interface:
Yoonho Park, Ridgway Scott, and Stuart Sechrest. Virtual memory versus file interfaces for large, memory-intensive scientific applications. In Proceedings of Supercomputing '96. ACM Press and IEEE Computer Society Press, November 1996. Also available as UH Department of Computer Science Research Report UH-CH-96-7.

Abstract: Scientific applications often require some strategy for temporary data storage to do the largest possible simulations. The use of virtual memory for temporary data storage has received criticism because of performance problems. However, modern virtual memory found in recent operating systems such as Cenju-3/DE gives application writers control over virtual memory policies. We demonstrate that custom virtual memory policies can dramatically reduce virtual memory overhead and allow applications to run out-of-core efficiently. We also demonstrate that the main advantage of virtual memory, namely programming simplicity, is not lost.

Keyword: virtual memory, file interface, scientific applications, out-of-core, parallel I/O, pario-bib

Comment: Web and CDROM only.

salmon:nbody:
John Salmon and Michael Warren. Parallel out-of-core methods for N-body simulation. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, 1997.

Abstract: Hierarchical treecodes have, to a large extent, converted the compute-bound N-body problem into a memory-bound problem. The large ratio of DRAM to disk pricing suggests use of out-of-core techniques to overcome memory capacity limitations. We will describe a parallel, out-of-core treecode library, targeted at machines with independent secondary storage associated with each processor. Borrowing the space-filling curve techniques from our in-core library, and "manually" paging, results in excellent spatial and temporal locality and very good performance.

Keyword: verify pages and month, parallel I/O, out of core applications, scientific computing, pario-bib

scheuermann:partition2:
Peter Scheuermann, Gerhard Weikum, and Peter Zabback. Data partitioning and load balancing in parallel disk systems. Technical Report A/02/96, Universität des Saarlandes, Saarbrücken, Germany, April 1996. Submitted to VLDB Journal.
See also earlier version scheuermann:partition.

Keyword: verify, parallel I/O, disk array, disk striping, load balance, pario-bib

Comment: Updated version of scheuermann:partition.

seamons:jpanda:
Kent E. Seamons and Marianne Winslett. Multidimensional array I/O in Panda 1.0. Journal of Supercomputing, 10(2):191-211, 1996.
See also earlier version seamons:interface.

Keyword: parallel I/O, collective I/O, pario-bib

seamons:thesis:
Kent E. Seamons. Panda: Fast Access to Persistent Arrays Using High Level Interfaces and Server Directed Input/Output. PhD thesis, University of Illinois at Urbana-Champaign, May 1996.

Abstract: Multidimensional arrays are a fundamental data type in scientific computing and are used extensively across a broad range of applications. Often these arrays are persistent, i.e., they outlive the invocation of the program that created them. Portability and performance with respect to input and output (i/o) pose significant challenges to applications accessing large persistent arrays, especially in distributed-memory environments. A significant number of scientific applications perform conceptually simple array i/o operations, such as reading or writing a subarray, an entire array, or a list of arrays. However, the algorithms to perform these operations efficiently on a given platform may be complex and non-portable, and may require costly customizations to operating system software.

This thesis presents a high-level interface for array i/o and three implementation architectures, embodied in the Panda (Persistence AND Arrays) array i/o library. The high-level interface contributes to application portability, by encapsulating unnecessary details and being easy to use. Performance results using Panda demonstrate that an i/o system can provide application programs with a high-level, portable, easy-to-use interface for array i/o without sacrificing performance or requiring custom system software; in fact, combining all these benefits may only be possible through a high-level interface due to the great freedom and flexibility a high-level interface provides for the underlying implementation.

The Panda server-directed i/o architecture is a prime example of an efficient implementation of collective array i/o for closely synchronized applications in distributed-memory single-program multiple-data (SPMD) environments. A high-level interface is instrumental to the good performance of server-directed i/o, since it provides a global view of an upcoming collective i/o operation that Panda uses to plan sequential reads and writes. Performance results show that with server-directed i/o, Panda achieves throughputs close to the maximum AIX file system throughput on the i/o nodes of the IBM SP2 when reading and writing large multidimensional arrays.

Keyword: parallel I/O, persistent data, parallel computing, pario-bib

Comment: see also chen:panda, seamons:panda, seamons:compressed, seamons:interface, seamons:schemas, seamons:msio, seamons:jpanda

shriver:api-tr:
Elizabeth A. M. Shriver and Leonard F. Wisniewski. An API for choreographing data accesses. Technical Report PCS-TR95-267, Dept. of Computer Science, Dartmouth College, November 1995.

Abstract: Current APIs for multiprocessor multi-disk file systems are not easy to use in developing out-of-core algorithms that choreograph parallel data accesses. Consequently, the efficiency of these algorithms is hard to achieve in practice. We address this deficiency by specifying an API that includes data-access primitives for data choreography. With our API, the programmer can easily access specific blocks from each disk in a single operation, thereby fully utilizing the parallelism of the underlying storage system. Our API supports the development of libraries of commonly-used higher-level routines such as matrix-matrix addition, matrix-matrix multiplication, and BMMC (bit-matrix-multiply/complement) permutations. We illustrate our API in implementations of these three high-level routines to demonstrate how easy it is to use.

Keyword: parallel I/O, multiprocessor file system interface, pario-bib

Comment: Also published as Courant Institute Tech Report 708.

si-woong:cluster:
Jang Si-Woong, Chung Ki-Dong, and Sam Coleman. Design and implementation of a network-wide concurrent file system in a workstation cluster. In Proceedings of the Fourteenth IEEE Symposium on Mass Storage Systems, pages 239-245. IEEE Computer Society Press, September 1995.

Abstract: We estimate the performance of a network-wide concurrent file system implemented using conventional disks as disk arrays. Tests were carried out on both single system and network-wide environments. On single systems, a file was split across several disks to test the performance of file I/O operations. We concluded that performance was proportional to the number of disks, up to four, on a system with high computing power. Performance of a system with low computing power, however, did not increase, even with more than two disks. When we split a file across disks in a network-wide system called the Network-wide Concurrent File System (N-CFS), we found performance similar to or slightly higher than that of disk arrays on single systems. Since file access through N-CFS is transparent, this system enables traditional disks on single and networked systems to be used as disk arrays for I/O intensive jobs.

Keyword: mass storage, cluster computing, distributed file system, parallel I/O, pario-bib

smirni:evolutionary:
Evgenia Smirni, Ruth A. Aydt, Andrew A. Chien, and Daniel A. Reed. I/O requirements of scientific applications: An evolutionary view. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pages 49-59, Syracuse, NY, 1996. IEEE Computer Society Press.

Abstract: The modest I/O configurations and file system limitations of many current high-performance systems preclude solution of problems with large I/O needs. I/O hardware and file system parallelism is the key to achieving high performance. We analyze the I/O behavior of several versions of two scientific applications on the Intel Paragon XP/S. The versions involve incremental application code enhancements across multiple releases of the operating system. Studying the evolution of I/O access patterns underscores the interplay between application access patterns and file system features. Our results show that both small and large request sizes are common, that at present, application developers must manually aggregate small requests to obtain high disk transfer rates, that concurrent file accesses are frequent, and that appropriate matching of the application access pattern and the file system access mode can significantly increase application I/O performance. Based on these results, we describe a set of file system design principles.

Keyword: I/O, workload characterization, scientific computing, parallel I/O, pario-bib

Comment: They study two applications over several versions, using Pablo to capture the I/O activity. They thus watch as application developers improve the applications' use of I/O modes and request sizes. Both applications move through three phases: initialization, computation (with out-of-core I/O or checkpointing I/O), and output. They found it necessary to tune the I/O request sizes to match the parameters of the I/O system. In the initial versions, the code used small read and write requests, which were (according to the developers) the "easiest and most natural implementation for their I/O." They restructured the I/O to make bigger requests, which better matched the capabilities of Intel PFS. They conclude that asynchronous and collective operations are imperative. They would like to see a file system that can adapt dynamically to adjust its policies to the apparent access patterns. Automatic request aggregation of some kind seems like a good idea; of course, that is one feature of a buffer cache.
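
The request aggregation discussed above can be sketched in a few lines. This is a minimal illustration, not code from the paper or from Intel PFS: the class name, the 64 KB request size, and the in-memory sink are all assumptions. Small sequential writes accumulate in a buffer and go to the underlying file only in large requests.

```python
import io


class AggregatingWriter:
    """Hypothetical wrapper that merges small sequential writes into large ones."""

    def __init__(self, raw, request_size):
        self.raw = raw                    # any object with a write(bytes) method
        self.request_size = request_size  # size of each underlying request
        self.pending = bytearray()
        self.requests_issued = 0

    def write(self, data):
        """Buffer data; issue a request each time a full one is available."""
        self.pending += data
        while len(self.pending) >= self.request_size:
            self._issue(self.request_size)

    def flush(self):
        """Issue whatever is left as one final, possibly short, request."""
        if self.pending:
            self._issue(len(self.pending))

    def _issue(self, nbytes):
        self.raw.write(bytes(self.pending[:nbytes]))
        del self.pending[:nbytes]
        self.requests_issued += 1


if __name__ == "__main__":
    sink = io.BytesIO()
    writer = AggregatingWriter(sink, request_size=64 * 1024)
    for _ in range(1000):
        writer.write(b"x" * 512)          # 1000 small application-level writes
    writer.flush()
    print(writer.requests_issued, "underlying requests for",
          len(sink.getvalue()), "bytes")
```

Doing this by hand in each application is the tuning burden the authors describe; a buffer cache or an adaptive file system would take it over.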

srinilta:strategies:
Chutimet Srinilta, Divyesh Jadav, and Alok Choudhary. Design and evaluation of data storage and retrieval strategies in a distributed memory continuous media server. In Proceedings of the Eleventh International Parallel Processing Symposium, April 1997.

Abstract: High performance servers and high-speed networks will form the backbone of the infra-structure required for distributed multimedia information systems. Given that the goal of such a server is to support hundreds of interactive data streams simultaneously, various tradeoffs are possible with respect to the storage of data on secondary memory, and its retrieval therefrom. In this paper we identify and evaluate these tradeoffs. We evaluate the effect of varying the stripe factor and also the performance of batched retrieval of disk-resident data. We develop a methodology to predict the stream capacity of such a server. The evaluation is done for both uniform and skewed access patterns. Experimental results on the Intel Paragon computer are presented.

Keyword: verify pages, threads, parallel I/O, pario-bib

subramaniam:msthesis:
Mahesh Subramaniam. Efficient implementation of server-directed I/O. Master's thesis, Dept. of Computer Science, University of Illinois, June 1996.

Abstract: Parallel computers are a cost effective approach to providing significant computational resources to a broad range of scientific and engineering applications. Due to the relatively lower performance of the I/O subsystems on these machines and due to the significant I/O requirements of these applications, the I/O performance can become a major bottleneck. Optimizing the I/O phase of these applications poses a significant challenge. A large number of these scientific and engineering applications perform simple operations on multidimensional arrays and providing an easy and efficient mechanism for implementing these operations is important. The Panda array I/O library provides simple high level interfaces to specify collective I/O operations on multidimensional arrays in a distributed memory single-program multiple-data (SPMD) environment. The high level information provided by the user through these interfaces allows the Panda array I/O library to produce an efficient implementation of the collective I/O request. The use of these high level interfaces also increases the portability of the application.

This thesis presents an efficient and portable implementation of the Panda array I/O library. In this implementation, standard software components are used to build the I/O library to aid its portability. The implementation also provides a simple, flexible framework for the implementation and integration of the various collective I/O strategies. The server directed I/O and the reduced messages server directed I/O algorithms are implemented in the Panda array I/O library. This implementation supports the sharing of the I/O servers between multiple applications by extending the collective I/O strategies. Also, the implementation supports the use of part time I/O nodes where certain designated compute nodes act as the I/O servers during the I/O phase of the application. The performance of this implementation of the Panda array I/O library is measured on the IBM SP2 and the performance results show that for read and write operations, the collective I/O strategies used by the Panda array I/O library achieve throughputs close to the maximum throughputs provided by the underlying file system on each I/O node of the IBM SP2.

Keyword: parallel I/O, multiprocessor file system, pario-bib

thakur:abstract:
Rajeev Thakur, William Gropp, and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 180-187, October 1996.
See also earlier version thakur:abstract-tr.

Abstract: In this paper, we propose a strategy for implementing parallel-I/O interfaces portably and efficiently. We have defined an abstract-device interface for parallel I/O, called ADIO. Any parallel-I/O API can be implemented on multiple file systems by implementing the API portably on top of ADIO, and implementing only ADIO on different file systems. This approach simplifies the task of implementing an API and yet exploits the specific high-performance features of individual file systems. We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our performance studies indicate that the overhead of using ADIO as an implementation strategy is very low.

Keyword: parallel I/O, multiprocessor file system interface, pario-bib
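
Illustration: The layering described in the abstract above can be sketched roughly as follows. This is a minimal sketch and not the ADIO interface itself: the class names (AbstractDevice, MemoryDevice, LocalFileDevice, RecordAPI) and their methods are hypothetical, chosen only to show a user-facing API written once against a small device layer, with the device layer written once per storage backend.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class AbstractDevice(ABC):
    """Hypothetical device layer: the only part rewritten per file system."""

    @abstractmethod
    def write_at(self, offset: int, data: bytes) -> None:
        """Write data starting at the given byte offset."""

    @abstractmethod
    def read_at(self, offset: int, size: int) -> bytes:
        """Read up to size bytes starting at the given byte offset."""


class MemoryDevice(AbstractDevice):
    """Backend that keeps the 'file' in a growable in-memory buffer."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def write_at(self, offset: int, data: bytes) -> None:
        end = offset + len(data)
        if end > len(self._buf):
            # Zero-fill any gap, mimicking a sparse file.
            self._buf.extend(bytes(end - len(self._buf)))
        self._buf[offset:end] = data

    def read_at(self, offset: int, size: int) -> bytes:
        return bytes(self._buf[offset:offset + size])


class LocalFileDevice(AbstractDevice):
    """Backend that stores the data in an ordinary local file."""

    def __init__(self, path: str) -> None:
        self._path = path
        open(path, "ab").close()  # create the file if it does not exist

    def write_at(self, offset: int, data: bytes) -> None:
        with open(self._path, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def read_at(self, offset: int, size: int) -> bytes:
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(size)


class RecordAPI:
    """Hypothetical user-facing API, implemented only in terms of the device layer."""

    def __init__(self, device: AbstractDevice, record_size: int) -> None:
        self.device = device
        self.record_size = record_size

    def put(self, index: int, record: bytes) -> None:
        if len(record) != self.record_size:
            raise ValueError("record has the wrong size")
        self.device.write_at(index * self.record_size, record)

    def get(self, index: int) -> bytes:
        return self.device.read_at(index * self.record_size, self.record_size)
```

The same RecordAPI code runs unchanged on either backend, for example RecordAPI(MemoryDevice(), 4) or RecordAPI(LocalFileDevice(path), 4); supporting another storage system in this sketch means adding one more AbstractDevice subclass.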

thakur:abstract-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. Technical Report MCS-P592-0596, Argonne National Laboratory, Mathematics and Computer Science Division, May 1996.
See also later version thakur:abstract.

Keyword: multiprocessor file system interface, parallel I/O, pario-bib

Comment: They propose an intermediate interface that can serve as an implementation base for all parallel file-system APIs, and which can itself be implemented on top of all parallel file systems. This ``universal'' interface allows all apps to run on all file systems with no porting, and for people to experiment with different APIs.

thakur:evaluation-tr:
Rajeev Thakur, William Gropp, and Ewing Lusk. An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application. Technical Report MCS-P569-0296, Argonne National Laboratory, February 1996.
See also later version thakur:evaluation.

Abstract: This paper presents the results of an experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon. For the evaluation, we used a full, three-dimensional application code that is in production use for studying the nonlinear evolution of Jeans instability in self-gravitating gaseous clouds. The application performs I/O by using library routines that we developed and optimized separately for parallel I/O on the SP and Paragon. The I/O routines perform two-phase I/O and use the PIOFS file system on the SP and PFS on the Paragon. We studied the I/O performance for two different sizes of the application. We found that for the small case, I/O was faster on the SP, whereas for the large case, I/O took almost the same time on both systems. Communication required for I/O was faster on the Paragon in both cases. The highest read bandwidth obtained was 48 Mbytes/sec. and the highest write bandwidth obtained was 31.6 Mbytes/sec., both on the SP.

Keyword: parallel I/O, multiprocessor file system, pario-bib

Comment: This version is no longer on the web.

thakur:jext2phase:
Rajeev Thakur and Alok Choudhary. An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays. Scientific Programming, 5(4):301-317, Winter 1996.
See also earlier version thakur:ext2phase2.

Abstract: A number of applications on parallel computers deal with very large data sets that cannot fit in main memory. In such applications, data must be stored in files on disks and fetched into memory during program execution. Parallel programs with large out-of-core arrays stored in files must read/write smaller sections of the arrays from/to files. In this article, we describe a method for accessing sections of out-of-core arrays efficiently. Our method, the extended two-phase method, uses collective I/O: Processors cooperate to combine several I/O requests into fewer larger granularity requests, reorder requests so that the file is accessed in proper sequence, and eliminate simultaneous I/O requests for the same data. In addition, the I/O workload is divided among processors dynamically, depending on the access requests. We present performance results obtained from two real out-of-core parallel applications--matrix multiplication and a Laplace's equation solver--and several synthetic access patterns, all on the Intel Touchstone Delta. These results indicate that the extended two-phase method significantly outperformed a direct (noncollective) method for accessing out-of-core array sections.

Keyword: parallel I/O, pario-bib
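
Illustration: The request-combining step mentioned in the abstract above (merging several requests into fewer, larger ones, ordering them by file position, and dropping duplicate requests for the same data) can be sketched as a simple interval merge. This is only an assumed illustration of that one idea, not the authors' algorithm: the function name coalesce and the (offset, length) request format are hypothetical, and the sketch omits the dynamic division of I/O work among processors and the redistribution of data afterwards.

```python
def coalesce(requests):
    """Merge byte-range requests into sorted, non-overlapping ranges.

    requests: iterable of (offset, length) pairs, possibly overlapping,
    duplicated, or out of order (for example, gathered from many processors).
    Returns a list of (offset, length) pairs in increasing file order.
    """
    merged = []  # list of [start, end) pairs, kept sorted and disjoint
    for start, length in sorted(requests):
        if length <= 0:
            continue  # ignore empty requests
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```

For example, coalesce([(100, 50), (0, 40), (40, 20), (120, 10), (100, 50)]) yields two requests, (0, 60) and (100, 50), in place of the original five.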

thakur:jpassion:
Rajeev Thakur, Alok Choudhary, Rajesh Bordawekar, Sachin More, and Sivaramakrishna Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996.

Keyword: parallel I/O, pario-bib

Comment: See thakur:passion, choudhary:passion.

thomas:panda:
Joel T. Thomas. The Panda array I/O library on the Galley parallel file system. Technical Report PCS-TR96-288, Dept. of Computer Science, Dartmouth College, June 1996. Senior Honors Thesis.

Abstract: The Panda Array I/O library, created at the University of Illinois, Urbana-Champaign, was built especially to address the needs of high-performance scientific applications. I/O has been one of the most frustrating bottlenecks to high performance for quite some time, and the Panda project is an attempt to ameliorate this problem while still providing the user with a simple, high-level interface. The Galley File System, with its hierarchical structure of files and strided requests, is another attempt at addressing the performance problem. My project was to redesign