Keyword: compiler, I/O, pario-bib
Comment: Not really about parallel applications or
parallel I/O, but I think it may be of interest to that community. They
propose a compiler framework that inserts asynchronous I/O
operations (start I/O, finish I/O) so as to satisfy the dependency constraints of
the program.
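Illustration for the comment above (not taken from the paper): a minimal Python sketch of the start-I/O / finish-I/O split, assuming a thread pool stands in for an asynchronous I/O facility. The names start_io, finish_io, and checksum_with_overlap are hypothetical. The point is only that "finish" must precede the first use of the data, while work that does not depend on the data can run in between.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def _read_all(path):
    with open(path, "rb") as f:
        return f.read()


def start_io(pool, path):
    """Start I/O: issue the read in the background and return a handle."""
    return pool.submit(_read_all, path)


def finish_io(handle):
    """Finish I/O: block until the data is available."""
    return handle.result()


def checksum_with_overlap(path, independent_work):
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = start_io(pool, path)   # hoisted as early as possible
        side = independent_work()       # does not depend on the file data
        data = finish_io(handle)        # placed just before the first use
    return sum(data), side


def demo():
    # Write a tiny temporary file, then read it with overlapped work.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes([1, 2, 3]))
        path = f.name
    try:
        return checksum_with_overlap(path, lambda: 42)
    finally:
        os.remove(path)
```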
Abstract: High-performance parallel file systems
are needed to satisfy tremendous I/O requirements of parallel scientific
applications. The design of such parallel file systems depends on a
comprehensive understanding of the expected workload, but so far there have
been very few usage studies of multiprocessor file systems. In the first part
of this dissertation, we attempt to fill this void by measuring a real
file-system workload on a production parallel machine, namely the CM-5 at the
National Center for Supercomputing Applications. We collect information about
nearly every individual I/O request from the mix of jobs running on the
machine. Analysis of the traces leads to various recommendations for design
of future parallel file systems. Our usage study showed that writes to
write-only files are a dominant part of the workload. Therefore, optimizing
writes could have a significant impact on overall performance. In the second
part of this dissertation, we propose ENWRICH, a compute-processor
write-caching scheme for write-only files in parallel file systems. Within
its framework, ENWRICH uses a recently proposed high performance
implementation of collective I/O operations called disk-directed I/O, but it
eliminates a number of limitations of disk-directed I/O. ENWRICH combines
low-overhead write caching at the compute processors with high performance
disk-directed I/O at the I/O processors to achieve both low latency and high
bandwidth. This combination facilitates the use of the powerful disk-directed
I/O technique independent of any particular choice of interface, and without
the requirement for mapping libraries at the I/O processors. By collecting
writes over many files and applications, ENWRICH lets the I/O processors
optimize disk I/O over a large pool of requests. We evaluate our design of
ENWRICH using a simulated implementation and extensive experimentation. We show
that ENWRICH achieves high performance for various configurations and
workloads. We pinpoint the reasons for ENWRICH's failure to perform well for
certain workloads, and suggest possible enhancements. Finally, we discuss the
nuances of implementing ENWRICH on a real platform and speculate about
possible adaptations of ENWRICH for emerging multiprocessing platforms.
Keyword: parallel I/O, multiprocessor file system,
file access patterns, workload characterization, file caching, disk-directed
I/O, pario-bib
Comment: See also ap:enwrich, ap:workload, and
nieuwejaar:workload
Abstract: The significant difference between the
speeds of the I/O system (e.g., disks) and compute processors in parallel
systems creates a bottleneck that lowers the performance of an application
that performs a considerable number of disk accesses. A major portion of the
compute processors' time is wasted on waiting for I/O to complete. This
problem can be addressed to a certain extent, if the necessary data can be
fetched from the disk before the I/O call to the disk is issued. Fetching
data ahead of time, known as prefetching, in a multiprocessor environment
depends a great deal on the application's access pattern. The subject of this
paper is implementation and performance evaluation of a prefetching prototype
in a production parallel file system on the Intel Paragon. Specifically, this
paper presents a) design and implementation of a prefetching strategy in the
parallel file system and b) performance measurements and evaluation of the
file system with and without prefetching. The prototype is designed at the
operating system level for the PFS. It is implemented in the PFS subsystem of
the Intel Paragon Operating System. It is observed that in many cases
prefetching provides considerable performance improvements. In some other
cases no improvements or some performance degradation is observed due to the
overheads incurred in prefetching.
Keyword: parallel I/O, prefetching, multiprocessor
file system, pario-bib
Comment: See arunachalam:prefetch.
Abstract: A majority of parallel applications
obtain parallelism by partitioning data over multiple processors. Accessing
distributed data structures like arrays from files often requires each
processor to make a large number of small non-contiguous data requests. This
problem can be addressed by replacing small non-contiguous requests by large
collective requests. This approach, known as Collective I/O, has been found
to work extremely well in practice. In this paper, we describe implementation
and evaluation of a collective I/O prototype in a production parallel file
system on the Intel Paragon. The prototype is implemented in the PFS
subsystem of the Intel Paragon Operating System. We evaluate the collective
I/O performance by comparing it with the PFS M_RECORD and M_UNIX I/O
modes. It is observed that collective I/O provides significant performance
improvement over accesses in M_UNIX mode. However, in many cases, various
implementation overheads cause collective I/O to provide lower performance
than the M_RECORD I/O mode.
Keyword: parallel I/O, multiprocessor file system,
pario-bib
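Illustration for the entry above (not the Paragon PFS implementation): a minimal Python sketch of the idea behind collective I/O, in which the small non-contiguous requests of all processors are merged into a few large contiguous ones. The helper names and the cyclic block distribution are assumptions chosen for the example.

```python
def strided_requests(rank, nprocs, block, nblocks):
    """(offset, length) requests of one processor that owns every
    nprocs-th block of a file (a cyclic distribution, assumed here)."""
    return [((i * nprocs + rank) * block, block) for i in range(nblocks)]


def coalesce(requests):
    """Merge requests that touch or overlap into larger contiguous ones."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged


def collective_requests(nprocs, block, nblocks):
    """Pool the requests of all processors, then coalesce them."""
    pooled = []
    for rank in range(nprocs):
        pooled.extend(strided_requests(rank, nprocs, block, nblocks))
    return coalesce(pooled)
```

With 4 processors, 8-byte blocks, and 3 blocks each, one processor alone issues three separate 8-byte requests, whereas the pooled requests collapse into a single 96-byte request.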
Abstract: It is widely acknowledged that
improving parallel I/O performance is critical for widespread adoption of
high performance computing. In this paper, we show that communication in
out-of-core distributed memory problems may require both inter-processor
communication and file I/O. Thus, in order to improve I/O performance, it is
necessary to minimize the I/O costs associated with a communication step. We
present three methods for performing communication in out-of-core distributed
memory problems. The first method, called the generalized collective
communication method, follows a loosely synchronous model; computation and
communication phases are clearly separated, and communication requires
permutation of data in files. The second method, called receiver-driven
in-core communication, considers only communication required of each in-core
data slab individually. The third method, called owner-driven in-core
communication, goes one step further and tries to identify the potential
future use of data (by the recipients) while it is in the sender's memory. We
describe these methods in detail and present a simple heuristic to choose a
communication method from among the three methods. We then provide
performance results for two out-of-core applications, the two-dimensional FFT
code and the two-dimensional elliptic Jacobi solver. Finally, we discuss how
the out-of-core and in-core communication methods can be used in virtual
memory environments on distributed memory machines.
Comment: See also bordawekar:comm, at ICS'95.
Abstract: In this paper, we describe a framework
for optimizing communication and I/O costs in out-of-core problems. We focus
on communication and I/O optimization within a FORALL construct. We show that
existing frameworks do not extend directly to out-of-core problems and
cannot exploit the FORALL semantics. We present a unified framework for the
placement of I/O and communication calls and apply it for optimizing
communication for stencil applications. Using the experimental results, we
demonstrate that correct placement of I/O and communication calls can
completely eliminate extra file I/O from communication and obtain significant
performance improvement.
Keyword: parallel I/O, compiler, pario-bib
Abstract: For an increasing number of data
intensive scientific applications, parallel I/O concepts are a major
performance issue. Tackling this issue, we provide an outline of an
input/output system designed for highly efficient, scalable and conveniently
usable parallel I/O on distributed memory systems. The main focus of this
paper is the parallel I/O runtime system support provided for
software-generated programs produced by parallelizing compilers in the
context of High Performance FORTRAN efforts. Specifically, our design is
presented in the context of the Vienna Fortran Compilation System.
Keyword: compiler transformations, runtime
support, parallel I/O, prefetching, pario-bib
Keyword: parallel I/O, high performance mass
storage system, high performance languages, compilation techniques, data
administration, pario-bib
Keyword: parallel I/O, out of core, irregular
applications, compiler, pario-bib
Keyword: parallel I/O, RAID, pario-bib
Comment: A parallelized RAID architecture that
distributes the RAID controller operations across several worker nodes.
Multiple hosts can connect to different workers, allowing multiple paths into
the array. The workers then communicate on their own fast interconnect to
accomplish the requests, distributing parity computations across multiple
workers. They get much better performance and reliability than plain RAID.
They built a prototype and a performance simulator. Two-phase commit was
needed for request atomicity, and a request sequencer was needed for
serialization. Also found it was good to give the whole request info to all
workers and to let them figure out what to do and when. Superseded by
cao:tickertaip-tr2 and cao:tickertaip.
Abstract: This paper gives an overview of the I/O
data mapping mechanisms of {\em ParFiSys}. Grouped management and
parallelization are presented as relevant features. I/O data mapping
mechanisms of {\em ParFiSys}, including all levels of the hierarchy, are
described in this paper.
Keyword: parallel I/O, multiprocessor file system,
pario-bib
Keyword: parallel I/O, I/O architecture, pario-bib
Comment: See carretero:*, rosales:cds, perez:clfs.
Abstract: In today's workstation based
environment, applications such as design databases, multimedia databases, and
knowledge bases do not fit well into the relational data processing
framework. The object-oriented data model has been proposed to model and
process such complex databases. Due to the nature of the supported
applications, object-oriented database systems need efficient mechanisms for
the retrieval of complex objects and the navigation along the semantic links
among objects. Object clustering and buffering have been suggested as
efficient mechanisms for the retrieval of complex objects. However, to
improve the efficiency of the aforementioned operations, one has to look at
the recent advances in storage technology. This paper is an attempt to
investigate the feasibility of using parallel disks for object-oriented
databases. It analyzes the conceptual changes needed to map the clustering
and buffering schemes proposed on the new underlying architecture. The
simulation and performance evaluation of the proposed leveled-clustering and
mapping schemes utilizing parallel I/O disks are presented and analyzed.
Keyword: parallel I/O, disk array, object oriented
database, pario-bib
Abstract: We present an analytical performance
model for Panda, a library for synchronized i/o of large multidimensional
arrays on parallel and sequential platforms, and show how the Panda
developers use this model to evaluate Panda's parallel i/o performance and
guide future Panda development. The model validation shows that system
developers can simplify performance analysis, identify potential performance
bottlenecks, and study the design trade-offs for Panda on massively parallel
platforms more easily than by conducting empirical experiments. More
importantly, we show that the outputs of the performance model can be used to
help make optimal plans for handling application i/o requests, the first step
toward our long-term goal of automatically optimizing i/o request handling in
Panda.
Keyword: performance modeling, parallel I/O,
pario-bib
Comment: Web and CDROM only.
Keyword: verify pages, parallel I/O, RAID, disk
array, pario-bib
Abstract: We present a collection of new
techniques for designing and analyzing efficient external-memory algorithms
for graph problems and illustrate how these techniques can be applied to a
wide variety of specific problems. Our results include: \begin{itemize} \item
{\em Proximate-neighboring}. We present a simple method for deriving
external-memory lower bounds via reductions from a problem we call the
``proximate neighbors'' problem. We use this technique to derive non-trivial
lower bounds for such problems as list ranking, expression tree evaluation,
and connected components. \item {\em PRAM simulation}. We give methods for
efficiently simulating PRAM computations in external memory, even for some
cases in which the PRAM algorithm is not work-optimal. We apply this to
derive a number of optimal (and simple) external-memory graph algorithms.
\item {\em Time-forward processing}. We present a general technique for
evaluating circuits (or ``circuit-like'' computations) in external memory. We
also use this in a deterministic list ranking algorithm. \item {\em
Deterministic 3-coloring of a cycle}. We give several optimal methods for
3-coloring a cycle, which can be used as a subroutine for finding large
independent sets for list ranking. Our ideas go beyond a straightforward PRAM
simulation, and may be of independent interest. \item {\em External
depth-first search}. We discuss a method for performing depth first search
and solving related problems efficiently in external memory. Our technique
can be used in conjunction with ideas due to Ullman and Yannakakis in order
to solve graph problems involving closed semi-ring computations even when
their assumption that vertices fit in main memory does not hold.
\end{itemize} Our techniques apply to a number of problems, including
list ranking, which we discuss in detail, finding Euler tours,
expression-tree evaluation, centroid decomposition of a tree, least-common
ancestors, minimum spanning tree verification, connected and biconnected
components, minimum spanning forest, ear decomposition, topological sorting,
reachability, graph drawing, and visibility representation.
Abstract: The paper presents an analytical model
of a whole disk array architecture, XDAC, which consists of several major
subsystems and features: the two-dimensional array structure; IO-bus with
split transaction protocol; and cache for processing multiple I/O requests in
parallel. Our modelling approach is based on a subsystem access time per
request (SATPR) concept, in which we model for each subsystem the mean access
time per disk array request. The model is fed with a given set of
representative workload parameters and then used to conduct performance
analysis for exploring the impact of fork/join synchronization as well as
evaluating some architectural design issues of the XDAC system. Moreover, by
comparing the SATPRs of subsystems, we can identify the bottleneck for
performance improvements.
Keyword: disk array, performance evaluation,
analytical model, parallel I/O, pario-bib
Keyword: file system, database, parallel I/O,
pario-bib
Comment: A position paper for the Strategic
Directions in Computing Research workshop at MIT in June 1996.
Abstract: We explore the method of combining the
replication and parity approaches to tolerate multiple disk failures in a
disk array. In addition to the conventional mirrored and chained declustering
methods, a method based on the hybrid of mirrored-and-chained declustering is
explored. A performance study that explores the effect of combining
replication and parity approaches is conducted. It is experimentally shown
that the proposed approach can lead to the most cost-effective solution if
the objective is to sustain the same load as before the failures.
Keyword: fault tolerance, disk array, replication,
declustering, parallel I/O, pario-bib
Comment: Consider hybrid chained and mirrored
declustering.
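Illustration for the entry above (replication placement only, not the paper's combined replication-and-parity scheme or its hybrid method): a minimal Python sketch contrasting mirrored and chained declustering, assuming the usual definitions in which mirroring pairs disks (0,1), (2,3), ... and chaining puts the backup of the data on disk i onto disk (i+1) mod n. The function names are hypothetical.

```python
def mirrored_placement(fragment, ndisks):
    """Both copies live on one fixed pair of disks: (0,1), (2,3), ..."""
    pair = fragment % (ndisks // 2)
    return 2 * pair, 2 * pair + 1


def chained_placement(fragment, ndisks):
    """Primary on disk i, backup on the next disk in the chain."""
    primary = fragment % ndisks
    return primary, (primary + 1) % ndisks


def survives(placement, ndisks, nfragments, failed):
    """True if every fragment still has at least one copy on a live disk."""
    failed = set(failed)
    return all(
        not set(placement(f, ndisks)) <= failed
        for f in range(nfragments)
    )
```

In this sketch, with four disks, chained declustering loses data when two adjacent disks fail, and mirroring loses data only when both disks of one pair fail; the two schemes therefore tolerate different double-failure patterns.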
Keyword: multiprocessor file system, Vesta,
parallel I/O, pario-bib
Comment: See also corbett:pfs, corbett:vesta*,
feitelson:pario. This is the ultimate Vesta reference. There seem to be only
a few small things that are completely new over what's been published
elsewhere, although this presentation is much more complete and polished.
Keyword: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: Specs of the proposed SIO low-level
interface for parallel file systems. Key features: linear file model,
scatter-gather read and write calls (list of strided segments), asynch
versions of all calls, extensive hint system. Naming structure is
unspecified; no directories specified. Permissions left out. Some control
over client caching and over disk layout. Each file has a (small) 'label',
which is just a little space for application-controlled meta data. Optional
extensions: collective read and write calls, fast copy.
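Illustration for the comment above (not the SIO API): a minimal Python sketch of a gather read driven by a list of strided segments over a linear file. The (offset, size, stride, count) tuple layout and the function names are assumptions made for the example, and an in-memory bytes object stands in for the file.

```python
def expand(segments):
    """Yield the (offset, size) byte ranges described by strided segments.

    Each segment is a hypothetical (offset, size, stride, count) tuple:
    `count` pieces of `size` bytes, starting at `offset`, `stride` apart.
    """
    for offset, size, stride, count in segments:
        for i in range(count):
            yield offset + i * stride, size


def gather_read(data, segments):
    """Collect all described byte ranges of `data` into one buffer."""
    return b"".join(data[off:off + n] for off, n in expand(segments))
```

A single call can thus describe many small pieces, which is what lets a file system see the whole access at once instead of one request per piece.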
Abstract: Although several algorithms have been
developed for the Parallel Disk Model (PDM), few have been implemented.
Consequently, little has been known about the accuracy of the PDM in
measuring I/O time and total time to perform an out-of-core computation. This
paper analyzes timing results on a uniprocessor with several disks for two
PDM algorithms, out-of-core radix sort and BMMC permutations, to determine
the strengths and weaknesses of the PDM. The results indicate the
following. First, good PDM algorithms are usually not I/O bound. Second, of
the four PDM parameters, two (problem size and memory size) are good
indicators of I/O time and running time, but the other two (block size and
number of disks) are not. Third, because PDM algorithms tend not to be I/O
bound, asynchronous I/O effectively hides I/O times. The software
interface to the PDM is part of the ViC* run-time library. The interface is a
set of wrappers that are designed to be both efficient and portable across
several parallel file systems and target machines.
Keyword: parallel I/O, parallel I/O algorithm,
compiler, pario-bib
Keyword: verify month number volume and pages,
parallel I/O, out of core, scientific computing, FFT, pario-bib
Abstract: The Fast Fourier Transform (FFT) plays
a key role in many areas of computational science and engineering. Although
most one-dimensional FFT problems can be solved entirely in main
memory, some important classes of applications require out-of-core
techniques. For these, use of parallel I/O systems can improve performance
considerably. This paper shows how to perform one-dimensional FFTs using a
parallel disk system with independent disk accesses. We present both
analytical and experimental results for performing out-of-core FFTs in two
ways: using traditional virtual memory with demand paging, and using a
provably asymptotically optimal algorithm for the Parallel Disk Model (PDM)
of Vitter and Shriver. When run on a DEC 2100 server with a large memory and
eight parallel disks, the optimal algorithm for the PDM runs up to 144.7
times faster than in-core methods under demand paging. Moreover, even
including I/O costs, the normalized times for the optimal PDM algorithm are
competitive with, or better than, those for in-core methods even when they run
entirely in memory.
Keyword: parallel I/O, out of core, scientific
computing, FFT, pario-bib
Abstract: This paper extends an earlier
out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the
Parallel Disk Model (PDM) to use multiple processors. Four out-of-core
multiprocessor methods are examined. Operationally, these methods differ in
the size of "mini-butterfly" computed in memory and how the data are
organized on the disks and in the distributed memory of the multiprocessor.
The methods also perform differing amounts of I/O and communication. Two of
them have the remarkable property that even though they are computing the FFT
on a multiprocessor, all interprocessor communication occurs outside the
mini-butterfly computations. Performance results on a small workstation
cluster indicate that except for unusual combinations of problem size and
memory size, the methods that do not perform interprocessor communication
during the mini-butterfly computations require approximately 86\% of the time
of those that do. Moreover, the faster methods are much easier to implement.
Keyword: parallel I/O, out of core, scientific
computing, FFT, pario-bib
Comment: Extends the work of cormen:fft.
Keyword: file caching, multiprocessor file system,
cooperative caching, parallel I/O, pario-bib
Comment: See cortes:paca.
Abstract: In this paper we describe PAFS, a new
parallel/distributed file system. Within the whole file system, special
interest is placed on the caching mechanism. We present a cooperative cache
that has the advantages of cooperation and avoids the problems derived from
the coherence mechanisms. Furthermore, this has been achieved with a
reasonable gain in performance. In order to show the obtained performance, we
present a comparison between PAFS and xFS (a file system that also implements
a cooperative cache).
Keyword: verify pages, file caching,
multiprocessor file system, cooperative caching, cache coherence, parallel
I/O, pario-bib
Comment: Contact toni@ac.upc.es.
Keyword: workload characterization, scientific
computing, parallel programming, message passing, pario-bib
Comment: Some mention of I/O.
Abstract: Many parallel application areas that
exploit massive parallelism, such as climate modeling, require massive
storage systems for the archival and retrieval of data sets. As such,
advances in massively parallel computation must be coupled with advances in
mass storage technology in order to satisfy I/O constraints of these
applications. We demonstrate the effects of such I/O-computation disparity
for a representative distributed information system, NASA's Earth Observing
System Distributed Information System (EOSDIS). We use performance modeling
to identify bottlenecks in EOSDIS for two representative user scenarios from
climate change research.
Keyword: climate modeling, performance modeling,
parallel I/O, pario-bib
Abstract: MPI-IO provides a demonstrably
efficient portable parallel Input/Output interface, compatible with the MPI
standard. PMPIO is a "reference implementation" of MPI-IO, developed at NASA
Ames Research Center. To date, PMPIO has been ported to the IBM SP-2, SGI and
Sun shared memory workstations, the Intel Paragon, and the Cray J90.
Preliminary results using the PMPIO implementation of MPI-IO show an
improvement of as much as a factor of 20 on the NAS BTIO benchmark compared
to a Fortran based implementation. We show comparative results on the SP-2,
Paragon, and SGI architectures.
Keyword: parallel I/O, pario-bib
Keyword: parallel I/O, network-attached storage,
distributed file systems, pario-bib
Comment: See
http://www.cs.cmu.edu/Groups/NASD/ARPA96/server.html
Abstract: In recent years advances in
computational speed have been the main focus of research and development in
high performance computing. In comparison, the improvement in I/O performance
has been modest. Faster processing speeds have created a need for faster I/O
as well as for the storage and retrieval of vast amounts of data. The
technology needed to develop these mass storage systems exists today. Robotic
storage libraries are vital components of such systems. However, they
normally exhibit high latency and long transmission times. We analyze the
performance of robotic storage libraries and study striping as a technique
for improving response time. Although striping has been extensively studied
in the context of disk arrays, the architectural differences between robotic
storage libraries and arrays of disks suggest that a separate study of
striping techniques in such libraries would be beneficial.
Keyword: mass storage, parallel I/O, pario-bib
Abstract: Requirements for a high-performance,
scalable digital library of multimedia data are presented together with a
layered architecture for a system that addresses the requirements. The
approach is to view digital data as persistent collections of complex objects
and to use lightweight object management to manage this data. To scale as the
amount of data increases, the object management component is layered over a
storage management component. The storage management component supports
hierarchical storage, third-party data transfer and parallel input-output.
Several issues that arise from the interface between the storage management
and object management components are discussed. The authors have developed a
prototype of a digital library using this design. Two key components of the
prototype are AIM Net and HPSS. AIM Net is a persistent object manager and is
a product of Oak Park Research. HPSS is the High Performance Storage System,
developed by a collaboration including IBM Government Systems and several
national labs.
Keyword: mass storage, parallel I/O, pario-bib
Abstract: The evolution of system architectures
and system configurations has created the need for a new supercomputer system
interconnect. Attributes required of the new interconnect include commonality
among system and subsystem types, scalability, low latency, high bandwidth, a
high level of resiliency, and flexibility. Cray Research Inc. is developing a
new system channel to meet these interconnect requirements in future systems.
The channel has a ring-based architecture, but can also function as a
point-to-point link. It integrates control and data on a single, physical
path while providing low latency and variance for control messages. Extensive
features for client isolation, diagnostic capabilities, and fault tolerance
have been incorporated into the design. The attributes and features of this
channel are discussed along with implementation and protocol specifics.
Keyword: mass storage, I/O architecture, I/O
interconnect, supercomputer, parallel I/O, pario-bib
Comment: About the Cray Research SCX channel,
capable of 1200 MB/s peak and 900 MB/s delivered throughput.
Keyword: mass storage, parallel I/O,
multiprocessor file system interface, pario-bib
Abstract: Since many large-scale computational
problems deal with large quantities of data, optimizing the
performance of I/O subsystems of massively parallel machines is an important
challenge for system designers. We describe data access reorganization
strategies for efficient compilation of out-of-core data-parallel programs on
distributed memory machines. Our analytical approach and experimental results
indicate that the optimizations introduced in this paper can reduce the
amount of time spent in I/O by as much as an order of magnitude on both
uniprocessors and multicomputers.
Keyword: verify pages, parallel I/O, compiler,
out-of-core, pario-bib
Abstract: Programs accessing disk-resident arrays
perform poorly in general due to an excessive number of I/O calls and
insufficient help from compilers. In this paper, in order to alleviate this
problem, we propose a series of compiler optimizations. Both the analytical
approach we use and the experimental results provide strong evidence that our
method is very effective on uniprocessors for out-of-core nests whose data
sizes far exceed the size of available memory.
Keyword: verify publisher, parallel I/O, compiler,
out-of-core, pario-bib
Abstract: This paper describes optimization
techniques for translating out-of-core programs written in a data parallel
language to message passing node programs with explicit parallel I/O. We
demonstrate that straightforward extension of in-core compilation techniques
does not work well for out-of-core programs. We then describe how the
compiler can optimize the code by (1) determining appropriate file layouts
for out-of-core arrays, (2) permuting the loops in the nest(s) to allow
efficient file access, and (3) partitioning the available node memory among
references based on I/O cost estimation. Our experimental results indicate
that these optimizations can reduce the amount of time spent in I/O by as
much as an order of magnitude.
Keyword: verify pages, compiler, data-parallel,
out-of-core, parallel I/O, pario-bib
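Illustration for the entry above (a back-of-the-envelope model, not the paper's cost estimation): a minimal Python sketch of why file layout and loop order matter for an out-of-core array. It assumes an n x n array stored row-major in a file, read tile by tile, with one I/O call per contiguous run of elements; the function name and the model are assumptions.

```python
def io_calls(n, tile_rows, tile_cols):
    """I/O calls needed to read an n x n row-major file array in tiles.

    A tile spanning full rows is one contiguous run in the file;
    otherwise each row of the tile is a separate run.
    Assumes the tile shape divides n evenly.
    """
    tiles = (n // tile_rows) * (n // tile_cols)
    runs_per_tile = 1 if tile_cols == n else tile_rows
    return tiles * runs_per_tile
```

Under this model, with n = 1024 and memory for 65536 elements, row slabs of 64 x 1024 need 16 calls in total, while column slabs of 1024 x 64 need 16384, which is the kind of gap that a layout choice or loop permutation is meant to close.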
Abstract: This paper describes a framework by
which an out-of-core stencil program written in a data-parallel language can
be translated into node programs in a distributed-memory message-passing
machine with explicit I/O and communication. We focus on a technique called
\emph{Data Space Tiling} to group data elements into slabs that can fit into
memories of processors. Methods to choose \emph{legal} tile shapes under
several constraints and deadlock-free scheduling of tiles are investigated.
Our approach is \emph{unified} in the sense that it can be applied to both
FORALL loops and the loops that involve flow-dependences.
Keyword: parallel I/O, compiler, out-of-core,
pario-bib
Keyword: disk prefetching, parallel I/O, pario-bib
Comment: They do a theoretical analysis of
prefetching and caching in uniprocessor, single- and multi-disk situations,
given that they know the complete access sequence; their measure is not hit
rate but rather overall execution time. They found some algorithms that are
close to optimal.
Abstract: High-performance I/O systems depend on
prefetching and caching in order to deliver good performance to applications.
These two techniques have generally been considered in isolation, even though
there are significant interactions between them; a block prefetched too early
reduces the effectiveness of the cache, while a block cached too long reduces
the effectiveness of prefetching. In this paper we study the effects of
several combined prefetching and caching strategies for systems with multiple
disks. Using disk-accurate trace-driven simulation, we explore the
performance characteristics of each of the algorithms in cases in which
applications provide full advance knowledge of accesses using hints. Some of
the strategies have been published with theoretical performance bounds, and
some are components of systems that have been built. One is a new algorithm
that combines the desirable characteristics of the others. We find that when
performance is limited by I/O stalls, aggressive prefetching helps to
alleviate the problem; that more conservative prefetching is appropriate when
significant I/O stalls are not present; and that a single, simple strategy is
capable of doing both.
Keyword: parallel I/O, tracing, prefetch,
trace-driven simulation, pario-bib
Abstract: Mission to Planet Earth (MTPE) is a
long-term NASA research mission to study the processes leading to global
climate change. The EOS Data and Information System (EOSDIS) is the component
within MTPE that will provide the Earth science community with easy,
affordable, and reliable access to Earth science data. EOSDIS is a
distributed system, with major facilities at eight Distributed Active Archive
Centers (DAACs) located throughout the United States. At the DAACs the
Science Data Processing Segment (SDPS) will receive, process, archive, and
manage all data. It is estimated that several hundred gigaflops of processing
power will be required to process and archive the several terabytes of new
data that will be generated and distributed daily. Thousands of science users
and perhaps several hundred thousand nonscience users will access the system.
Keyword: mass storage, I/O architecture, parallel
I/O, pario-bib
Abstract: Scientific applications are
increasingly being implemented on massively parallel supercomputers. Many of
these applications have intense I/O demands, as well as massive computational
requirements. This paper is essentially an annotated bibliography of papers
and other sources of information about scientific applications using parallel
I/O. It will be updated periodically.
Keyword: parallel I/O application, file access
patterns, pario-bib
Abstract: As we gain experience with parallel
file systems, it becomes increasingly clear that a single solution does not
suit all applications. For example, it appears to be impossible to find a
single appropriate interface, caching policy, file structure, or
disk-management strategy. Furthermore, the proliferation of file-system
interfaces and abstractions makes applications difficult to port. We
propose that the traditional functionality of parallel file systems be
separated into two components: a fixed core that is standard on all
platforms, encapsulating only primitive abstractions and interfaces, and a
set of high-level libraries to provide a variety of abstractions and
application-programmer interfaces (APIs). We present our current and
next-generation file systems as examples of this structure. Their features,
such as a three-dimensional file structure, strided read and write
interfaces, and I/O-node programs, are specifically designed with the
flexibility and performance necessary to support a wide range of
applications.
Keyword: parallel I/O, multiprocessor file system,
dfk, pario-bib
Comment: Nearly identical to kotz:flexibility. The
only changes are the format, a shorter abstract, and updates to Section 7 and
the references.
Abstract: STARFISH is a parallel file-system
simulator we built for our research into the concept of disk-directed I/O. In
this report, we detail steps taken to tune the file systems supported by
STARFISH, which include a traditional parallel file system (with caching) and
a disk-directed I/O system. In particular, we now support two-phase I/O, use
smarter disk scheduling, increased the maximum number of outstanding requests
that a compute processor may make to each disk, and added gather/scatter
block transfer. We also present results of the experiments driving the tuning
effort.
Keyword: parallel I/O, multiprocessor file system,
pario-bib
Comment: Reports on some new changes to the
STARFISH simulator that implements traditional caching and disk-directed I/O.
This is meant mainly as a companion to kotz:jdiskdir. See also kotz:jdiskdir,
kotz:diskdir, kotz:expand.
Abstract: Recent studies have demonstrated that a
significant number of I/O operations are performed by a number of classes of
different parallel applications. Appropriate I/O management strategies are
required, however, to harness the power of parallel I/O. This paper focuses
on two I/O management issues that affect system performance in
multiprogrammed parallel environments. Characterization of I/O behavior of
parallel applications in terms of four different models is discussed first,
followed by an investigation of the performance of a number of different data
distribution strategies. Using computer simulations this research shows that
I/O characteristics of applications and data distribution have an important
effect on system performance. Applications that can simultaneously do
computation and I/O, plus strategies that can incorporate centralized I/O
management are found to be beneficial for a multiprogrammed parallel
environment.
Keyword: parallel I/O, pario-bib
Comment: See majumdar:management.
Keyword: parallel I/O, disk array, RAID, pario-bib
Comment: An early paper, perhaps the earliest,
that describes the techniques that later became RAID. Lawlor notes how to use
parity to recover data lost due to disk crash, as in RAID3, addresses the
read-before-write problem by caching the old data block as well as the new
data block, and shows how two-dimensional parity can protect against two or
more failures.
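A minimal sketch of the single-parity idea mentioned in the comment above, assuming equal-length blocks held as Python bytes. The function names (xor_blocks, make_parity, recover) are hypothetical and are not taken from Lawlor's paper; the sketch shows only the basic XOR recovery of one lost block, not the two-dimensional scheme or the read-before-write caching.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Parity block = XOR of all data blocks (one block per disk)."""
    return xor_blocks(data_blocks)

def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Illustrative data only: three "disks", each holding one 5-byte block.
data = [b"disk0", b"disk1", b"disk2"]
parity = make_parity(data)

lost = 1  # pretend the second disk crashed
survivors = [block for i, block in enumerate(data) if i != lost]
rebuilt = recover(survivors, parity)
```

Because XOR is its own inverse, XORing the parity block with every surviving data block leaves exactly the contents of the missing block. A single parity block can therefore tolerate only one failure per stripe.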
Abstract: In this paper we propose
user-controllable I/O operations and explore the effects of them with some
synthetic access patterns. The operations allow users to determine a file
structure matching the access patterns, control the layout and distribution
of data blocks on physical disks, and present various access patterns with a
minimum number of I/O operations. The operations do not use a file pointer to
access data as in typical file systems, which eliminates the overhead of
managing the offset of the file, making it easy to share data and reducing
the number of I/O operations.
Keyword: logical disks, parallel I/O, pario-bib
Keyword: parallel I/O, distributed file system,
declustering, reliability, pario-bib
Comment: They are trying to build a file server
that is easier to manage than most of today's distributed file systems,
because disks are cheap but management is expensive. They describe a
distributed file server that spreads blocks of all files across many disks
and many servers. They use chained declustering so that they can survive loss
of server or disk. They dynamically balance load. They dynamically
reconfigure when new virtual disks are created or new physical disks are
added. They've built it all and are now going to look at possible file
systems that can take advantage of the features of Petal.
Keyword: disk array, parallel I/O, RAID, analytic
model, pario-bib
Abstract: This paper presents the design of UPIO,
software for user-controllable parallel input and output. UPIO is designed
to maximize I/O performance for scientific applications on MIMD
multicomputers. The most important features of UPIO are: It supports a
domain-specific file model and a variety of application interfaces to present
numerous access patterns. UPIO provides user-controllable I/O operations
that allow users to control data access, file structure, and data
distribution. The domain-specific file model and user controllability give
low I/O overhead and allow programmers to exploit the aggregate bandwidth of
parallel disks.
Keyword: parallel I/O, pario-bib
Comment: They describe an interface that seems to
allow easier access for programmers that want to map matrices onto parallel
files. The concepts are not well explained, so it's hard to really understand
what is new and different. They make no explicit comparison with other
advanced interfaces like that in Vesta or Galley. No performance results.
Abstract: In this paper, we present a framework
for synthesizing I/O efficient out-of-core programs for block recursive
algorithms, such as the fast Fourier transform (FFT) and block matrix
transposition algorithms. Our framework uses an algebraic representation
which is based on tensor products and other matrix operations. The programs
are optimized for the striped Vitter and Shriver's two-level memory model in
which data can be distributed using various cyclic(B) distributions in
contrast to the normally used {\it physical track} distribution cyclic(B_d),
where B_d is the physical disk block size. We first introduce tensor
bases to capture the semantics of block-cyclic data distributions of
out-of-core data and also data access patterns to out-of-core data. We then
present program generation techniques for tensor products and matrix
transposition. We accurately represent the number of parallel I/O operations
required for the synthesized programs for tensor products and matrix
transposition as a function of tensor bases and data distributions. We
introduce an algorithm to determine the data distribution which optimizes the
performance of the synthesized programs. Further, we formalize the procedure
of synthesizing efficient out-of-core programs for tensor product formulas
with various block-cyclic distributions as a dynamic programming problem.
We demonstrate the effectiveness of our approach through several
examples. We show that the choice of an appropriate data distribution can
reduce the number of passes to access out-of-core data by as much as eight
times for a tensor product, and the dynamic programming approach can largely
reduce the number of passes to access out-of-core data for the overall tensor
product formulas.
Keyword: parallel I/O, out-of-core algorithm,
pario-bib
Abstract: Dedicated cluster parallel computers
(DCPCs) are emerging as low-cost high performance environments for many
important applications in science and engineering. A significant class of
applications that perform well on a DCPC are coarse-grain applications that
involve large amounts of file I/O. Current research in parallel file systems
for distributed systems is providing a mechanism for adapting these
applications to the DCPC environment. We present the Parallel Virtual File
System (PVFS), a system that provides disk striping across multiple nodes in
a distributed parallel computer and file partitioning among tasks in a
parallel program. PVFS is unique among similar systems in that it uses a
stream-based approach that represents each file access with a single set of
request parameters and decouples the number of network messages from details
of the file striping and partitioning. PVFS also provides support for
efficient collective file accesses and allows overlapping file partitions. We
present results of early performance experiments that show PVFS achieves
excellent speedups in accessing moderately sized file segments.
Keyword: parallel I/O, cluster computing, parallel
file system, pario-bib
Keyword: multiprocessor file system, prefetching,
caching, parallel I/O, multiprocessor file system interface, pario-bib
Abstract: Traditionally, maximizing input/output
performance has required tailoring application input/output patterns to the
idiosyncrasies of specific input/output systems. The authors show that one
can achieve high application input/output performance via a low overhead
input/output system that automatically recognizes file access patterns and
adaptively modifies system policies to match application requirements. This
approach reduces the application developer's input/output optimization effort
by isolating input/output optimization decisions within a retargetable file
system infrastructure. To validate these claims, they have built a
lightweight file system policy testbed that uses a trained learning mechanism
to recognize access patterns. The file system then uses these access pattern
classifications to select appropriate caching strategies, dynamically
adapting file system policies to changing input/output demands throughout
application execution. The experimental data show dramatic speedups on both
benchmarks and input/output intensive scientific applications.
Keyword: parallel I/O, pario-bib
Abstract: Most studies of processor scheduling in
multiprogrammed parallel systems have ignored the I/O performed by
applications. Recent studies have demonstrated that significant I/O
operations are performed by a number of different classes of parallel
applications. This paper focuses on some basic issues that underlie
scheduling in multiprogrammed parallel environments running applications with
I/O. Characterization of the I/O behavior of parallel applications is
discussed first. Based on simulation models this research investigates the
influence of these I/O characteristics on processor scheduling.
Keyword: workload characterization, scheduling,
parallel I/O, pario-bib
Abstract: The paper studies different schemes to
enhance the reliability, availability and security of a high performance
distributed storage system. We have previously designed a distributed
parallel storage system that employs the aggregate bandwidth of multiple data
servers connected by a high speed wide area network to achieve scalability
and high data throughput. The general approach of the paper employs erasure
error correcting codes to add data redundancy that can be used to retrieve
missing information caused by hardware, software, or human faults. The paper
suggests techniques for reducing the communication and computation overhead
incurred while retrieving missing data blocks from redundant information.
These techniques include clustering, multidimensional coding, and the full
two dimensional parity scheme.
Keyword: parallel I/O, pario-bib
Abstract: Shared file systems which use a
physically shared mass storage device have existed for many years, although
not on UNIX based operating systems. This paper describes a shared file
system (SFS) that was implemented first as a special project on the Cray
Research Inc. (CRI) UNICOS operating system. A more general product was then
built on top of this project using a HIPPI disk array for the shared mass
storage. The design of SFS is outlined, as well as some performance
experiences with the product. We describe how SFS interacts with the OSF
distributed file service (DFS) and with the CRI data migration facility
(DMF). We also describe possible development directions for the SFS product.
Keyword: mass storage, distributed file system,
parallel I/O, pario-bib
Abstract: We propose a framework for I/O in
parallel and distributed systems. The framework is highly customizable and
extendible, and enables programmers to offer high level objects in their
applications, without requiring them to struggle with the low level and
sometimes complex details of high performance distributed I/O. Also, the
framework exploits application specific information to improve I/O
performance by allowing specialized programmers to customize the framework.
Internally, we use indirection and granularity control to support migration,
dynamic load balancing, fault tolerance, etc. for objects of the I/O system,
including those representing application data.
Keyword: input-output programs, object-oriented,
parallel systems; I/O performance, migration, dynamic load balancing, fault
tolerance, parallel I/O, pario-bib
Keyword: network attached peripherals, analytic
model, mass storage, parallel I/O, pario-bib
Keyword: verify publication date and pages,
parallel I/O, multiprocessor file system, interprocessor communication,
pario-bib
Comment: They propose several enhancements to
disk-directed I/O (see kotz:diskdir) that aim to improve performance on
fine-grained distributions, that is, where each block from the disk is broken
into small pieces that are scattered among the compute processors. One
enhancement combines multiple pieces, possibly from separate disk blocks,
into a single message. Another is to use two-phase I/O (see
delrosario:two-phase), but to use disk-directed I/O to read data from the
disks into CP memories, efficiently, then permute. This latter technique is
probably faster than normal two-phase I/O that uses a traditional file
system, not disk-directed I/O, for the read.
Abstract: Although hardware supporting parallel
file I/O has improved greatly since the introduction of first-generation
parallel computers, the programming interface has not. Each vendor provides a
different logical view of parallel files as well as nonportable operations
for manipulating files. Neither do parallel languages provide standards for
performing I/O. In this paper, we describe a view of parallel files for
data-parallel languages, dubbed Stream*, in which each virtual processor
writes to and reads from its own stream. In this scheme each virtual
processor's I/O operations have the same familiar, unambiguous meaning as in
a sequential C program. We demonstrate how I/O operations in Stream* can run
as fast as those of vendor-specific parallel file systems on the operations
most often encountered in data-parallel programs. We show how this system
supports general virtual processor operations for debugging and elemental
functions. Finally, we present empirical results from a prototype Stream*
system running on a Meiko CS-2 multicomputer.
Keyword: data parallel, parallel I/O, pario-bib
Comment: See moore:stream; nearly identical. See
also moore:detection. This paper gives a little bit earlier description of
the Stream* idea than does moore:detection, but you'd be pretty much complete
just reading moore:detection.
Abstract: This paper presents the design and
evaluation of a multi-threaded runtime library for parallel I/O. We extend
the multi-threading concept to separate the compute and I/O tasks in two
separate threads of control. Multi-threading in our design permits a)
asynchronous I/O even if the underlying file system does not support
asynchronous I/O; b) copy avoidance from the I/O thread to the compute thread
by sharing address space; and c) a capability to perform collective I/O
asynchronously without blocking the compute threads. Further, this paper
presents techniques for collective I/O which maximize load balance and
concurrency while reducing communication overhead in an integrated fashion.
Performance results on IBM SP2 for various data distributions and access
patterns are presented. The results show that there is a tradeoff between the
amount of concurrency in I/O and the buffer size designated for I/O; and
there is an optimal buffer size beyond which benefits of larger requests
diminish due to large communication overheads.
Keyword: verify pages, threads, parallel I/O,
pario-bib
Abstract: Current operating systems offer poor
performance when a numeric application's working set does not fit in main
memory. As a result, programmers who wish to solve ``out-of-core'' problems
efficiently are typically faced with the onerous task of rewriting an
application to use explicit I/O operations (e.g., read/write). In this paper,
we propose and evaluate a fully-automatic technique which liberates the
programmer from this task, provides high performance, and requires only
minimal changes to current operating systems. In our scheme, the compiler
provides the crucial information on future access patterns without burdening
the programmer, the operating system supports non-binding prefetch and
release hints for managing I/O, and the operating system cooperates with a
run-time layer to accelerate performance by adapting to dynamic behavior and
minimizing prefetch overhead. This approach maintains the abstraction of
unlimited virtual memory for the programmer, gives the compiler the
flexibility to aggressively move prefetches back ahead of references, and
gives the operating system the flexibility to arbitrate between the competing
resource demands of multiple applications. We have implemented our scheme
using the SUIF compiler and the Hurricane operating system. Our experimental
results demonstrate that our fully-automatic scheme effectively hides the I/O
latency in out-of-core versions of the entire NAS Parallel benchmark suite,
thus resulting in speedups of roughly twofold for five of the eight
applications, with two applications speeding up by threefold or more.
Keyword: compiler, prefetch, parallel I/O,
pario-bib
Comment: Best Paper Award
Keyword: parallel I/O, multiprocessor file system,
pario-bib
Abstract: The development and evaluation of a
tuple set manager (TSM) based on multikey index data structures is a main
part of the PARABASE project at the University of Vienna. The TSM provides
access to parallel mass storage systems using tuple sets instead of
conventional files as the central data structure for application programs. A
proof-of-concept prototype TSM is already implemented and operational on an
iPSC/2. It supports tuple insert and delete operations as well as exact
match, partial match, and range queries at system call level. Available
results are from this prototype on the one hand and from various performance
evaluation figures. The evaluation results demonstrate the performance gain
achieved by the implementation of the tuple set management concept on a
parallel mass storage system.
Keyword: parallel database, mass storage, parallel
I/O, pario-bib
Keyword: buffering, file caching, tertiary
storage, tape robot, file migration, parallel I/O, pario-bib
Comment: Ways to use secondary and tertiary
storage in parallel, and buffering mechanisms for applications with
concurrent I/O requirements.
Abstract: JUMP-1 is a distributed shared-memory
massively parallel computer composed of multiple clusters interconnected
by a network called RDT (Recursive Diagonal Torus). Each cluster in
JUMP-1 consists of 4 element processors, secondary cache memories, and 2 MBP
(Memory Based Processor) for high-speed synchronization and communication
among clusters. The I/O subsystem is connected to a cluster via a high-speed
serial link called STAFF-Link. The I/O buffer memory is mapped onto the
JUMP-1 global shared-memory to permit each I/O access operation as memory
access. In this paper we describe evaluation of the fundamental performance
of the disk I/O subsystem using event-driven simulation, and estimated
performance with a Video On Demand (VOD) application.
Keyword: parallel I/O, I/O architecture, pario-bib
Abstract: This paper presents a measurement and
simulation based study of parallel I/O in a high-performance cluster system:
the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The
measurements were used to characterize the performance bottlenecks and the
throughput limits at the compute and I/O nodes, and to provide realistic
input parameters to PioSim, a simulation environment we have developed to
investigate parallel I/O performance issues in cluster systems. PioSim was
used to obtain a detailed characterization of parallel I/O performance, in
the high performance cluster system, for different regular access patterns
and different system configurations. This paper also explores the use of
local disks at the compute nodes for parallel I/O, and finds that the local
disk architecture outperforms the traditional parallel I/O over remote I/O
node disks architecture, even when as much as 68-75\% of the requests from
each compute node go to remote disks.
Keyword: performance analysis, parallel I/O,
pario-bib
Abstract: In out-of-core computations, disk
storage is treated as another level in the memory hierarchy, below cache,
local memory, and (in a parallel computer) remote memories. However the tools
used to manage this storage are typically quite different from those used to
manage access to local and remote memory. This disparity complicates
implementation of out-of-core algorithms and hinders portability. We describe
a programming model that addresses this problem. This model allows parallel
programs to use essentially the same mechanisms to manage the movement of
data between any two adjacent levels in a hierarchical memory system. We take
as our starting point the Global Arrays shared-memory model and library,
which support a variety of operations on distributed arrays, including
transfer between local and remote memories. We show how this model can be
extended to support explicit transfer between global memory and secondary
storage, and we define a Disk Resident Arrays Library that supports such
transfers. We illustrate the utility of the resulting model with two
applications, an out-of-core matrix multiplication and a large computational
chemistry program. We also describe implementation techniques on several
parallel computers and present experimental results that demonstrate that the
Disk Resident Arrays model can be implemented very efficiently on parallel
computers.
Keyword: parallel I/O, pario-bib
Abstract: Most current multiprocessor file
systems are designed to use multiple disks in parallel, using the high
aggregate bandwidth to meet the growing I/O requirements of parallel
scientific applications. Many multiprocessor file systems provide
applications with a conventional Unix-like interface, allowing the
application to access multiple disks transparently. This interface conceals
the parallelism within the file system, increasing the ease of
programmability, but making it difficult or impossible for sophisticated
programmers and libraries to use knowledge about their I/O needs to exploit
that parallelism. In addition to providing an insufficient interface, most
current multiprocessor file systems are optimized for a different workload
than they are being asked to support. We introduce Galley, a new parallel
file system that is intended to efficiently support realistic scientific
multiprocessor workloads. We discuss Galley's file structure and application
interface, as well as the performance advantages offered by that interface.
Keyword: verify month and pages, parallel file
system, parallel I/O, multiprocessor file system interface, pario-bib, dfk
Comment: A revised version of
nieuwejaar:jgalley-tr, which is a combination of nieuwejaar:galley and
nieuwejaar:galley-perf.
Abstract: Most current multiprocessor file
systems are designed to use multiple disks in parallel, using the high
aggregate bandwidth to meet the growing I/O requirements of parallel
scientific applications. Many multiprocessor file systems provide
applications with a conventional Unix-like interface, allowing the
application to access multiple disks transparently. This interface conceals
the parallelism within the file system, increasing the ease of
programmability, but making it difficult or impossible for sophisticated
programmers and libraries to use knowledge about their I/O needs to exploit
that parallelism. In addition to providing an insufficient interface, most
current multiprocessor file systems are optimized for a different workload
than they are being asked to support. We introduce Galley, a new parallel
file system that is intended to efficiently support realistic scientific
multiprocessor workloads. We discuss Galley's file structure and application
interface, as well as the performance advantages offered by that interface.
Keyword: parallel file system, parallel I/O,
multiprocessor file system interface, pario-bib, dfk
Abstract: Most current multiprocessor file
systems are designed to use multiple disks in parallel, using the high
aggregate bandwidth to meet the growing I/O requirements of parallel
scientific applications. Most multiprocessor file systems provide
applications with a conventional Unix-like interface, allowing the
application to access those multiple disks transparently. This interface
conceals the parallelism within the file system, increasing the ease of
programmability, but making it difficult or impossible for sophisticated
application and library programmers to use knowledge about their I/O to
exploit that parallelism. In addition to providing an insufficient interface,
most current multiprocessor file systems are optimized for a different
workload than they are being asked to support. In this work we examine
current multiprocessor file systems, as well as how those file systems are
used by scientific applications. Contrary to the expectations of the
designers of current parallel file systems, the workloads on those systems
are dominated by requests to read and write small pieces of data.
Furthermore, rather than being accessed sequentially and contiguously, as in
uniprocessor and supercomputer workloads, files in multiprocessor file
systems are accessed in regular, structured, but non-contiguous patterns.
Based on our observations of multiprocessor workloads, we have designed
Galley, a new parallel file system that is intended to efficiently support
realistic scientific multiprocessor workloads. In this work, we introduce
Galley and discuss its design and implementation. We describe Galley's new
three-dimensional file structure and discuss how that structure can be used
by parallel applications to achieve higher performance. We introduce several
new data-access interfaces, which allow applications to explicitly describe
the regular access patterns we found to be common in parallel file system
workloads. We show how these new interfaces allow parallel applications to
achieve tremendous increases in I/O performance. Finally, we discuss how
Galley's new file structure and data-access interfaces can be useful in
practice.
Keyword: parallel I/O, multiprocessor file system,
file system workload characterization, file access patterns, file system
interface, pario-bib
Abstract: Phenomenal improvements in the
computational performance of multiprocessors have not been matched by
comparable gains in I/O system performance. This imbalance has resulted in
I/O becoming a significant bottleneck for many scientific applications. One
key to overcoming this bottleneck is improving the performance of
multiprocessor file systems. The design of a high-performance
multiprocessor file system requires a comprehensive understanding of the
expected workload. Unfortunately, until recently, no general workload studies
of multiprocessor file systems have been conducted. The goal of the CHARISMA
project was to remedy this problem by characterizing the behavior of several
production workloads, on different machines, at the level of individual reads
and writes. The first set of results from the CHARISMA project describes the
workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This
paper is intended to compare and contrast these two workloads for an
understanding of their essential similarities and differences, isolating
common trends and platform-dependent variances. Using this comparison, we are
able to gain more insight into the general principles that should guide
multiprocessor file-system design.
Keyword: parallel I/O, file system workload,
workload characterization, file access pattern, multiprocessor file system,
dfk, pario-bib
Comment: See also kotz:workload,
nieuwejaar:strided, ap:workload.
Abstract: We present an elegant deterministic
load balancing strategy for distribution sort that is applicable to a wide
variety of parallel disks and parallel memory hierarchies with both single
and parallel processors. The simplest application of the strategy is an
optimal deterministic algorithm for external sorting with multiple disks and
parallel processors. In each input/output (I/O) operation, each of the $D
\geq 1$ disks can simultaneously transfer a block of $B$ contiguous records.
Our two measures of performance are the number of I/Os and the amount of work
done by the CPU(s); our algorithm is simultaneously optimal for both
measures. We also show how to sort deterministically in parallel memory
hierarchies. When the processors are interconnected by any sort of a PRAM,
our algorithms are optimal for all parallel memory hierarchies; when the
interconnection network is a hypercube, our algorithms are either optimal or
best-known.
Comment: Short version of nodine:sort2 and
nodine:sortdisk.
Abstract: Existing parallel programming
environments for networks of workstations improve the performance of
computationally intensive applications by using message passing or virtual
shared memory to alleviate CPU bottlenecks. This paper describes an approach
based on message passing that addresses both CPU and I/O bottlenecks for a
specific class of distributed applications on ATM networks. ATM provides the
bandwidth required to utilize multiple I/O channels in parallel. This paper
also describes an environment based on distributed process management and
centralized application management that implements the approach. The
environment adds processes to a running application when necessary to
alleviate CPU and I/O bottlenecks while managing process connections in a
manner that is transparent to the application.
Keyword: parallel I/O, ATM, parallel networking,
pario-bib
Abstract: Fast, accurate imaging of complex,
oil-bearing geologies, such as overthrusts and salt domes, is the key to
reducing the costs of domestic oil and gas exploration. Geophysicists say
that the known oil reserves in the Gulf of Mexico could be significantly
increased if accurate seismic imaging beneath salt domes were possible. A
range of techniques exists for imaging these regions, but the highly accurate
techniques involve the solution of the wave equation and are characterized by
large data sets and large computational demands. Massively parallel computers
can provide the computational power for these highly accurate imaging
techniques. A brief introduction to seismic processing will be
presented, and the implementation of a seismic-imaging code for distributed
memory computers will be discussed. The portable code, Salvo, performs a
wave-equation-based, 3-D, prestack, depth imaging and currently runs on the
Intel Paragon, the Cray T3D and SGI Challenge series. It uses MPI for
portability, and has sustained 22 Mflops/sec/proc (compiled FORTRAN) on the
Intel Paragon.
Keyword: multiprocessor application, scientific
computing, seismic data processing, parallel I/O, pario-bib
Comment: 2 pages about their I/O scheme, mostly
regarding a calculation of the optimal balance between compute nodes and I/O
nodes to achieve perfect overlap.
Abstract: Scientific applications often require
some strategy for temporary data storage to do the largest possible
simulations. The use of virtual memory for temporary data storage has
received criticism because of performance problems. However, modern virtual
memory found in recent operating systems such as Cenju-3/DE gives application
writers control over virtual memory policies. We demonstrate that custom
virtual memory policies can dramatically reduce virtual memory overhead and
allow applications to run out-of-core efficiently. We also demonstrate that
the main advantage of virtual memory, namely programming simplicity, is not
lost.
Keyword: virtual memory, file interface,
scientific applications, out-of-core, parallel I/O, pario-bib
Comment: Web and CDROM only.
Abstract: Hierarchical treecodes have, to a large
extent, converted the compute-bound N-body problem into a memory-bound
problem. The large ratio of DRAM to disk pricing suggests use of out-of-core
techniques to overcome memory capacity limitations. We will describe a
parallel, out-of-core treecode library, targeted at machines with independent
secondary storage associated with each processor. Borrowing the space-filling
curve techniques from our in-core library, and "manually" paging, results
in excellent spatial and temporal locality and very good performance.
Keyword: verify pages and month, parallel I/O, out
of core applications, scientific computing, pario-bib
Keyword: verify, parallel I/O, disk array, disk
striping, load balance, pario-bib
Comment: Updated version of scheuermann:partition.
Keyword: parallel I/O, collective I/O, pario-bib
Abstract: Multidimensional arrays are a
fundamental data type in scientific computing and are used extensively across
a broad range of applications. Often these arrays are persistent, i.e., they
outlive the invocation of the program that created them. Portability and
performance with respect to input and output (i/o) pose significant
challenges to applications accessing large persistent arrays, especially in
distributed-memory environments. A significant number of scientific
applications perform conceptually simple array i/o operations, such as
reading or writing a subarray, an entire array, or a list of arrays. However,
the algorithms to perform these operations efficiently on a given platform
may be complex and non-portable, and may require costly customizations to
operating system software. This thesis presents a high-level interface
for array i/o and three implementation architectures, embodied in the Panda
(Persistence AND Arrays) array i/o library. The high-level interface
contributes to application portability, by encapsulating unnecessary details
and being easy to use. Performance results using Panda demonstrate that an
i/o system can provide application programs with a high-level, portable,
easy-to-use interface for array i/o without sacrificing performance or
requiring custom system software; in fact, combining all these benefits may
only be possible through a high-level interface due to the great freedom and
flexibility a high-level interface provides for the underlying
implementation. The Panda server-directed i/o architecture is a prime
example of an efficient implementation of collective array i/o for closely
synchronized applications in distributed-memory single-program multiple-data
(SPMD) environments. A high-level interface is instrumental to the good
performance of server-directed i/o, since it provides a global view of an
upcoming collective i/o operation that Panda uses to plan sequential reads
and writes. Performance results show that with server-directed i/o, Panda
achieves throughputs close to the maximum AIX file system throughput on the
i/o nodes of the IBM SP2 when reading and writing large multidimensional
arrays.
Keyword: parallel I/O, persistent data, parallel
computing, pario-bib
Comment: see also chen:panda, seamons:panda,
seamons:compressed, seamons:interface, seamons:schemas, seamons:msio,
seamons:jpanda
Abstract: Current APIs for multiprocessor
multi-disk file systems are not easy to use in developing out-of-core
algorithms that choreograph parallel data accesses. Consequently, the
efficiency of these algorithms is hard to achieve in practice. We address
this deficiency by specifying an API that includes data-access primitives for
data choreography. With our API, the programmer can easily access specific
blocks from each disk in a single operation, thereby fully utilizing the
parallelism of the underlying storage system. Our API supports the
development of libraries of commonly-used higher-level routines such as
matrix-matrix addition, matrix-matrix multiplication, and BMMC
(bit-matrix-multiply/complement) permutations. We illustrate our API in
implementations of these three high-level routines to demonstrate how easy it
is to use.
Keyword: parallel I/O, multiprocessor file system
interface, pario-bib
Comment: Also published as Courant Institute Tech
Report 708.
Abstract: We estimate the performance of a
network-wide concurrent file system implemented using conventional disks as
disk arrays. Tests were carried out on both single system and network-wide
environments. On single systems, a file was split across several disks to
test the performance of file I/O operations. We concluded that performance
was proportional to the number of disks, up to four, on a system with high
computing power. Performance of a system with low computing power, however,
did not increase, even with more than two disks. When we split a file across
disks in a network-wide system called the Network-wide Concurrent File System
(N-CFS), we found performance similar to or slightly higher than that of disk
arrays on single systems. Since file access through N-CFS is transparent,
this system enables traditional disks on single and networked systems to be
used as disk arrays for I/O intensive jobs.
Keyword: mass storage, cluster computing,
distributed file system, parallel I/O, pario-bib
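Illustration: A minimal sketch of how a file split across several disks can
be addressed. The abstract above does not say how N-CFS places data, so the
round-robin layout, the fixed stripe unit, and the names locate and
split_request are assumptions made only for this example.

```python
def locate(offset, stripe_unit, num_disks):
    """Map a byte offset in a striped file to (disk, offset on that disk)."""
    stripe_index = offset // stripe_unit     # which stripe unit holds the byte
    disk = stripe_index % num_disks          # assumed round-robin placement
    local_unit = stripe_index // num_disks   # units that precede it on that disk
    return disk, local_unit * stripe_unit + offset % stripe_unit


def split_request(offset, length, stripe_unit, num_disks):
    """Break one file request into per-disk pieces (disk, disk_offset, nbytes)."""
    pieces = []
    end = offset + length
    while offset < end:
        disk, disk_offset = locate(offset, stripe_unit, num_disks)
        # A piece never crosses a stripe-unit boundary.
        nbytes = min(stripe_unit - offset % stripe_unit, end - offset)
        pieces.append((disk, disk_offset, nbytes))
        offset += nbytes
    return pieces
```

Under this layout a large sequential request touches every disk, so the
pieces can be serviced in parallel; that is the usual reason throughput can
grow with the number of disks until some other resource saturates.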
Abstract: The modest I/O configurations and file
system limitations of many current high-performance systems preclude solution
of problems with large I/O needs. I/O hardware and file system parallelism is
the key to achieving high performance. We analyze the I/O behavior of several
versions of two scientific applications on the Intel Paragon XP/S. The
versions involve incremental application code enhancements across multiple
releases of the operating system. Studying the evolution of I/O access
patterns underscores the interplay between application access patterns and
file system features. Our results show that both small and large request
sizes are common, that at present, application developers must manually
aggregate small requests to obtain high disk transfer rates, that concurrent
file accesses are frequent, and that appropriate matching of the application
access pattern and the file system access mode can significantly increase
application I/O performance. Based on these results, we describe a set of
file system design principles.
Keyword: I/O, workload characterization,
scientific computing, parallel I/O, pario-bib
Comment: They study two applications over several
versions, using Pablo to capture the I/O activity. They thus watch as
application developers improve the applications' use of I/O modes and request
sizes. Both applications move through three phases: initialization,
computation (with out-of-core I/O or checkpointing I/O), and output. They
found it necessary to tune the I/O request sizes to match the parameters of
the I/O system. In the initial versions, the code used small read and write
requests, which were (according to the developers) the "easiest and most
natural implementation for their I/O." They restructured the I/O to make
bigger requests, which better matched the capabilities of Intel PFS. They
conclude that asynchronous and collective operations are imperative. They
would like to see a file system that can adapt dynamically to adjust its
policies to the apparent access patterns. Automatic request aggregation of
some kind seems like a good idea; of course, that is one feature of a buffer
cache.
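Illustration: A minimal sketch of the manual request aggregation mentioned
above, in which many small sequential writes are collected and issued as
fewer, larger requests. The class name AggregatingWriter and the
request_size parameter are hypothetical; this is not code from the paper or
from Intel PFS, and it assumes append-only sequential writes.

```python
import io


class AggregatingWriter:
    """Collect small sequential writes and issue them as large requests."""

    def __init__(self, raw, request_size):
        self.raw = raw                    # any object with a write(bytes) method
        self.request_size = request_size  # size of each request sent to raw
        self.pending = bytearray()        # bytes not yet written
        self.requests_issued = 0

    def write(self, data):
        self.pending += data
        # Issue full-size requests as soon as enough data has accumulated.
        while len(self.pending) >= self.request_size:
            self._issue(self.request_size)

    def flush(self):
        # Write whatever is left as one final, possibly short, request.
        if self.pending:
            self._issue(len(self.pending))

    def _issue(self, nbytes):
        self.raw.write(bytes(self.pending[:nbytes]))
        del self.pending[:nbytes]
        self.requests_issued += 1
```

For example, wrapping an io.BytesIO with request_size=8 and writing ten
3-byte records results in four requests to the underlying object instead of
ten, with the same bytes in the same order.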
Abstract: High performance servers and high-speed
networks will form the backbone of the infrastructure required for
distributed multimedia information systems. Given that the goal of such a
server is to support hundreds of interactive data streams simultaneously,
various tradeoffs are possible with respect to the storage of data on
secondary memory, and its retrieval therefrom. In this paper we identify and
evaluate these tradeoffs. We evaluate the effect of varying the stripe factor
and also the performance of batched retrieval of disk-resident data. We
develop a methodology to predict the stream capacity of such a server. The
evaluation is done for both uniform and skewed access patterns. Experimental
results on the Intel Paragon computer are presented.
Keyword: verify pages, threads, parallel I/O,
pario-bib
Abstract: Parallel computers are a cost effective
approach to providing significant computational resources to a broad range of
scientific and engineering applications. Due to the relatively lower
performance of the I/O subsystems on these machines and due to the
significant I/O requirements of these applications, the I/O performance can
become a major bottleneck. Optimizing the I/O phase of these applications
poses a significant challenge. A large number of these scientific and
engineering applications perform simple operations on multidimensional arrays
and providing an easy and efficient mechanism for implementing these
operations is important. The Panda array I/O library provides simple high
level interfaces to specify collective I/O operations on multidimensional
arrays in a distributed memory single-program multiple-data (SPMD)
environment. The high level information provided by the user through these
interfaces allows the Panda array I/O library to produce an efficient
implementation of the collective I/O request. The use of these high level
interfaces also increases the portability of the application. This
thesis presents an efficient and portable implementation of the Panda array
I/O library. In this implementation, standard software components are used to
build the I/O library to aid its portability. The implementation also
provides a simple, flexible framework for the implementation and integration
of the various collective I/O strategies. The server directed I/O and the
reduced messages server directed I/O algorithms are implemented in the Panda
array I/O library. This implementation supports the sharing of the I/O
servers between multiple applications by extending the collective I/O
strategies. Also, the implementation supports the use of part time I/O nodes
where certain designated compute nodes act as the I/O servers during the I/O
phase of the application. The performance of this implementation of the Panda
array I/O library is measured on the IBM SP2 and the performance results show
that for read and write operations, the collective I/O strategies used by the
Panda array I/O library achieve throughputs close to the maximum throughputs
provided by the underlying file system on each I/O node of the IBM SP2.
Keyword: parallel I/O, multiprocessor file system,
pario-bib
Abstract: In this paper, we propose a strategy
for implementing parallel-I/O interfaces portably and efficiently. We have
defined an abstract-device interface for parallel I/O, called ADIO. Any
parallel-I/O API can be implemented on multiple file systems by implementing
the API portably on top of ADIO, and implementing only ADIO on different file
systems. This approach simplifies the task of implementing an API and yet
exploits the specific high-performance features of individual file systems.
We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO
and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our
performance studies indicate that the overhead of using ADIO as an
implementation strategy is very low.
Keyword: parallel I/O, multiprocessor file system
interface, pario-bib
Keyword: multiprocessor file system interface,
parallel I/O, pario-bib
Comment: They propose an intermediate interface
that can serve as an implementation base for all parallel file-system APIs,
and which can itself be implemented on top of all parallel file systems. This
"universal" interface allows all apps to run on all file systems with no
porting, and lets people experiment with different APIs.
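Illustration: A minimal sketch of the layering idea described above: an API
is written once against a small abstract device interface, and only that
interface is reimplemented for each storage back end. All names here
(AbstractDevice, MemoryDevice, LocalFileDevice, RecordFile, demo) are
hypothetical and are not the actual ADIO functions.

```python
import abc
import os
import tempfile


class AbstractDevice(abc.ABC):
    """The only layer that must be rewritten for each file system."""

    @abc.abstractmethod
    def write_at(self, offset, data): ...

    @abc.abstractmethod
    def read_at(self, offset, nbytes): ...


class MemoryDevice(AbstractDevice):
    def __init__(self):
        self.buf = bytearray()

    def write_at(self, offset, data):
        end = offset + len(data)
        if len(self.buf) < end:                  # grow with zero bytes
            self.buf.extend(bytes(end - len(self.buf)))
        self.buf[offset:end] = data

    def read_at(self, offset, nbytes):
        return bytes(self.buf[offset:offset + nbytes])


class LocalFileDevice(AbstractDevice):
    def __init__(self, path):
        self.f = open(path, "w+b")

    def write_at(self, offset, data):
        self.f.seek(offset)
        self.f.write(data)

    def read_at(self, offset, nbytes):
        self.f.seek(offset)
        return self.f.read(nbytes)


class RecordFile:
    """A user-level API written once, on top of any AbstractDevice."""

    def __init__(self, device, record_size):
        self.device = device
        self.record_size = record_size

    def put(self, index, record):
        assert len(record) == self.record_size
        self.device.write_at(index * self.record_size, record)

    def get(self, index):
        return self.device.read_at(index * self.record_size, self.record_size)


def demo():
    """Run the same API code against both back ends; return what was read."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        local = LocalFileDevice(os.path.join(tmp, "records.bin"))
        for device in (MemoryDevice(), local):
            rf = RecordFile(device, record_size=4)
            rf.put(2, b"data")
            results.append(rf.get(2))
        local.f.close()
    return results
```

RecordFile contains no back-end-specific code, so supporting a further file
system in this sketch means adding one more AbstractDevice subclass.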
Abstract: This paper presents the results of an
experimental evaluation of the parallel I/O systems of the IBM SP and Intel
Paragon. For the evaluation, we used a full, three-dimensional application
code that is in production use for studying the nonlinear evolution of Jeans
instability in self-gravitating gaseous clouds. The application performs I/O
by using library routines that we developed and optimized separately for
parallel I/O on the SP and Paragon. The I/O routines perform two-phase I/O
and use the PIOFS file system on the SP and PFS on the Paragon. We studied
the I/O performance for two different sizes of the application. We found that
for the small case, I/O was faster on the SP, whereas for the large case, I/O
took almost the same time on both systems. Communication required for I/O was
faster on the Paragon in both cases. The highest read bandwidth obtained was
48 Mbytes/sec. and the highest write bandwidth obtained was 31.6 Mbytes/sec.,
both on the SP.
Keyword: parallel I/O, multiprocessor file system,
pario-bib
Comment: This version no longer on the web.
Abstract: A number of applications on parallel
computers deal with very large data sets that cannot fit in main memory. In
such applications, data must be stored in files on disks and fetched into
memory during program execution. Parallel programs with large out-of-core
arrays stored in files must read/write smaller sections of the arrays from/to
files. In this article, we describe a method for accessing sections of
out-of-core arrays efficiently. Our method, the extended two-phase method,
uses collective I/O: Processors cooperate to combine several I/O requests
into fewer larger granularity requests, reorder requests so that the file is
accessed in proper sequence, and eliminate simultaneous I/O requests for the
same data. In addition, the I/O workload is divided among processors
dynamically, depending on the access requests. We present performance results
obtained from two real out-of-core parallel applications--matrix
multiplication and a Laplace's equation solver--and several synthetic access
patterns, all on the Intel Touchstone Delta. These results indicate that the
extended two-phase method significantly outperformed a direct (noncollective)
method for accessing out-of-core array sections.
Keyword: parallel I/O, pario-bib
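Illustration: A simplified, single-process simulation of a two-phase
collective read, compared with a direct read. It is not the paper's extended
two-phase algorithm: the even, static split of the requested range, the
one-request-per-element cost of the direct method, and the function names are
assumptions made for this example. Every process is assumed to request at
least one element.

```python
def direct_read(file_data, wanted):
    """Each process fetches each wanted element with its own small request."""
    requests = 0
    result = []
    for indices in wanted:
        result.append([file_data[i] for i in indices])
        requests += len(indices)   # assumes no two wanted elements are adjacent
    return result, requests


def two_phase_read(file_data, wanted):
    """Phase 1: large contiguous reads. Phase 2: redistribute in memory."""
    nprocs = len(wanted)
    # Only the range of the file that some process asked for is read.
    lo = min(min(indices) for indices in wanted)
    hi = max(max(indices) for indices in wanted) + 1
    chunk = -(-(hi - lo) // nprocs)            # ceiling division

    # Phase 1: process p reads one contiguous chunk of the requested range.
    staged = [file_data[lo + p * chunk:min(lo + (p + 1) * chunk, hi)]
              for p in range(nprocs)]
    requests = sum(1 for s in staged if s)

    # Phase 2: each element moves from the process that read it to the
    # process that wants it (list indexing stands in for message passing).
    result = [[staged[(i - lo) // chunk][(i - lo) % chunk] for i in indices]
              for indices in wanted]
    return result, requests
```

With 16 elements and 4 processes that each want every fourth element, the
direct version issues 16 small requests and the two-phase version issues 4
contiguous ones, while both return the same data.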
Keyword: parallel I/O, pario-bib
Comment: See thakur:passion, choudhary:passion.
Abstract: The Panda Array I/O library, created at
the University of Illinois, Urbana-Champaign, was built especially to address
the needs of high-performance scientific applications. I/O has been one of
the most frustrating bottlenecks to high performance for quite some time, and
the Panda project is an attempt to ameliorate this problem while still
providing the user with a simple, high-level interface. The Galley File
System, with its hierarchical structure of files and strided requests, is
another attempt at addressing the performance problem. My project was to
redesign