Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Describes a parallel IBM/370, where they
attach several small 370s to a switch, and several disks to each 370. Not
much in the way of striping.
Abstract: As computation and communication
hardware performance continue to rapidly increase, I/O represents a growing
fraction of application execution time. This gap between the I/O subsystem
and others is expected to increase in future since I/O performance is limited
by physical motion. Therefore, it is imperative that novel techniques for
improveing I/O performance be developed. Parallel I/O is a promising approach
to alleviating this bottleneck. However, very little work exist with respect
to scheduling parallel I/O operations explicitly. In this paper, we address
the problem of effective management of parallel I/O in cluster computing
systems by using appropriate I/O scheduling strategies. We propose two new
I/O scheduling algorithms and compare them with two existing scheduling
Approaches. The preliminary results show that the proposed policies
outperform existing policies substantially.
Keywords: parallel I/O, I/O scheduling algorithms,
pario-bib
Keywords: parallel I/O, out-of-core algorithm,
pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Keywords: out-of-core algorithm, graph, pario-bib
Keywords: parallel I/O, I/O, pario-bib
Comment: Derives formulas for the speedup with and
without I/O considered and with parallel software and hardware format
conversion. Considering I/O gives a more optimistic view of the speedup of a
program assuming that the parallel version can use its I/O bandwidth as
effectively as the serial processor. Concludes that, for a given number of
processors, increasing the I/O bandwidth is the most effective way to speed
up the program (over the format conversion improvements).
Abstract: Getting good I/O performance from
parallel programs is a critical problem for many application domains. In this
paper, we report our experience tuning the I/O performance of four
application programs from the areas of satellite-data processing and linear
algebra. After tuning, three of the four applications achieve
application-level I/O rates of over 100 MB/s on 16 processors. The total
volume of I/O required by the programs ranged from about 75 MB to over 200
GB. We report the lessons learned in achieving high I/O performance from
these applications, including the need for code restructuring, local disks on
every node and knowledge of future I/O requests. We also report our
experience on achieving high performance on peer-to-peer configurations.
Finally, we comment on the necessity of complex I/O interfaces like
collective I/O and strided requests to achieve high performance.
Keywords: parallel I/O, filesystem workload,
parallel application, pario-bib
Abstract: We provide tight upper and lower bounds,
up to a constant factor, for the number of inputs and outputs (I/Os) between
internal memory and secondary storage required for five sorting-related
problems: sorting, the fast Fourier transform (FFT), permutation networks,
permuting, and matrix transposition. The bounds hold both in the worst case
and in the average case, and in several situations the constant factors
match. Secondary storage is modeled as a magnetic disk capable of transfering
$P$ blocks each containing $B$ records in a single time unit; the records in
each block must be input from or output to $B$ contiguous locations on the
disk. We give two optimal algorithms for the problems, which are variants of
merge sorting and distribution sorting. In particular we show for $P=1$ that
the standard merge sorting algorithm is an optimal external sorting method,
up to a constant factor in the number of I/Os. Our sorting algorithms use the
same number of I/Os as does the permutation phase of key sorting, except when
the internal memory size is extremely small, thus affirming the popular adage
that key sorting is not faster. We also give a simpler and more direct
derivation of Hong and Kung's lower bound for the FFT for the special case $B
= P = O(1)$.
Keywords: parallel I/O, sorting, pario-bib
Comment: Good comments on typical external sorts,
and how big they are. Focuses on parallelism at the disk. They give tight
theoretical bounds on the number of I/O's required to do external sorting and
other problems (FFTs, matrix transpose, etc.). If $x$ is the number of blocks
in the file and $y$ is the number of blocks that fit in memory, then the
number of I/Os needed grows as $Θ (x \log x / \log y)$. If parallel
transfers of $p$ blocks are allowed, speedup linear in $p$ is obtained.
Keywords: compiler, I/O, pario-bib
Comment: Not really about parallel applications or
parallel I/O, but I think it may be of interest to that community. They
propose a compiler framework for a compiler to insert asynchronous I/O
operations (start I/O, finish I/O), to satisfy the dependency constraints of
the program.
Abstract: The motivation for the research
presented here is to develop an approach for scheduling I/O operations in
distributed/parallel computer systems. First, a general model for specifying
the parallel I/O scheduling problem is developed. The model defines the I/O
bandwidth for different parallel/distributed architectures. Then the model is
used to establish an algorithm for scheduling I/O operations on these
architectures.
Keywords: parallel I/O, scheduling, pario-bib
Abstract: This paper describes the performance
improvements achieved by a data-intensive application by controlling the
storage policies and algorithms of a distributed storage system. The Network
Storage Manager (NSM) is a mass distributed storage framework with a unique
architecture that provides applications with the high-performance features
they need. It also provides the standard most commonly used implementation
for storage policies. Distributed Terrain Viewer (DTViewer) is an application
that utilizes NSM architecture and for efficient and reliable data delivery.
Moreover, it exploits NSM controllable architecture by plugging-in its
application-specific optimized implementations. DTViewer overrides the
default NSM policies that do not understand its sophisticated access
patterns, partitioning, and storage layout requirements. Experimental results
have show significant improvement achieved when the application-tailored
implementation are used. Such speedups are not achievable on storage systems
with no application control such as the Parallel Virtual File System PVFS.
(44 Refs.)
Keywords: application-specific storage policies,
pario-app, DTViewer, access patterns, data layout, pario-bib
Abstract: An emerging class of data-intensive
applications involve the geographically dispersed extraction of complex
scientific information from very large collections of measured or computed
data. Such applications arise, for example, in experimental physics, where
the data in question is generated by accelerators, and in simulation science,
where the data is generated by supercomputers. So-called Data Grids provide
essential infrastructure for such applications, much as the Internet provides
essential services for applications such as e-mail and the Web. We describe
here two services that we believe are fundamental to any Data Grid: reliable,
high-speed transport and replica management. Our high-speed transport
service, GridFTP, extends the popular FTP protocol with new features required
for Data Grid applications, such as striping and partial file access. Our
replica management service integrates a replica catalog with GridFTP
transfers to provide for the creation, registration, location, and management
of dataset replicas. We present the design of both services and also
preliminary performance results. Our implementations exploit security and
other services provided by the Globus Toolkit.
Keywords: computational grid, data transfer,
network, I/O, pario-bib
Abstract: We are developing a system for
collaborative research and development for a distributed group of researchers
at different institutions around the world. In a new paradigm for
collaborative computational science, the computer code and supporting
infrastructure itself becomes the collaborating instrument, just as an
accelerator becomes the collaborating tool for large numbers of distributed
researchers in particle physics. The design of this "Collaboratory" allows
many users, with very different areas of expertise, to work coherently
together, on distributed computers around the world. Different supercomputers
may be used separately, or for problems exceeding the capacity of any single
system, multiple supercomputers may be networked together through high speed
gigabit networks. Central to this Collaboratory is a new type of community
simulation code, called "Cactus". The scientific driving force behind this
project is the simulation of Einstein's equations for studying black holes,
gravitational waves, and neutron stars, which has brought together
researchers in very different fields from many groups around the world to
make advances in the study of relativity and astrophysics. But the system is
also being developed to provide scientists and engineers, without expert
knowledge of parallel or distributed computing, mesh refinement, and so on,
with a simple framework for solving any system of partial differential
equations on many parallel computer systems, from traditional supercomputers
to networks of workstations.
Keywords: scientific application, grid,
input/output, parallel-io, pario-bib
Comment: invited talk. They describe a
computational toolkit (CACTUS) that allows developers to construct code
modules (thorns) to plug into the core system (cactus flesh). The toolkit
includes thorns for solving partial differential equations using MPI,
parallel elliptic solvers, thorns for I/O using FlexIO or HDF5, and thorns
for checkpointing. The talk showed results from a cactus code demo that ran
at SC'98. The demo combined two tightly-connected supercomputers (one in
Europe and one in America) using Globus to simulate the collision of two
neutron stars.
Keywords: fault tolerance, RAID, disk array,
parallel I/O, pario-bib
Abstract: Enterprise-scale storage systems, which
can contain hundreds of host computers and storage devices and up to tens of
thousands of disks and logical volumes, are difficult to design. The volume
of choices that need to be made is massive, and many choices have unforeseen
interactions. Storage system design is tedious and complicated to do by hand,
usually leading to solutions that are grossly over-provisioned, substantially
under-performing or, in the worst case, both.To solve the configuration
nightmare, we present minerva: a suite of tools for designing storage systems
automatically. Minerva uses declarative specifications of application
requirements and device capabilities; constraint-based formulations of the
various sub-problems; and optimization techniques to explore the search space
of possible solutions.This paper also explores and evaluates the design
decisions that went into Minerva, using specialized micro- and
macro-benchmarks. We show that Minerva can successfully handle a workload
with substantial complexity (a decision-support database benchmark). Minerva
created a 16-disk design in only a few minutes that achieved the same
performance as a 30-disk system manually designed by human experts. Of equal
importance, Minerva was able to predict the resulting system's performance
before it was built.
Keywords: disk array, storage system, RAID,
automatic design, parallel I/O, pario-bib
Keywords: parallel architecture, MIMD, NUMA,
pario-bib
Comment: Interesting architecture. 3-d mesh of
pipelined packet-switch nodes, e.g., 16x16x16 is 4096 nodes, with 256 procs,
512 memory units, 256 I/O cache units, and 256 I/O processors attached. 2816
remaining nodes are just switching nodes. Each processor is 64-bit custom
chip with up to 128 simultaneous threads in execution. It alternates between
ready threads, with a deep pipeline. Inter-instruction dependencies
explicitly encoded by the compiler, stalling those threads until the
appropriate time. Each thread has a complete set of registers! Memory units
have 4-bit tags on each word, for full/empty and trap bits. Shared memory
across the network: ``The Tera ISP-level architecture is UMA, even though the
PMS-level architecture is NUMA. Put another way, the memory looks a single
cycle away to the compiler writer.'' - Burton Smith. See also tera:brochure.
Keywords: file caching, distributed file system,
pario-bib
Comment: Part of jin:io-book; reformatted version
of anderson:serverless.
In this paper, we demonstrate
the need for timing accuracy for I/O benchmarking in the context of replaying
I/O traces. We also quantitatively characterize the impact of error in
issuing I/Os on measured system parameters. For instance, we show that the
error in perceived I/O response times can be as much as +350% or -15% by
using naive benchmarking tools that have timing inaccuracies. To address this
problem, we present Buttress, a portable and flexible toolkit that can
generate I/O workloads with microsecond accuracy at the I/O throughputs of
high-end enterprise storage arrays. In particular, Buttress can issue I/O
requests within 100µs of the desired issue time even at rates of 10000 I/Os
per second (IOPS). Abstract: In benchmarking I/O systems, it is
important to generate, accurately, the I/O access pattern that one is
intending to generate. However, timing accuracy ( issuing I/Os at the desired
time) at high I/O rates is difficult to achieve on stock operating systems.
We currently lack tools to easily and accurately generate complex I/O
workloads on modern storage systems. As a result, we may be introducing
substantial errors in observed system metrics when we benchmark I/O systems
using inaccurate tools for replaying traces or for producing synthetic
workloads with known inter-arrival times.
Keywords: benchmarking software, performance
analysis, I/O access patterns, I/O workloads, pario-bib
Comment: Looks like a really cool piece of
software. Generates I/O workloads by replaying I/O traces.
Abstract: Disk arrays have a myriad of
configuration parameters that interact in counter-intuitive ways, and those
interactions can have significant impacts on cost, performance, and
reliability. Even after values for these parameters have been chosen, there
are exponentially-many ways to map data onto the disk arrays' logical units.
Meanwhile, the importance of correct choices is increasing: storage systems
represent an growing fraction of total system cost, they need to respond more
rapidly to changing needs, and there is less and less tolerance for mistakes.
We believe that automatic design and configuration of storage systems is the
only viable solution to these issues. To that end, we present a comparative
study of a range of techniques for programmatically choosing the RAID levels
to use in a disk array. Our simplest approaches are modeled on existing,
manual rules of thumb: they "tag" data with a RAID level before determining
the configuration of the array to which it is assigned. Our best approach
simultaneously determines the RAID levels for the data, the array
configuration, and the layout of data on that array. It operates as an
optimization process with the twin goals of minimizing array cost while
ensuring that storage workload performance requirements will be met. This
approach produces robust solutions with an average cost/performance
14-17{PCT} better than the best results for the tagging schemes, and up to
150-200{PCT} better than their worst solutions. We believe that this is the
first presentation and systematic analysis of a variety of novel,
fully-automatic RAID-level selection techniques.
Keywords: file systems, pario-bib
Keywords: file caching, distributed file system,
pario-bib
Comment: See anderson:serverless-sosp.
Recent parallel file-system usage
studies show that writes to write-only files are a dominant part of the
workload. Therefore, optimizing writes could have a significant impact on
overall performance. In this paper, we propose ENWRICH, a compute-processor
write-caching scheme for write-only files in parallel file systems. ENWRICH
combines low-overhead write caching at the compute processors with high
performance disk-directed I/O at the I/O processors to achieve both low
latency and high bandwidth. This combination facilitates the use of the
powerful disk-directed I/O technique independent of any particular choice of
interface. By collecting writes over many files and applications, ENWRICH
lets the I/O processors optimize disk I/O over a large pool of requests. We
evaluate our design via simulated implementation and show that ENWRICH
achieves high performance for various configurations and workloads.
Abstract: Many parallel scientific applications
need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance
has not been able to keep up with substantial improvements in parallel-I/O
hardware because of poor parallel file-system software. Many radical changes,
both at the interface level and the implementation level, have recently been
proposed. One such proposed interface is collective I/O, which allows
parallel jobs to request transfer of large contiguous objects in a single
request, thereby preserving useful semantic information that would otherwise
be lost if the transfer were expressed as per-processor non-contiguous
requests. Kotz has proposed disk-directed I/O as an efficient
implementation technique for collective-I/O operations, where the compute
processors make a single collective data-transfer request, and the I/O
processors thereafter take full control of the actual data transfer,
exploiting their detailed knowledge of the disk-layout to attain
substantially improved performance.
Keywords: parallel file system, parallel I/O,
caching, pario-bib, dfk
Recent parallel file-system usage
studies show that writes to write-only files are a dominant part of the
workload. Therefore, optimizing writes could have a significant impact on
overall performance. In this paper, we propose ENWRICH, a compute-processor
write-caching scheme for write-only files in parallel file systems. ENWRICH
combines low-overhead write caching at the compute processors with high
performance disk-directed I/O at the I/O processors to achieve both low
latency and high bandwidth. This combination facilitates the use of the
powerful disk-directed I/O technique independent of any particular choice of
interface. By collecting writes over many files and applications, ENWRICH
lets the I/O processors optimize disk I/O over a large pool of requests. We
evaluate our design via simulated implementation and show that ENWRICH
achieves high performance for various configurations and workloads.
Abstract: Many parallel scientific applications
need high-performance I/O. Unfortunately, end-to-end parallel-I/O performance
has not been able to keep up with substantial improvements in parallel-I/O
hardware because of poor parallel file-system software. Many radical changes,
both at the interface level and the implementation level, have recently been
proposed. One such proposed interface is collective I/O, which allows
parallel jobs to request transfer of large contiguous objects in a single
request, thereby preserving useful semantic information that would otherwise
be lost if the transfer were expressed as per-processor non-contiguous
requests. Kotz has proposed disk-directed I/O as an efficient
implementation technique for collective-I/O operations, where the compute
processors make a single collective data-transfer request, and the I/O
processors thereafter take full control of the actual data transfer,
exploiting their detailed knowledge of the disk-layout to attain
substantially improved performance.
Keywords: parallel file system, parallel I/O,
caching, pario-bib, dfk
Abstract: High-performance parallel file systems
are needed to satisfy tremendous I/O requirements of parallel scientific
applications. The design of such parallel file systems depends on a
comprehensive understanding of the expected workload, but so far there have
been very few usage studies of multiprocessor file systems. In the first part
of this dissertation, we attempt to fill this void by measuring a real
file-system workload on a production parallel machine, namely the CM-5 at the
National Center for Supercomputing Applications. We collect information about
nearly every individual I/O request from the mix of jobs running on the
machine. Analysis of the traces leads to various recommendations for design
of future parallel file systems. Our usage study showed that writes to
write-only files are a dominant part of the workload. Therefore, optimizing
writes could have a significant impact on overall performance. In the second
part of this dissertation, we propose ENWRICH, a compute-processor
write-caching scheme for write-only files in parallel file systems. Within
its framework, ENWRICH uses a recently proposed high performance
implementation of collective I/O operations called disk-directed I/O, but it
eliminates a number of limitations of disk-directed I/O. ENWRICH combines
low-overhead write caching at the compute processors with high performance
disk-directed I/O at the I/O processors to achieve both low latency and high
bandwidth. This combination facilitates the use of the powerful disk-directed
I/O technique independent of any particular choice of interface, and without
the requirement for mapping libraries at the I/O processors. By collecting
writes over many files and applications, ENWRICH lets the I/O processors
optimize disk I/O over a large pool of requests. We evaluate our design of
ENWRICH using simulated implementation and extensive experimentation. We show
that ENWRICH achieves high performance for various configurations and
workloads. We pinpoint the reasons for ENWRICH`s failure to perform well for
certain workloads, and suggest possible enhancements. Finally, we discuss the
nuances of implementing ENWRICH on a real platform and speculate about
possible adaptations of ENWRICH for emerging multiprocessing platforms.
Keywords: parallel I/O, multiprocessor file
system, file access patterns, workload characterization, file caching,
disk-directed I/O, pario-bib
Comment: See also ap:enwrich, ap:workload, and
nieuwejaar:workload
Abstract: High-performance parallel file systems
are needed to satisfy tremendous I/O requirements of parallel scientific
applications. The design of such high-performance parallel file systems
depends on a comprehensive understanding of the expected workload, but so far
there have been very few usage studies of multiprocessor file systems. This
paper is part of the CHARISMA project, which intends to fill this void by
measuring real file-system workloads on various production parallel machines.
In particular, here we present results from the CM-5 at the National Center
for Supercomputing Applications. Our results are unique because we collect
information about nearly every individual I/O request from the mix of jobs
running on the machine. Analysis of the traces leads to various
recommendations for parallel file-system design.
Keywords: parallel I/O, file access pattern,
multiprocessor file system, file system workload, dfk, pario-bib
Comment: See also kotz:workload,
nieuwejaar:strided.
Design of such
high-performance parallel file systems depends on a thorough grasp of the
expected workload. So far there have been no comprehensive usage studies of
multiprocessor file systems. Our CHARISMA project intends to fill this void.
The first results from our study involve an iPSC/860 at NASA Ames. This paper
presents results from a different platform, the CM-5 at the National Center
for Supercomputing Applications. The CHARISMA studies are unique because we
collect information about every individual read and write request and about
the entire mix of applications running on the machines. The results of
our trace analysis lead to recommendations for parallel file system design.
First, the file system should support efficient concurrent access to many
files, and I/O requests from many jobs under varying load condit ions.
Second, it must efficiently manage large files kept open for long periods.
Third, it should expect to see small requests, predominantly sequential
access patterns, application-wide synchronous access, no concurrent
file-sharing between jobs, appreciable byte and block sharing between
processes within jobs, and strong interprocess locality. Finally, the trace
data suggest that node-level write caches and collective I/O request
interfaces may be useful in certain environments. Abstract: Rapid increases in the computational
speeds of multiprocessors have not been matched by corresponding performance
enhancements in the I/O subsystem. To satisfy the large and growing I/O
requirements of some parallel scientific applications, we need parallel file
systems that can provide high-bandwidth and high-volume data transfer between
the I/O subsystem and thousands of processors.
Keywords: parallel I/O, file access pattern,
multiprocessor file system, file system workload, dfk, pario-bib
Comment: See also kotz:workload,
nieuwejaar:strided.
Keywords: parallel file system, parallel I/O,
Intel iPSC/2, pario-bib
Comment: Studies the performance of Intel CFS.
Uses an application that reads in a huge file of records, each a genome
sequence, and compares each sequence against a given sequence. Looks at cache
performance, message latency, cost of prefetches and directory reads, and
throughput. He tries one-disk, one-proc transfer rates. Because of contention
with the directory server on one of the two I/O nodes, it was faster to put
all of the file on the other I/O node. Striping is good for multiple readers.
Best access pattern was interleaved, not segmented or separate files, because
it avoided disk seeks. Perhaps the files are stored contiguously? Can get
good speedup by reading the sequences in big integral record sizes, from CFS,
using a load-balancing scheduled by the host. Contention for directory blocks
- through single-node directory server.
Abstract: The paper presents a survey of the basic
paradigms for designing efficient external-memory algorithms and especially
for designing external-memory algorithms for computational geometry problems
with applications in GIS. As the area of external-memory algorithms is
relatively young the paper focuses on fundamental external-memory design
techniques more than on algorithms for specific GIS problems. The
presentation is survey-like with a more detailed discussion of the most
important techniques and algorithms.
Keywords: out-of-core algorithm, geographic
information system, GIS, pario-bib
Comment: not parallel? but mentions some parallel
disk stuff.
In this paper we
develop efficient new external-memory algorithms for a number of important
problems involving line segments in the plane, including trapezoid
decomposition, batched planar point location, triangulation, red-blue line
segment intersection reporting, and general line segment intersection
reporting. In GIS systems, the first three problems are useful for rendering
and modeling, and the latter two are frequently used for overlaying maps and
extracting information from them. To solve these problems, we combine and
modify in novel ways several of the previously known techniques for designing
efficient algorithms for external memory. We also develop a powerful new
technique that can be regarded as a practical external memory version of
fractional cascading. Except for the batched planar point location problem,
no algorithms specifically designed for external memory were previously known
for these problems. Our algorithms for triangulation and line segment
intersection partially answer previously posed open problems, while the
batched planar point location algorithm improves on the previously known
solution, which applied only to monotone decompositions. Our algorithm for
the red-blue line segment intersection problem is provably optimal.
Abstract: We present a set of algorithms designed
to solve large-scale geometric problems involving collections of line
segments in the plane. Geographical information systems (GIS) handle large
amounts of spatial data, and at some level the data is often manipulated as
collections of line segments. NASA's EOS project is an example of a GIS that
is expected to store and manipulate petabytes (thousands of terabytes, or
millions of gigabytes) of data! In the design of algorithms for this type of
large-scale application, it is essential to consider the problem of
minimizing I/O communication, which is the bottleneck.
Keywords: verify, out-of-core algorithm,
computational geometry, pario-bib
Comment: Special issue on cartography and
geographic information systems.
Keywords: out-of-core algorithm, computational
geometry, pario-bib
Comment: See also the component papers
vitter:survey, arge:lower, crauser:segment, grossi:crosstrees, toledo:survey.
Not clear to what extent these papers are about *parallel* I/O.
Abstract: In the design of algorithms for
large-scale applications it is essential to consider the problem of
minimizing I/O communication. Geographical information systems (GIS) are good
examples of such large-scale applications as they frequently handle huge
amounts of spatial data. In this paper we develop efficient new
external-memory algorithms for a number of important problems involving line
segments in the plane, including trapezoid decomposition, batched planar
point location, triangulation, red-blue line segment intersection reporting,
and general line segment intersection reporting. In GIS systems, the first
three problems are useful for rendering and modeling, and the latter two are
frequently used for overlaying maps and extracting information from them.
Keywords: out-of-core algorithm, computational
geometry, pario-bib
Comment: Does deal with parallel disks, though not
in great detail.
Abstract: We investigate the I/O complexity of the
problem of sorting sequences (or strings of characters) in external memory,
which is a fundamental component of many large-scale text applications. In
the standard unit-cost RAM comparison model, the complexity of sorting K
strings of total length N is Theta (K log/sub 2/ K+N). By analogy, in the
external memory (or I/O) model, where the internal memory has size M and the
block transfer size is B, it would be natural to guess that the I/O
complexity of sorting sequences is Theta ((K/B)log/sub M/B/(K/B)+(N/B)), but
the known algorithms do not come even close to achieving this bound. Our
results show, somewhat counterintuitively, that the I/O complexity of string
sorting depends upon the length of the strings relative to the block size. We
first consider a simple comparison I/O model, where the strings are not
allowed to be broken into their individual characters, and we show that the
I/O complexity of string sorting in this model is Theta ((N/sub 1//B)log/sub
M/B/(N/sub 1//B)+K/sub 2/+(N/B)), where N/sub 1/ is the total length of all
strings shorter than B and K/sub 2/ is the number of strings longer than B.
We then consider two more general I/O comparison models in which string
breaking is allowed. We obtain improved algorithms and in several cases lower
bounds that match their I/O bounds. Finally, we develop more practical
algorithms outside the comparison model.
Keywords: out-of-core algorithm, sorting
algorithm, pario-bib
Comment: This paper is really the same paper as
arge:sorting-strings.
Abstract: In this paper we address for the first
time the I/O complexity of the problem of sorting strings in external memory,
which is a fundamental component of many large-scale text applications. In
the standard unit-cost RAM comparison model, the complexity of sorting K
strings of total length N is theta(K log K + N). By analogy, in the external
memory (or I/O) model, where the internal memory has size M and the block
transfer size is B, it would be natural to guess that the I/O complexity of
sorting strings is $θ(K/B log_(M/B) (K/B) + N/B)$, but the known
algorithms do not come even close to achieving this bound. Our results show,
somewhat counterintuitively, that the I/O complexity of string sorting
depends upon the length of the strings relative to the block size. We first
consider a simple comparison I/O model, where one is not allowed to break the
strings into their characters, and we show that the I/O complexity of string
sorting in this model is $θ(N_1/B log_(M/B) (N_1/B) + K_2 log_(M/B) K_2
+ N/B)$, where $N_1$ is the total length of all strings shorter than B and
$K_2$ is the number of strings longer than B. We then consider two more
general I/O comparison models in which string breaking is allowed. We obtain
improved algorithms and in several cases lower bounds that match their I/O
bounds. Finally, we develop more practical algorithms without assuming the
comparison model.
Keywords: out-of-core algorithm, sorting, parallel
I/O, pario-bib
Comment: Not parallel? But mentions some parallel
disk stuff.
In this paper we provide evidence that the single-disk model
is strictly more powerful. We prove a lower bound on any general simulation
of the single-disk model on the multi-disk model and establish randomized and
deterministic upper bounds. Let $N$ be the problem size and let $T$ be the
number of parallel I/Os required by a program on the single-disk model. Then
any simulation of this program on the multi-disk model will require
$Ω\left(T \frac{\log(N/D)}{\log \log(N/D)}\right)$ parallel I/Os. This
lower bound holds even if replication is allowed in the multi-disk model. We
also show an $O\left(\frac{\log D}{\log \log D}\right)$ randomized upper
bound and an $O\left(\log D (\log \log D)^2\right)$ deterministic upper
bound. These results exploit an interesting analogy between the disk models
and the PRAM and DCM models of parallel computation. Abstract: The single-disk, D-head model of
parallel I/O was introduced by Agarwal and Vitter to analyze algorithms for
problem instances that are too large to fit in primary memory. Subsequently
Vitter and Shriver proposed a more realistic model in which the disk space is
partitioned into D disks, with a single head per disk. To date, each problem
for which there is a known optimal algorithm for both models has the same
asymptotic bounds on both models. Therefore, it has been unknown whether the
models are equivalent or whether the single-disk model is strictly more
powerful.
Keywords: parallel I/O, theory, parallel I/O
algorithm, pario-bib
Keywords: distributed query processing, dataflow,
pario-bib
Comment: River is a dataflow programming
environment for database query processing applications. River is specifically
designed for clusters of computers with heterogeneous performance
characteristics. The goal of the River runtime system is to adapt to
"performance faults"-portions of the system that perform poorly by
dynamically adjusting the transfer of data through the dataflow graph. River
uses two constructs to build applications: a distributed queue that deals
with performance faults by consumers, and graduated declustering that deals
with performance faults of producers. A distributed queue pushes data through
the dataflow graph at a rate proportional to the rate of consumption and
adapts to changes in consumption rates. Graduated declustering deals with
producer performance faults by reading from replicated producers. Although
River is designed specifically for query processing, they briefly discuss how
one might adapt scientific applications to work in their framework.
Abstract: We introduce River, a data-flow
programming environment and I/O substrate for clusters of computers. River is
designed to provide maximum performance in the common case- even in the
face of non-uniformities in hardware, software, and workload. River is based
on two simple design features: a high-performance distributed queue,and a
storage redundancy mechanism called graduated declustering.We have
implemented a number of data-intensive applications on River, which validate
our design with near-ideal performance in a variety of non-uniform
performance scenarios.
Keywords: cluster computing, parallel I/O,
pario-bib
Keywords: parallel I/O, prefetching, parallel file
system, pario-bib
Comment: A related paper is arunachalam:prefetch2.
Abstract: The significant difference between the
speeds of the I/O system (e.g., disks) and compute processors in parallel
systems creates a bottleneck that lowers the performance of an application
that does a considerable amount of disk accesses. A major portion of the
compute processors' time is wasted on waiting for I/O to complete. This
problem can be addressed to a certain extent, if the necessary data can be
fetched from the disk before the I/O call to the disk is issued. Fetching
data ahead of time, known as prefetching in a multiprocessor environment
depends a great deal on the application's access pattern. The subject of this
paper is implementation and performance evaluation of a prefetching prototype
in a production parallel file system on the Intel Paragon. Specifically, this
paper presents a) design and implementation of a prefetching strategy in the
parallel file system and b) performance measurements and evaluation of the
file system with and without prefetching. The prototype is designed at the
operating system level for the PFS. It is implemented in the PFS subsystem of
the Intel Paragon Operating System. It is observed that in many cases
prefetching provides considerable performance improvements. In some other
cases no improvements or some performance degradation is observed due to the
overheads incurred in prefetching.
Keywords: parallel I/O, prefetching,
multiprocessor file system, pario-bib
Comment: See arunachalam:prefetch.
Keywords: parallel I/O, disk array, RAID,
pario-bib
Comment: Part of jin:io-book; reformatted version
of asami:self.
Abstract: This paper shows the suitability of a
self-maintaining approach to Tertiary Disk, a large-scale disk array system
built from commodity components. Instead of incurring the cost of custom
hardware, we attempt to solve various problems by design and software. We
have built a cluster of storage nodes connected by switched Ethernet. Each
storage node is a PC hosting a few dozen SCSI disks, running the FreeBSD
operating system. The system is used as a web-based image server for the Zoom
Project in cooperation with the Fine Arts Museums of San Francisco
(http://www.thinker.org/). We are designing self-maintenance extension to the
OS to run on this cluster to mitigate the system administrator's burden.
There are several components required for building self-maintaining system.
One is decoupling the time failure from the time of hardware replacement.
This implies the system must have some amount of redundancy, and has no
single point of failure. Our system is fully redundant, and everything is
constructed to avoid a single point of failure. Another is correctly
identifying failures and their dependencies. The paper also outlines several
approaches to lower the human cost of system administration of such a system
and making the system as autonomous as possible.
Keywords: parallel I/O, disk array, RAID,
pario-bib
Keywords: parallel I/O, hypercube, Intel iPSC/2,
file access pattern, pario-bib
Keywords: parallel I/O, architecture, pario-bib
Comment: They describe an I/O subsystem based on
an ``active memory'' called SWIM (Structured Wafer-based Intelligent Memory).
SWIM chips are RAM chips with some built-in processing. The idea is that
these tiny processors can manipulate the data in the chip at full speed,
without dealing with memory bus or off-chip costs. Further, the chips can
work in parallel. They demonstrate how they've used this to build a national
phone database server, a high-performance IP router, and a call-screening
agent.
Abstract: We describe an I/O subsystem based on an
active memory named SWIM (Structured Wafer-based Intelligent Memory) designed
for efficient storage and manipulation of data structures. The key
architectural idea in SWIM is to associate some processing logic with each
memory chip that allows it to perform data manipulation operations locally
and to communicate with a disk or a communication line through a backend
port. The processing logic is specially designed to perform operations such
as pointer dereferencing, memory indirection, searching and bounds checking
efficiently. The I/O subsystem is built using an interconnected ensemble of
such memory logic pairs. A complex processing task can now be distributed
between a large number of small memory processors each doing a sub-task,
while still retaining a common locus of control in the host CPU for higher
level administrative and provisioning functions. We argue that active memory
based processing enables more powerful, scalable and robust designs for
storage and communications subsystems, that can support emerging network
services, multimedia workstations and wireless PCS systems. A complete
parallel hardware and software system constructed using an array of SWIM
elements has been operational for over a year. We present results from
application of SWIM to three network functions: a national phone database
server, a high performance IP router, and a call screening agent.
Keywords: parallel I/O architecture, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: parallel I/O, pario-bib
Comment: They describe using the techniques of
delrosario and debenedictis (although without mentioning them) to provide for
channels (parallel pipes) between independent data-parallel tasks. The
technique really is the same as in debenedictus and delrosario, although they
extend it a bit to allow multiple "files" within a channel (why not use
multiple channels)? Also, they depend on the program to read and write
synchronization variables to control access to the flow of data through the
channel. While this may provide good performance in some cases, why not have
support for automatic flow control? The system can detect when a portion of
the channel is written, and release readers waiting on that portion of the
channel (if any). The paper is a bit confusing in its use of the word "file",
which seems to be used to mean different things at different points. Also,
they seem to use an arbitrary distribution for the "file", which may or may
not be the same as one of those used by the two endpoints.
Abstract: Access to shared data is critical to the
long term success of grids of distributed systems. As more parallel
applications are being used on these grids, the need for some kind of
parallel I/O facility across distributed systems increases. However, grid
middleware has thus far had only limited support for distributed parallel
I/O. In this paper, we present an implementation of the MPI-2 I/O interface
using the Globus GridFTP client API. MPI is widely used for parallel
computing, and its I/O interface maps onto a large variety of storage
systems. The limitations of using GridFTP as an MPI-I/O transport mechanism
are described, as well as support for parallel access to scientific data
formats such as HDF and NetCDF. We compare the performance of GridFTP to that
of NFS on the same network using several parallel I/O benchmarks. Our tests
indicate that GridFTP can be a workable transport for parallel I/O,
particularly for distributed read-only access to shared data sets. (26 refs.)
Keywords: grid I/O, MPI-I/O, grid middleware,
gridFTP, pario-bib
Keywords: parallel I/O, pario-bib, prefetching,
caching, multiprocessor file system, file access pattern
Comment: Basically there are two parts to this
paper. First, they will instrument applications, Intel PFS, and IBM Vesta, to
trace I/O-related activity. Then they'll use Pablo to analyze and
characterize. They plan to trace some events in detail, and the rest with
histogram counters. Second, they plan to develop caching and prefetching
policies and to analyze those with simulation, analysis, and implementation.
They note that IBM and Intel are developing parallel I/O architecture
simulators. See also poole:sio-survey, choudhary:sio-language,
bershad:sio-os.
Abstract: RAID storage arrays often possess
gigabytes of RAM for caching disk blocks. Currently, most RAID systems use
LRU or LRU-like policies to manage these caches. Since these array caches do
not recognize the presence of file system buffer caches, they redundantly
retain many of the same blocks as those cached by the file system, thereby
wasting precious cache space. In this paper, we introduce X-RAY, an exclusive
RAID array caching mechanism. X-RAY achieves a high degree of (but not
perfect) exclusivity through gray-box methods: by observing which files have
been accessed through updates to file system meta-data, X-RAY constructs an
approximate image of the contents of the file system cache and uses that
information to determine the exclusive set of blocks that should be cached by
the array. We use microbenchmarks to demonstrate that X-RAY's prediction of
the file system buffer cache contents is highly accurate, and trace-based
simulation to show that X-RAY considerably outperforms LRU and performs as
well as other more invasive approaches. The main strength of the X-RAY
approach is that it is easy to deploy - all performance gains are achieved
without changes to the SCSI protocol or the file system above.
Keywords: RAID, x-ray, caching policies, pario-bib
Keywords: parallel I/O, distributed file system,
mass storage, pario-bib
Comment: Architecture for distributed information
storage. Integrates file systems, databases, etc. Single system image, lots
of support for administration. O-O model, with storage device objects,
logical device objects, volume objects, and file objects. Methods for each
type of object, including administrative methods.
Abstract: Modern applications such as `video on
demand' require fast reading of complete files, which can be supported well
by file striping. Many conventional applications, however, are only
interested in some part of the available records. In order to avoid reading
attributes irrelevant to such applications, each attribute could be stored in
a separate (transposed) file; Aiming at I/O parallelism, byte-oriented
striping could be applied to transposed files. However, such a fragmentation
ignores the semantics of data. This fragmentation cannot be optimized by a
database management system (DBMS) because a DBMS has to perform its tasks on
the basis of data semantics. For example, queries must be translated into
file operations using a scheme that maps a data model to a file system.
However, details about files, such as the striping width, are invisible to a
DBMS. Therefore, we propose to store each transposed file related to a
composite type on a separate, independent disk drive, which means I/O
parallelism tuned to a data model. As we also aim at system reliability and
data availability, each transposed file must be duplicated on another drive.
Consequently, a DBMS also has to guarantee correctness and completeness of
the allocation of transposed files within an array of disk drives. As a
solution independent of the underlying data model, we propose an abstract
framework consisting of a meta model and a set of rules
Keywords: database, parallel I/O, pario-bib
Keywords: multiprocessor file system, file access
pattern, parallel I/O, hypercube, pario-bib
Comment: Census-data processing on an nCUBE/10 at
USC. Their program uses an interleaved pattern, which is like my lfp or gw
with multi-record records (i.e., the application does its own blocking).
Shifted to asynchronous I/O to do OBL manually. Better results if they did
more computation per I/O (of course).
The first method is a
straightforward out-of-core variant of a well-known method for in-core,
multidimensional FFTs. It performs 1-dimensional FFT computations on each
dimension in turn. This method is easy to generalize to any number of
dimensions, and it also readily permits the individual dimensions to be of
any sizes that are integer powers of 2. The key step is an out-of-core
transpose operation that places the data along each dimension into contiguous
positions on the parallel disk system so that the data for the 1-dimensional
FFTs are contiguous. The second method is an adaptation of another
well-known method for in-core, multidimensional FFTs. This method computes
all dimensions simultaneously. It is more difficult to generalize to
arbitrary radices and number of dimensions in this method than in the first
method. Our present implementation is therefore limited to two dimensions of
equal size, that are again integer powers of 2. We present I/O
complexity analyses for both methods as well as empirical results for a
DEC 2100 server and an SGI Origin 2000, each of which has a parallel disk
system. Our results indicate that the methods are comparable in speed in
two-dimensions. Abstract: We show two algorithms for computing
multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system
with distributed memory when problem sizes are so large that the data do not
fit in the memory of the entire system. Instead, data reside on a parallel
disk system and are brought into memory in sections. We use the Parallel Disk
Model for implementation and analysis.
Keywords: parallel I/O, out of core, FFT, parallel
algorithm, scientific computing, pario-bib
Comment: Undergraduate Honors Thesis. Advisor: Tom
Cormen.
Keywords: parallel I/O, hashing, reliability, disk
mirroring, pario-bib
Comment: Describes a file system for a distributed
system that scatters records of each file over many disks using hash
functions. The hash function is known by all processors, so no one processor
must be up to access the file. Any portion of the file whose disknode is
available may be accessed. Shadow nodes are used to take over for nodes that
go down, saving the info for later use by the proper node. Intended to easily
parallelize read/write accesses and global file operations, and to increase
file availability.
Keywords: disk model, I/O bus, device model, I/O
model, pario-bib
Keywords: disk model, I/O bus, device model, I/O
model, pario-bib
Given a D-disk parallel I/O
system and a globally shared I/O buffer that can hold upto M disk blocks, we
derive a lower bound of $Ω(\sqrt{D}$) on the competitive ratio of any
deterministic online prefetching algorithm with O(M) lookahead. NOM is shown
to match the lower bound using global M-block lookahead. In contrast, using
only local lookahead results in an $Ω(D)$ competitive ratio. When the
buffer is distributed into D portions of M/D blocks each, the algorithm GREED
based on local lookahead is shown to be optimal, and NOM is within a constant
factor of optimal. Thus we provide a theoretical basis for the intuition that
global lookahead is more valuable for prefetching in the case of a shared
buffer configuration whereas it is enough to provide local lookahead in case
of the distributed configuration. Finally, we analyze the performance of
these algorithms for reference strings generated by a uniformly-random
stochastic process and we show that they achieve the minimal expected number
of I/Os. These results also give bounds on the worst-case expected
performance of algorithms which employ randomization in the data layout.
Abstract: We provide a competitive analysis
framework for online prefetching and buffer management algorithms in parallel
I/O systems, using a read-once model of block references. This has widespread
applicability to key I/O-bound applications such as external merging and
concurrent playback of multiple video streams. Two realistic lookahead
models, global lookahead and local lookahead, are defined. Algorithms NOM and
GREED based on these two forms of lookahead are analyzed for shared buffer
and distributed buffer configurations, both of which occur frequently in
existing systems. An important aspect of our work is that we show how to
implement both the models of lookahead in practice using the simple
techniques of forecasting and flushing.
Keywords: disk prefetching, file caching, parallel
I/O, pario-bib
Comment: See also barve:competitive. They propose
two methods for scheduling prefetch operations in the situation where the
access pattern is largely known in advance, in such a way as to minimize the
total number of parallel I/Os. The two methods are quite straightforward, and
yet match the optimum lower bound for an on-line algorithm.
The upper bound derived on expected
I/O performance of SRM indicates that SRM is provably better than
disk-striped mergesort (DSM) for realistic parameter values D, M, and B.
Average-case simulations show further improvement on the analytical upper
bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms
DSM even when the number D of parallel disks is small. Abstract: We consider the problem of sorting a
file of N records on the D-disk model of parallel I/O in which there are two
sources of parallelism. Records are transferred to and from disk concurrently
in blocks of B contiguous records. In each I/O operation, up to one block can
be transferred to or from each of the D disks in parallel. We propose a
simple, efficient, randomized mergesort algorithm called SRM that uses a
forecast-and-flush approach to overcome the inherent difficulties of simple
merging on parallel disks. SRM exhibits a limited use of randomization and
also has a useful deterministic version. Generalizing the technique of
forecasting, our algorithm is able to read in, at any time, the ``right''
block from any disk, and using the technique of flushing, our algorithm
evicts, without any I/O overhead, just the ``right'' blocks from memory to
make space for new ones to be read in. The disk layout of SRM is such that it
enjoys perfect write parallelism, avoiding fundamental inefficiencies of
previous mergesort algorithms. By analysis of generalized maximum occupancy
problems we are able to derive an analytical upper bound on SRM's expected
overhead valid for arbitrary inputs.
Keywords: parallel I/O algorithm, sorting,
pario-bib
Comment: This paper formerly called
barve:mergesort; I discovered that the paper had appeared in SPAA96, so the
SPAA96 paper is now called barve:mergesort.
Keywords: parallel I/O algorithm, sorting,
pario-bib
Abstract: In modern I/O architectures, multiple
disk drives are attached to each I/O bus. Under I/O-intensive workloads, the
disk latency for a request can be overlapped with the disk latency and data
transfers of requests to other disks, potentially resulting in an aggregate
I/O throughput at nearly bus bandwidth. This paper reports on a performance
impairment that results from a previously unknown form of convoy behavior in
disk I/O, which we call rounds. In rounds, independent requests to distinct
disks convoy, so that each disk services one request before any disk services
its next request. We analyze log files to describe read performance of
multiple Seagate Wren-7 disks that share a SCSI bus under a heavy workload,
demonstrating the rounds behavior and quantifying its performance impact.
Keywords: disk, I/O bus, parallel I/O, pario-bib
Keywords: parallel architecture, array processor,
parallel I/O, SIMD, pario-bib
Comment: This paper is reproduced in Kuhn and
Padua's (1981, IEEE) survey ``Tutorial on Parallel Processing.'' The STARAN
is an array processor that uses Multi-Dimensional-Access (MDA) memories and
permutation networks to access data in bit slices in a variety of ways, with
high-speed I/O capabilities. Its router (called the flip network) could
permute data among the array processors, or between the array processors and
external devices, including disks, video input, and displays.
Keywords: parallel I/O, parallel architecture,
simulation, pario-bib
Keywords: performance evaluation, parallel
architecture, parallel I/O, pario-bib
Comment: They use a simulator to evaluate the
performance of a parallel I/O system. They simulate the network and disks
under a synthetic workload, and measure the time it takes for I/O requests to
traverse the network, be processed, and return. They also measure the impact
of I/O requests on non-I/O messages. Their results are fairly unsurprising.
Keywords: performance evaluation, parallel
architecture, parallel I/O, pario-bib
Keywords: parallel I/O, parallel architecture,
performance analysis, pario-bib
Comment: See polished version
baylor:vulcan-perf-book. Simulation of the I/O architecture for the Vulcan
MPP at IBM TJW. This is a distributed-memory MIMD system with a bidirectional
omega-type interconnection network, and separate compute and I/O nodes. They
use a stochastic workload to evaluate the average I/O performance under a few
different situations, and then use that average performance, along with a
stochastic workload, in a detailed simulation of the interconnection network.
(What would be the effect of adding variance to the I/O-node performance?) A
key point is that the I/O node will not accept any more requests until a
current write request is finished being processed (copied into the write-back
cache). If there are many writes, this can backup the network (would a
different write-request protocol help?) Not clear how concurrency of reads
are modeled. Results show that network saturates for high request rates and
small number of I/O nodes. As request rate decreases or number of I/O nodes
increases, performance levels off to a reasonable value. Placement of I/O
nodes didn't make much difference, nor did extra non-I/O traffic. Given their
parameters, and for reasonable loads, 1 I/O node per 4 compute nodes was a
reasonable balance, and was scalable.
Abstract: Presented are the trace-driven
simulation results of a study conducted to evaluate the performance of the
internal parallel I/O subsystem of the Vulcan massively parallel processor
(MPP) architecture. The system sizes evaluated vary from 16 to 512 nodes. The
results show that a compute node to I/O node ratio of four is the most cost
effective for all system sizes, suggesting high scalability. Also,
processor-to-processor communication effects are negligible for small message
sizes and the greater the fraction of I/O reads, the better the I/O
performance. Worse case I/O node placement is within 13% of more efficient
placement strategies. Introducing parallelism into the internal I/O subsystem
improves I/O performance significantly.
Keywords: parallel I/O architecture, performance
evaluation, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Abstract: In recent years, the design and
performance evaluation of parallel processors has focused on the processor,
memory and communication subsystems. As a result, these subsystems have
better performance potential than the I/O subsystem. In fact, the I/O
subsystem is the bottleneck in many machines. However, there are a number of
studies currently underway to improve the design of parallel I/O subsystems.
To develop optimal parallel I/O subsystem designs, one must have a thorough
understanding of the workload characteristics of parallel I/O and its
exploitation of the associated parallel file system. Presented are the
results of a study conducted to analyze the parallel I/O workloads of several
applications on a parallel processor using the Vesta parallel file system.
Traces of the applications are obtained to collect system events,
communication events, and parallel I/O events. The traces are then analyzed
to determine workload characteristics. The results show I/O request rates on
the order of hundreds of requests per second, a large majority of requests
are for small amount of data (less than 1500 bytes), a few requests are for
large amounts of data (on the order of megabytes), significant file sharing
among processes within a job, and strong temporal, traditional spatial, and
interprocess spatial locality.
Keywords: parallel I/O, workload characterization,
pario-bib
Comment: See polished version
baylor:workload-book. They characterize four parallel applications: sort,
matrix multiply, seismic migration, and video server, in terms of their I/O
activity. They found results that are consistent with kotz:workload, in that
they also found lots of small data requests, some large data requests,
significant file sharing and interprocess locality. This study found less of
the non-contiguous access than did kotz:workload, because of the logical
views provided by Vesta. Note on-line postscript does not include figures.
Abstract: To develop optimal parallel I/O
subsystems, one must have a thorough understanding of the workload
characteristics of parallel I/O and its exploitation of the associated
parallel file system. Presented are the results of a study conducted to
analyze the parallel I/O workloads of several applications on a parallel
processor using the Vesta parallel file system. Traces of the applications
are obtained to collect system events, communication events, and parallel I/O
events. The traces are then analyzed to determine workload characteristics.
The results show I/O request rates on the order of hundreds of requests per
second, a large majority of requests are for small amounts of data (less than
1500 bytes), a few requests are for large amounts of data (on the order of
megabytes), significant file sharing among processes within a job, and strong
temporal, traditional spatial, and interprocess spatial locality.
Keywords: parallel I/O, file access pattern,
workload characterization, file system workload, pario-bib
Comment: Part of a whole book on parallel I/O; see
iopads-book.
Keywords: BBN, parallel I/O, pario-bib
Comment: Administrative manual for the TC2000 I/O
system. Can stripe over partitions in a user-specified set of disks. Large
requests automatically split and done in parallel. See also garber:tc2000.
Keywords: parallel I/O, scientific computing,
linear algebra, pario-bib
Comment: They look at out-of-core block and slab
solvers for the Maspar. They overlap reading one block with the computation
of the previous block. They solve matrices up to 40k x 40k, and obtain 3.14
GFlops even with I/O considered.
Keywords: file access pattern, disk prefetch, file
system, pario-bib
Comment: A specialized database system for
particle physics codes. Valuable for its description of access patterns and
subsequent file access requirements. Particle-in-cell codes iterate over
timesteps, updating the position of each particle, and then the
characteristics of each cell in the grid. Particles may move from cell to
cell. Particle update needs itself and nearby gridcell data. The whole
dataset is too big for memory, and each timestep must be stored on disk for
later analysis anyway. Regular file systems are inadequate: specialized DBMS
is more appropriate. Characteristics needed by their application class:
multidimensional access (by particle type or by location, i.e., multiple
views of the data), coordination between grid and particle data, coordination
between processors, coordinated access to meta-data, inverted files,
horizontal clustering, large blocking of data, asynchronous I/O, array data,
complicated joins, and prefetching according to user-prespecified order. Note
that many of these things can be provided by a file system, but that most are
hard to come by in typical file systems, if not impossible. Many of these
features are generalizable to other applications.
Keywords: hypercube, graphics, parallel algorithm,
parallel I/O, pario-bib
Comment: About using the nCUBE/10's RT Graphics
System. They were frustrated by an unusual mapping from the graphics memory
to the display, a shortage of memory on the graphics nodes, and small message
buffers on the graphics nodes. They wrote some algorithms for collecting the
columns of pixels from the hypercube nodes, and routing them to the
appropriate graphics node. They also would have liked a better
interconnection network between the graphics nodes, at least for
synchronization.
Keywords: parallel I/O, pario-bib
Comment: Jovian is a runtime library for use with
SPMD codes, eg, HPF. They restrict IO to collective operations, and provide
extra processes to 'coalesce' the many requests from multiple CPs into fewer
larger requests to the operating system, perhaps optimized for access order.
They mention that there is a standardization process underway for specifying
data distributions. Also a compact representation for strided access to
n-dimensional data structures. Coalescing basically means combining requests
to eliminate duplication and to combine adjacent requests. Requests to
coalescers are in full blocks, to lower the processing overhead. Nonetheless,
their method involves moving requests around twice, and involve several
memory-memory copies of the data, so their overhead is high.
Keywords: parallel I/O, network, supercomputer
system, pario-bib
Comment: An update of berdahl:woodenman, close to
the final draft.
Keywords: parallel I/O, network, supercomputer
system, pario-bib
Comment: They describe a protocol for making
parallel data transfers of arbitrary data sets from one set of data servers
to another set of data servers. The goal is to be independent of specific
architectures or even types of data servers, and to work on top of existing
transport protocols. The data set is described using a gather set for the
source and a scatter set for the destination, and using a linear address
space as an intermediate representation. All the servers are contacted, they
figure out who they need to talk, and exchange port information with them.
Each pair exchanges votes on who will control the transfer (ie, who will
control the order of the transfer), and on their maximum data rates. This
information is used to settle on the control and set of ports to be used.
This proposal is not final and is under active development, so it may change.
Keywords: parallel computing, performance
evaluation, parallel file system, pario-bib
Comment: In German. They summarize typical
performance of the Intel Paragon, including the communication performance and
the parallel file-system performance.
Keywords: scientific computation, application,
parallel I/O, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Four major components: networking, memory
servers, file system, and persistent object store. Networking part focuses on
low-latency support communication within an application, between
applications, and between machines (Bershad and Peterson). Memory servers,
shared virtual memory, and checkpointing support (Kai Li). File systems
support includes benchmarking, transparent informed prefetching (Gibson), a
common interface for PFS and Vesta (Snir), and integrating secondary and
tertiary storage systems (including the integration of the National Storage
Lab's HPSS (see coyne:hpss) into this project in 1995). OSF/1 (Black) will be
extended to support parallel file systems, extent-like behavior, and block
coalescing. Persistent object store (DeWitt) is radical change to an
object-oriented interface, transparent I/O (though extensible and changable
with subclassing, presumably), and heterogeneous support via the Object
Definition Language standard. Persistent objects may be integrated with the
memory servers and shared virtual memory. See also poole:sio-survey,
bagrodia:sio-character, choudhary:sio-language.
Keywords: fault tolerance, multimedia, video on
demand, parallel I/O, pario-bib
Keywords: parallel I/O, multiprocessor file
system, pario-bib
Comment: Much like Intel CFS, with different I/O
modes that determine when the compute nodes synchronize, and the semantics of
I/Os written to the file. They found it hard to get good bandwidth for
independent I/Os, as opposed to coordinated I/Os; part of this was due to
their RAID 3 disk array, but it is more complicated than that. Some
performance numbers were given in talk.
Keywords: RAID, disk array, reliability, parallel
I/O, pario-bib
Comment: Uses the Information Dispersal Algorithm
(IDA) to generate $n+m$ blocks from $n$ blocks, to tolerate $m$ disk
failures; all of the data from the $n$ blocks is hidden in the $n+m$ blocks.
Not with the RAID project.
Abstract: In wide area computing, programs
frequently execute at sites that are distant from their data. Data access
mechanisms are required that place limited functionality demands on an
application or host system yet permit high-performance implementations. To
address these requirements, we propose a data movement and access service
called Global Access to Secondary Storage (GASS). This service defines a
global name space via Uniform Resource Locators and allows applications to
access remote files via standard I/O interfaces. High performance is achieved
by incorporating default data movement strategies that are specialized for
I/O patterns common in wide area applications and by providing support for
programmer management of data movement. GASS forms part of the Globus
toolkit, a set of services for high-performance distributed computing. GASS
itself makes use of Globus services for security and communication, and other
Globus components use GASS services for executable staging and real-time
remote monitoring. Application experiences demonstrate that the library has
practical utility.
Keywords: wide-area network, parallel I/O,
pario-bib
Keywords: data grid, filter, pario-bib
Keywords: parallel I/O, disk shadowing,
reliability, disk mirroring, disk optimization, pario-bib
Comment: Goes further than bitton:shadow. Uses
simulation to verify results from that paper, which were expressions for the
expected seek distance of shadowed disks, using shortest-seek-time arm
scheduling. Problem is her assumption that arm positions stay independent, in
the face of correlating effects like writes, which move all arms to the same
place. Simulations match model only barely, and only in some cases. Anyway,
shadowed disks can improve performance for workloads more than 60 or 70%
reads.
Keywords: parallel I/O, disk shadowing,
reliability, disk mirroring, disk optimization, pario-bib
Comment: Also TR UIC EECS 88-1 from Univ of
Illinois at Chicago. Shadowed disks are mirroring with more than 2 disks.
Writes to all disks, reads from one with shortest seek time. Acknowledges but
ignores problem posed by lo:disks. Also considers that newer disk technology
does not have linear seek time $(a+bx)$ but rather $(a+b\sqrt{x})$. Shows
that with either seek distribution the average seek time for workloads with
at least 60% reads decreases in the number of disks. See also
bitton:schedule.
Keywords: parallel I/O, file access pattern,
pario-bib
Comment: A substantial part of this
structural-analysis application was involved in I/O, moving substructures in
and out of RAM. The Maspar IO-RAM helped a lot, nearly halving the time
required. On the Cray, the SSD had an even bigger impact, perhaps 7-12 times
faster. Their main conclusion is that caching helped. Most likely this was
due to its double-buffering, since they structured the code to
read/compute/write in large ``superblocks''.
Keywords: disk array, RAID, parallel I/O,
pario-bib
Comment: Part of jin:io-book.
Abstract: There is a growing interest in using
Java as the language for developing high-performance computing applications.
To be successful in the high-performance computing domain, however, Java must
not only be able to provide high computational performance, but also
high-performance I/O. In this paper, we first examine several approaches that
attempt to provide high-performance I/O in Java-many of which are not
obvious at first glance-and evaluate their performance on two parallel
machines, the IBM SP and the SGI Origin2000. We then propose extensions to
the Java I/O library that address the deficiencies in the Java I/O API and
improve performance dramatically. The extensions add bulk (array) I/O
operations to Java, thereby removing much of the overhead currently
associated with array I/O in Java. We have implemented the extensions in two
ways: in a standard JVM using the Java Native Interface (JNI) and in a
high-performance parallel dialect of Java called Titanium. We describe the
two implementations and present performance results that demonstrate the
benefits of the proposed extensions.
Keywords: parallel I/O, Java, file system
interface, pario-bib
Abstract: There is a growing interest in using
Java as the language for developing high-performance computing applications.
To be successful in the high-performance computing domain, however, Java must
not only be able to provide high computational performance, but also
high-performance I/O. In this paper, we first examine several approaches that
attempt to provide high-performance I/O in Java-many of which are not
obvious at first glance-and evaluate their performance on two parallel
machines, the IBM SP and the SGI Origin2000. We then propose extensions to
the Java I/O library that address the deficiencies in the Java I/O API and
improve performance dramatically. The extensions add bulk (array) I/O
operations to Java, thereby removing much of the overhead currently
associated with array I/O in Java. We have implemented the extensions in two
ways: in a standard JVM using the Java Native Interface (JNI) and in a
high-performance parallel dialect of Java called Titanium. We describe the
two implementations and present performance results that demonstrate the
benefits of the proposed extensions.
Keywords: parallel I/O, java, file system
interface, pario-bib
Keywords: parallel I/O, database, disk caching,
pario-bib
Comment: More recent than copeland:bubba, and a
little more general. This gives few details, and doesn't spend much time on
the parallel I/O. Bubba does use parallel independent disks, with a
significant effort to place data on the disks, and do the work local to the
disks, to balance the load and minimize interprocessor communication. Also
they use a single-level store (i.e., memory-mapped files) to improve
performance of their I/O system, including page locking that is assisted by
the MMU. The OS has hooks for the database manager to give memory-management
policy hints.
Keywords: file access pattern, parallel I/O,
database machine, pario-bib
Comment: Improvements in I/O bandwidth crucial for
supporting database machines, otherwise highly parallel DB machines are
useless (I/O bound). Two ways to do it: 1) synchronized interleaving by using
custom controller and regular disks to read/write same track on all disks,
which speeds individual accesses. 2) use very large cache (100-200M) to keep
blocks to re-use and to do prefetching. But see dewitt:pardbs.
Keywords: collective I/O, multiprocessor file
system, parallel I/O, pario-bib
Comment: bordawekar:collective was renamed
bordawekar:collective-tr, so this could be called bordawekar:collective.
Abstract: A majority of parallel applications
obtain parallelism by partitioning data over multiple processors. Accessing
distributed data structures like arrays from files often requires each
processor to make a large number of small non-contiguous data requests. This
problem can be addressed by replacing small non-contiguous requests by large
collective requests. This approach, known as Collective I/O, has been found
to work extremely well in practice. In this paper, we describe implementation
and evaluation of a collective I/O prototype in a production parallel file
system on the Intel Paragon. The prototype is implemented in the PFS
subsystem of the Intel Paragon Operating System. We evaluate the collective
I/O performance using its comparison with the PFS M_RECORD and M_UNIX I/O
modes. It is observed that collective I/O provides significant performance
improvement over accesses in M_UNIX mode. However, in many cases, various
implementation overheads cause collective I/O to provide lower performance
than the M_RECORD I/O mode.
Keywords: parallel I/O, mutliprocessor file
system, pario-bib
Comment: This tech report was called
bordawekar:collective, then renamed bordawekar:collective-tr, on the
appearance of the ICS paper bordawekar:collective.
Keywords: parallel I/O, inter-processor
communication, pario-bib
Comment: bordawekar:comm-tr is nearly identical in
content. Also bordawekar:commstrat is a shorter version.
Abstract: In this paper, we show that
communication in the out-of-core distributed memory problems requires both
inter-processor communication and file I/O. Given that primary data
structures reside in files, even communication requires I/O. Thus, it is
important to optimize the I/O costs associated with a communication step. We
present three methods for performing communication in out-of-core distributed
memory problems. The first method, termed as the "out-of-core" communication
method, follows a loosely synchronous model. Computation and Communication
phases in this case are clearly separated, and communication requires
permutation of data in files. The second method, termed as
"demand-driven-in-core communication" considers only communication required
of each in-core data slab individually. The third method, termed as
"producer-driven-in-core communication" goes even one step further and tries
to identify the potential (future) use of data while it is in memory. We
describe these methods in detail and provide performance results for
out-of-core applications; namely, two-dimensional FFT and two-dimensional
elliptic solver. Finally, we discuss how "out-of-core" and "in-core"
communication methods could be used in virtual memory environments on
distributed memory machines.
Keywords: parallel I/O, inter-processor
communication, pario-bib
Comment: They compare different ways to do global
communications in out-of-core applications, involving file I/O and
communication at different times. They also comment briefly on how it would
work if it depended on virtual memory at each node.
Keywords: interprocessor communication, parallel
I/O, pario-bib
Comment: Small version of bordawekar:comm.
Abstract: It is widely acknowledged that improving
parallel I/O performance is critical for widespread adoption of high
performance computing. In this paper, we show that communication in
out-of-core distributed memory problems may require both inter-processor
communication and file I/O. Thus, in order to improve I/O performance, it is
necessary to minimize the I/O costs associated with a communication step. We
present three methods for performing communication in out-of-core distributed
memory problems. The first method called the generalized collective
communication method follows a loosely synchronous model; computation and
communication phases are clearly separated, and communication requires
permutation of data in files. The second method called the receiver-driven
in-core communication considers only communication required of each in-core
data slab individually. The third method called the owner-driven in-core
communication goes even one step further and tries to identify the potential