Unlike traditional parallel computers, grid applications execute in environments with unavoidable latency, low bandwidth, and unpredictable behavior. Our data-flow-based solution allows the application programmer and the dataset provider describe and deploy a network of application-specific and dataset-specific functionality across the grid. The application (or application library) controls virtually all aspects of the I/O system through a distributed graph of application components, including the application interface, optimization policies like caching and prefetching, and remote filtering of datasets. Armada supports remote execution by allowing the different components to execute on processors used by the client application, on processors used by storage servers, or on intermediate processors in the network. Our system automatically restructures the graph to distribute data flow and computation throughout the graph, and it places individual components using a scheme that is both beneficial to the application and considerate of administrative-domain allocation policies.
Our approach demonstrates that a flexible design along with careful attention to data-flow performance can lead to efficient I/O for grid applications. Our performance results show that the data-flow model does an exceptional job of hiding network latency inherent in grid computing. Applications using Armada perform well in low-bandwidth environments because restructured graphs allow an effective placement of Armada ships. Applications using Armada also perform well in high-bandwidth environments because restructured graphs often include end-to-end parallelism.