Papers

https://www.cs.dartmouth.edu/~sws/abstracts/sjt95.shtml Last modified: 08/27/03 11:56:53 AM

S.W. Smith, D.B. Johnson, J.D. Tygar.
``Completely Asynchronous Optimistic Recovery with Minimal Rollbacks.''
25th International Symposium on Fault-Tolerant Computing.
1995.

Abstract

Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on message logging and replay is desirable since failure-free operation requires no synchronization between processes, and since logging a received message is cheaper than recording a checkpoint. Furthermore, surviving processes have the ability to recreate states other than those recorded in checkpoints, so only computation that depends on the failure must be rolled back. Although optimistic rollback recovery protocols make failure-free operation even cheaper by logging received messages asynchronously, optimism complicates recovery. Previous optimistic rollback recovery protocols have either required synchronization during recovery or permitted a failure at one process to trigger a potentially exponential number of process rollbacks. In this paper, we present an optimistic rollback recovery protocol that provides completely asynchronous recovery while also reducing the number of times a process must roll back in response to a failure to at most one.
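
The general idea of message logging and replay mentioned in the abstract can be illustrated with a minimal sketch. This is not the protocol presented in the paper: the class `Process`, its methods, and the arithmetic message handler are hypothetical names chosen for illustration. The sketch also assumes that every logged message has already reached stable storage, which sets aside the asynchronous logging that makes optimistic recovery difficult.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical deterministic process with a checkpoint and a message log."""
    state: int = 0                            # volatile state, lost on failure
    checkpoint: int = 0                       # last state saved to stable storage
    log: list = field(default_factory=list)   # messages received since the checkpoint

    @staticmethod
    def _apply(state: int, msg: int) -> int:
        # Deterministic handler: the same state and message always give the same result.
        return state * 31 + msg

    def deliver(self, msg: int) -> None:
        # Log the received message, then process it.
        self.log.append(msg)
        self.state = self._apply(self.state, msg)

    def take_checkpoint(self) -> None:
        # Record the full state; earlier log entries are no longer needed.
        self.checkpoint = self.state
        self.log.clear()

    def state_after(self, k: int) -> int:
        # Recreate the state reached after the first k logged messages,
        # a state that was never itself recorded in a checkpoint.
        state = self.checkpoint
        for msg in self.log[:k]:
            state = self._apply(state, msg)
        return state

    def recover(self) -> None:
        # Restore the checkpoint and replay the whole log.
        self.state = self.state_after(len(self.log))
```

Because the handler is deterministic, replaying the log from the checkpoint reproduces the pre-failure state, and `state_after` shows how a process can recreate intermediate states rather than rolling back all the way to a checkpoint.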

Download

PDF
