S.W. Smith, D.B. Johnson, J.D. Tygar.
``Completely Asynchronous Optimistic Recovery with Minimal Rollbacks.''
25th International Symposium on Fault-Tolerant Computing.
Consider the problem of transparently recovering an
asynchronous distributed computation when one or more
processes fail. Basing rollback recovery on message logging
and replay is desirable since failure-free operation requires
no synchronization between processes, and since logging
a received message is cheaper than recording a checkpoint.
Furthermore, surviving processes have the ability to
recreate states other than those recorded in checkpoints,
so only computation that depends on the failure must be
rolled back. Although optimistic rollback recovery protocols
make failure-free operation even cheaper by logging
received messages asynchronously, optimism complicates
recovery. Previous optimistic rollback recovery protocols
have either required synchronization during recovery, or
have permitted a failure at one process to potentially trigger
an exponential number of process rollbacks. In this
paper, we present an optimistic rollback recovery protocol
that provides completely asynchronous recovery, while also
reducing the number of times a process must roll back in
response to a failure to at most one.
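
The basic idea of recovery by message logging and replay, which the abstract builds on, can be illustrated with a minimal sketch. This is not the protocol from the paper: it is a single toy process, it logs messages synchronously, and it has no dependency tracking or asynchronous recovery. The class name Process, its methods, and the arithmetic state transition are all hypothetical, chosen only to show that a deterministic process can recreate states that were never checkpointed.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy deterministic process: its state depends only on received messages."""
    state: int = 0
    checkpoint: int = 0                      # last state saved (stands in for stable storage)
    log: list = field(default_factory=list)  # messages received since the checkpoint

    def receive(self, msg: int) -> None:
        self.log.append(msg)                 # log the message before acting on it
        self.state = self.state * 31 + msg   # arbitrary deterministic transition

    def take_checkpoint(self) -> None:
        self.checkpoint = self.state
        self.log.clear()                     # older messages are no longer needed

    def recover(self, upto=None) -> None:
        """Restore the checkpoint, then replay the first `upto` logged messages."""
        replay = self.log if upto is None else self.log[:upto]
        self.state = self.checkpoint
        self.log = []
        for msg in replay:
            self.receive(msg)


p = Process()
p.receive(3)
p.take_checkpoint()
p.receive(5)
p.receive(7)
state_before_failure = p.state

p.state = -1                 # simulate a crash that loses volatile state
p.recover()                  # replay the full log
assert p.state == state_before_failure

p.recover(upto=1)            # roll back to a state that was never checkpointed
assert p.log == [5]
```

The second call to recover shows why logging allows finer rollback than checkpoints alone: the process returns to the state just after receiving the first logged message, even though no checkpoint was taken there. In an optimistic protocol the log would be written asynchronously, so some received messages could be lost in a failure, which is the complication the abstract describes.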
Smith Johnson 1996