Fault Tolerance

The speaker was Keith Marzullo, Department of Computer Science and Engineering,
University of California, San Diego.

Questions should be addressed to

The guiding question was: Does the notion of fault tolerance change when applied to mobile agents?

Three definitions of fault tolerance were offered:

  1. masking
  2. detection and recovery
  3. atomicity

Masking goes back to von Neumann's scheme of "replication and voting". Separate units do their computations and the result is decided upon by majority vote.
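
As an illustration of replication and voting, here is a minimal sketch in Python. It is not taken from the talk: the names majority_vote, run_replicated, good and faulty are hypothetical, and the sketch assumes that every unit receives the same input and returns a hashable result.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of units, or raise."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no strict majority among the units")
    return value


def run_replicated(units, x):
    """Run every unit on the same input and mask faults by voting."""
    return majority_vote([unit(x) for unit in units])


# Hypothetical units: two correct ones and one that returns a wrong value.
def good(x):
    return x * x


def faulty(x):
    return x * x + 1


masked = run_replicated([good, faulty, good], 7)
```

With three units, one wrong result is outvoted and the caller never sees it. With only two units a disagreement could be noticed but not resolved by voting.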

Detection and recovery requires less replication than masking. Detecting errors is easier than correcting them.
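
The difference in replication can be sketched as follows, assuming a single faulty unit that returns a wrong value. Two units are enough to notice a disagreement, whereas outvoting the same fault takes three. The names run_with_detection, square and faulty_square are hypothetical, and recover stands in for whatever recovery action a real system would take (for example, re-running from saved state elsewhere).

```python
def run_with_detection(unit_a, unit_b, x, recover):
    """Run two units; a mismatch signals an error and triggers recovery.

    Returns (result, error_detected).
    """
    first, second = unit_a(x), unit_b(x)
    if first == second:
        return first, False
    # The units disagree: we know something went wrong, but not which
    # unit is at fault, so the result comes from the recovery action.
    return recover(x), True


# Hypothetical units for illustration.
def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1


result, detected = run_with_detection(square, faulty_square, 7, recover=square)
```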

Atomicity (Indivisibility) is borrowed from database transaction models. All effects of all operations of a transaction are committed, or none of them are.

Issues pertaining to mobile agents:

The problem with masking is spoofing: a "wrong" unit can forward results to the next stage, which can skew the majority vote. A possible solution, proposed by Fred Schneider (Cornell University), is an authenticating trajectory protocol. The question remains whether this expensive, complex protocol is suited to mobile agents.


An alternative would be the so-called Norwegian Army protocol. The idea is that the mobile agent leaves data at the nodes it has visited. This data can be used to recover agent state. If an agent fails at node W at time t1, data it left at node C ("the rear guard") at time t0 can be used for recovery.
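
The rear guard idea can be sketched as below. This is a simplified, single-process illustration and not the protocol as presented: Node, AgentCrashed, execute_at and run_itinerary are hypothetical names, the crash is simulated by a flag that fires once, and the loop plays the role of whoever notices the failure and restarts the agent from the rear guard's data.

```python
import copy


class AgentCrashed(Exception):
    pass


class Node:
    def __init__(self, name, work, crash=False):
        self.name = name
        self.work = work        # function: agent state -> new agent state
        self.crash = crash      # if True, the agent fails here once
        self.checkpoint = None  # data left behind by the agent


def execute_at(node, state):
    if node.crash:
        node.crash = False      # transient failure, for illustration only
        raise AgentCrashed(node.name)
    return node.work(state)


def run_itinerary(nodes, state):
    """Visit nodes in order; on a crash, restart from the rear guard's data."""
    rear_guard = None
    recoveries = 0
    i = 0
    while i < len(nodes):
        node = nodes[i]
        try:
            state = execute_at(node, copy.deepcopy(state))
        except AgentCrashed:
            if rear_guard is None:
                raise           # nothing was left behind to recover from
            state = copy.deepcopy(rear_guard.checkpoint)
            recoveries += 1
            continue            # retry the same node with the recovered state
        node.checkpoint = copy.deepcopy(state)  # leave data before moving on
        rear_guard = node
        i += 1
    return state, recoveries


def visit(name):
    return lambda state: state + [name]


# The agent leaves data at C, then fails at W and is recovered from C.
route = [Node("C", visit("C")), Node("W", visit("W"), crash=True)]
final_state, recoveries = run_itinerary(route, [])
```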

Another technique would be Fail-Stop Reliable Broadcast. The general idea is to traverse a hierarchical tree and "adopt" children whose parents are faulty. For agents, a similar technique would be used before execution is started. The term introduced was the Linear Broadcast strategy, which is based on primary backup.
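
The "adoption" step can be sketched as follows. This is an assumption-laden illustration rather than the protocol from the talk: the tree is a plain dict from node to children, the set of faulty nodes is known to the sender (the fail-stop assumption that failures are detectable), the root is assumed correct, and nodes do not fail in the middle of sending. The name broadcast is hypothetical.

```python
def broadcast(tree, root, faulty, message):
    """Deliver message down the tree, adopting the children of faulty nodes.

    Returns a dict mapping each node that received the message to the message.
    """
    if root in faulty:
        raise ValueError("this sketch assumes the root is not faulty")
    delivered = {root: message}

    def send_from(sender):
        pending = list(tree.get(sender, []))
        while pending:
            child = pending.pop(0)
            if child in faulty:
                # The child has stopped: the sender adopts the child's
                # children and sends to them directly.
                pending.extend(tree.get(child, []))
            else:
                delivered[child] = message
                send_from(child)

    send_from(root)
    return delivered


# Hypothetical tree: r -> a, b; a -> c, d; b -> e. Node a has failed.
tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
received = broadcast(tree, "r", faulty={"a"}, message="start")
```

Here r adopts c and d once it finds that a has failed, so every working node still receives the message.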

Another alternative would be transactions. The problem is that the idea is deceptively simple: overlocking of the system can occur, but the speaker felt that this could be overcome with mobility.

Finally, the idea of Programming for Fault Tolerance was introduced. This should be built on some variant of agreement or reliable broadcast.

The lesson to take home is that detection and recovery is more appropriate for mobile agents than masking.

Discussion (in chronological order):

Q: What happens when the "rear guard" is disconnected?

A: You have to be smart in choosing your rear guard.

Q: The speaker mentioned intrusion detection models in passing, but dismissed them. Why?

A: Intrusion detection work is very immature. "Turn on a red light" was the phrase the speaker used: "this is what they do, that's too limited".

Q: What is new about Agents and Fault Tolerance?

A: Masking could be used for detection with mobile agents.

C: Distributed agents make coherent atomic commit SAGA difficult, because there is no controller.

A: This writer could not follow the answer to this comment.

Q: Failures like "unknown host" and "network address not found" are much more common than dead agents. Isn't this focus on mobile agents misplaced?

A: This writer could not follow the answer to this question.

Q: Does mobility help fault tolerance in the sense that on-the-fly replication of "good" components can drown out "bad" ones?

A: Yes, that is a good idea in theory, but no system is known to be deployed that does this. An audience member mentioned that SAIC monitors earthquakes in Eastern Europe using this approach.