15 Failures

Partial Failures

Part of the system of failing, other machine might be doing work while critical component fails; ends up in a strange state

Fault Tolerance

Property of a distributed system to be operational when partial failure

Availability

property of a system to be used immediately

Reliability

property of a system to run without failure

Availability v.s. Reliability

Highly available, but unreliable - fails every minute but recovers immediately
Highly reliable, but not available - Doesn't fail for years, but when fails, takes months to recover

MTTF (Mean Time To Failure) - average time until a component fails
MTTR (Mean Time to Repair) - average time needed to repair a component
MTBF (Mean Time between Failure) - MTTF + MTTR

greater MTBF, more reliable
MTTF/MTBF = availability of the system

Detecting Failures

Heartbeats

"Are you alive?" from all to all

In synchronous system, where message delivery times are bounded, we can be sure that a
suspected failure is a real failure.

challenges:

Networks are partitioned, cannot all to all
Partially synchronous system, cannot tell if message didn't get through or machine fail
- Result in false positives - remedy: ask neighbors if they have the same view

Gossip

Addition to heartbeats, also information about which machines we think are up

Failure Masking by Redundacy

Time Redundancy - If operation fails, retransmit request to the server
Physical Redundancy - extra hardware as back-up

RPC Semantics with Failures

Failures in RPC:

The client crashes after sending request
The request message is lost from client to server
The server crashes before recieving request
The response message is lost from server to client.

M (Send reponse)
P (Perform operation)
C (Crash)

Pasted image 20251013172705.png
In No Setup is Exactly-Once Possible. However, when Always, we can acheive exactly-once if the system is idempotent