Fault Tolerance

Definitions

  • Availability
    • Probability the system operates correctly at any given moment
  • Reliability
    • Ability to run correctly for a long interval of time
  • Safety
    • Failing to operate correctly does not lead to catastrophic consequences
  • Maintainability
    • Ability to “easily” repair a failed system

Failure Models

  • Different types of failures

Two-Army Problem

Byzantine Agreement 

  • [Lamport et al. (1982)]
  • Goal
    • Each process learns the true values sent by correct processes
  • Assumptions
    • Every message that is sent is delivered correctly
    • The receiver knows who sent the message
    • Message delivery time is bounded
  • Byzantine Agreement Result
    • In a system with m faulty processes, agreement can be achieved only if at least 2m+1 processes function correctly, for a total of at least 3m+1 processes.
    • Note
      • This result only guarantees that each correct process learns the true values sent by the other correct processes; it does not identify which processes are correct!
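
The bound above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical and exist only for illustration.

```python
def min_processes_for_agreement(m: int) -> int:
    """Smallest total group size that tolerates m faulty processes (3m + 1)."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return 3 * m + 1


def agreement_possible(n: int, m: int) -> bool:
    """True if, out of n processes with m faulty, at least 2m + 1 are correct."""
    return n - m >= 2 * m + 1


for n, m in [(3, 1), (4, 1), (6, 2), (7, 2)]:
    print(f"n={n}, m={m}: agreement possible = {agreement_possible(n, m)}")
```

With one faulty process, three processes are not enough, but four are.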

Byzantine Generals Problem

  • Phase 1: Generals announce their troop strengths to each other

  • Phase 2: Each general constructs a vector with all reported troop strengths

  • Phase 3: Generals send their vectors to each other, and each general takes a majority vote on every vector element
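
The three phases can be simulated in a few lines. This is a sketch under stated assumptions, not a definitive implementation: messages are delivered reliably with known senders (as assumed above), and the `lie` hook that models a faulty general's behavior, along with all other names, is hypothetical.

```python
from collections import Counter

UNKNOWN = None  # marker for "no majority value"


def majority(values):
    """Return the value reported by a strict majority, or UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN


def run_agreement(strengths, faulty, lie):
    """Simulate the three phases and return each correct general's result.

    strengths: dict general -> true troop strength
    faulty:    set of faulty generals
    lie:       hypothetical hook lie(sender, receiver, about) giving the
               value a faulty sender reports to a receiver
    """
    generals = sorted(strengths)

    # Phases 1 and 2: every general announces its strength, and each
    # general g builds a vector of what it heard from every general s.
    vectors = {
        g: {
            s: strengths[s] if (s == g or s not in faulty) else lie(s, g, s)
            for s in generals
        }
        for g in generals
    }

    # Phase 3: generals exchange vectors; each correct general takes a
    # majority vote on every element of the vectors it received.
    results = {}
    for r in generals:
        if r in faulty:
            continue
        received = []
        for s in generals:
            if s == r:
                continue
            if s in faulty:
                received.append({a: lie(s, r, a) for a in generals})
            else:
                received.append(vectors[s])
        results[r] = {a: majority([v[a] for v in received]) for a in generals}
    return results


def lie(sender, receiver, about):
    # The faulty general tells every receiver something different.
    return 1000 + 10 * receiver + about


# Four generals, general 3 is faulty (n = 3m + 1 with m = 1).
result = run_agreement({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}, lie=lie)
print(result[1])  # {1: 1, 2: 2, 3: None, 4: 4}

# Three generals, general 3 is faulty: no majority can be formed.
too_few = run_agreement({1: 1, 2: 2, 3: 3}, faulty={3}, lie=lie)
print(too_few[1])  # {1: None, 2: None, 3: None}
```

With four generals the correct ones agree on each other's values and mark the faulty general's entry as unknown; with three, a single liar is enough to prevent any majority.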

Reliable Group Communication

  • Reliable Multicast
    • All nonfaulty processes that do not join or leave the group during communication receive the message

  • Atomic Multicast
    • All messages are delivered in the same order to all processes

View Delivery

  • A view reflects current membership of group
  • A view is delivered when a membership change occurs and the application is notified of the change
    • Receiving a view is different from delivering a view
      • All members have to agree to the delivery of a view
  • View synchronous group communication
    • The delivery of a new view draws a conceptual line across the system and every message is either delivered on one side or the other of that line


Atomic Multicast

  • All messages are delivered in the same order to “all” processes
  • Group view
    • The set of processes known by the sender when it multicasts the message
  • Virtual synchronous multicast
    • A message multicast to a group view G is delivered to all nonfaulty processes in G
      • If the sender fails after sending the message, the message may be delivered to no one

Virtual Synchrony Implementation
  • Only stable messages are delivered
  • Stable message
    • A message received by all processes in the message’s group view
  • Assumptions (can be ensured by using TCP)
    • Point-to-point communication is reliable
    • Point-to-point communication ensures FIFO-ordering
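
The stability rule can be sketched from the point of view of a single process: a received message is held back until every member of its group view is known to have received it. How a process learns about the other members' receipts (modeled here by `on_ack`) is an assumption of this sketch, and all class and method names are hypothetical.

```python
class Process:
    """Holds received messages until they are stable, then delivers them."""

    def __init__(self, name, group_view):
        self.name = name
        self.group_view = frozenset(group_view)
        self.held = {}       # msg_id -> payload, received but not delivered
        self.acked_by = {}   # msg_id -> members known to have received it
        self.delivered = []  # payloads handed to the application, in order

    def on_receive(self, msg_id, payload):
        """Called when the message itself arrives at this process."""
        self.held[msg_id] = payload
        self.on_ack(msg_id, self.name)

    def on_ack(self, msg_id, member):
        """Called when we learn that `member` has received the message."""
        self.acked_by.setdefault(msg_id, set()).add(member)
        if msg_id in self.held and self.is_stable(msg_id):
            self.delivered.append(self.held.pop(msg_id))

    def is_stable(self, msg_id):
        """Stable: received by every process in the message's group view."""
        return self.acked_by.get(msg_id, set()) >= self.group_view


p = Process("A", {"A", "B", "C"})
p.on_receive("m1", "hello")
p.on_ack("m1", "B")
print(p.delivered)  # [] because C has not yet received m1
p.on_ack("m1", "C")
print(p.delivered)  # ['hello']
```

The sketch ignores view changes and failures; it only illustrates the "deliver when stable" condition.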


Message Ordering

  • Total ordering does not imply causality or FIFO!
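
A small counterexample makes this concrete. In the sketch below, both receivers deliver the same sequence (total order holds), yet sender P1's second message is delivered before its first (FIFO is violated). The checker functions and process names are hypothetical, and the total-order check assumes every process delivers the same set of messages.

```python
def is_total_order(deliveries):
    """True if every process delivers the messages in the same sequence.

    deliveries: dict process -> list of message ids in delivery order.
    Assumes all processes deliver the same set of messages.
    """
    sequences = list(deliveries.values())
    return all(seq == sequences[0] for seq in sequences)


def is_fifo(deliveries, send_order):
    """True if each sender's messages are delivered in the order sent.

    send_order: dict sender -> list of message ids in sending order.
    """
    for seq in deliveries.values():
        position = {msg: i for i, msg in enumerate(seq)}
        for sent in send_order.values():
            indices = [position[msg] for msg in sent if msg in position]
            if indices != sorted(indices):
                return False
    return True


send_order = {"P1": ["a1", "a2"], "P2": ["b1"]}
deliveries = {
    "Q1": ["a2", "b1", "a1"],
    "Q2": ["a2", "b1", "a1"],
}

print(is_total_order(deliveries))       # True
print(is_fifo(deliveries, send_order))  # False
```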

