Unit 5

Unit 5: Fault Tolerance and Consensus - Complete Notes


1. Introduction to Fault Tolerance

Definition:

  • Fault Tolerance is the ability of a system to continue functioning correctly even in the presence of faults (e.g., hardware failures, software bugs, network issues).

Why It’s Important:

  • Ensures system reliability and availability.

  • Example: A banking system must continue working even if a server fails.

Key Concepts:

  1. Fault:

    • A defect or failure in a system component.

    • Example: A crashed server or a broken network link.

  2. Error:

    • A deviation from correct behavior due to a fault.

    • Example: Incorrect data due to a faulty sensor.

  3. Failure:

    • The system’s inability to perform its intended function.

    • Example: A website going offline due to a server crash.

Techniques for Fault Tolerance:

  1. Redundancy:

    • Adding extra components to ensure system functionality.

    • Example: Replicating data across multiple servers.

  2. Checkpointing:

    • Periodically saving the system state to recover from failures.

    • Example: Database systems saving transaction logs.

  3. Recovery:

    • Restoring the system to a correct state after a failure.

    • Example: Rebooting a server after a crash.


Mind Map for Introduction to Fault Tolerance:

Fault Tolerance  
├── Definition  
│   └── System functioning despite faults  
├── Key Concepts  
│   ├── Fault (Example: Crashed server)  
│   ├── Error (Example: Incorrect data)  
│   └── Failure (Example: Website offline)  
└── Techniques  
    ├── Redundancy (Example: Data replication)  
    ├── Checkpointing (Example: Transaction logs)  
    └── Recovery (Example: Server reboot)

2. Distributed Commit Protocols

Definition:

  • Distributed Commit Protocols ensure that a distributed transaction is either fully committed or fully rolled back across all nodes.

Why It’s Important:

  • Ensures data consistency in distributed systems.

  • Example: A distributed database must ensure that all nodes agree on whether to commit or abort a transaction.

Types of Commit Protocols:

  1. Two-Phase Commit (2PC):

    • A coordinator ensures all participants agree to commit or abort.

    • Phases:

      1. Prepare Phase: Coordinator asks participants if they are ready to commit.

      2. Commit Phase: If all participants agree, the coordinator sends a commit request.

    • Example: Distributed databases like MySQL Cluster.

  2. Three-Phase Commit (3PC):

    • Adds a pre-commit phase to reduce blocking in case of failures.

    • Phases:

      1. Prepare Phase: Same as 2PC.

      2. Pre-Commit Phase: Coordinator ensures all participants are ready to commit.

      3. Commit Phase: Coordinator sends the final commit request.

    • Example: High-availability systems like Google Spanner.

Comparison:

Feature

Two-Phase Commit (2PC)

Three-Phase Commit (3PC)

Phases

2 (Prepare, Commit)

3 (Prepare, Pre-Commit, Commit)

Blocking

Can block if coordinator fails

Non-blocking due to pre-commit phase

Use Case

Simple distributed systems

High-availability systems


Mind Map for Distributed Commit Protocols:

Distributed Commit Protocols  
├── Two-Phase Commit (2PC)  
│   ├── Phases: Prepare, Commit  
│   └── Example: MySQL Cluster  
├── Three-Phase Commit (3PC)  
│   ├── Phases: Prepare, Pre-Commit, Commit  
│   └── Example: Google Spanner  
└── Comparison  
    ├── Phases, Blocking, Use Case

3. Byzantine Fault Tolerance

Definition:

  • Byzantine Fault Tolerance (BFT) is the ability of a system to tolerate arbitrary (malicious) faults, where nodes may behave unpredictably or maliciously.

Why It’s Important:

  • Ensures system reliability even in the presence of malicious nodes.

  • Example: Blockchain systems must tolerate malicious nodes trying to disrupt the network.

How It Works:

  1. Consensus Algorithms:

    • Nodes agree on a single value despite malicious behavior.

    • Example: Practical Byzantine Fault Tolerance (PBFT).

  2. Redundancy:

    • Multiple nodes verify and agree on the correctness of data.

    • Example: Replicating transactions across multiple nodes in a blockchain.

Real-World Example:

  • Blockchain Networks: Bitcoin and Ethereum use BFT to ensure consensus among nodes, even if some nodes are malicious.


Mind Map for Byzantine Fault Tolerance:

Byzantine Fault Tolerance (BFT)  
├── Definition  
│   └── Tolerates arbitrary/malicious faults  
├── How It Works  
│   ├── Consensus Algorithms (Example: PBFT)  
│   └── Redundancy (Example: Blockchain)  
└── Real-World Example  
    └── Blockchain Networks (Bitcoin, Ethereum)

4. Impossibilities in Fault Tolerance

Definition:

  • Impossibilities in Fault Tolerance refer to theoretical limits on what can be achieved in fault-tolerant systems.

Key Impossibility Results:

  1. FLP Impossibility:

    • It is impossible to achieve consensus in an asynchronous distributed system with even a single faulty process.

    • Implication: No algorithm can guarantee consensus in all cases.

    • Example: In a network with delays, consensus may not always be reached.

  2. CAP Theorem:

    • A distributed system cannot simultaneously provide Consistency, Availability, and Partition Tolerance.

    • Implication: Systems must trade off between these properties.

    • Example:

      • CP Systems: Prioritize Consistency and Partition Tolerance (e.g., distributed databases).

      • AP Systems: Prioritize Availability and Partition Tolerance (e.g., NoSQL databases).

Real-World Example:

  • Distributed Databases: MongoDB (AP system) prioritizes availability, while PostgreSQL (CP system) prioritizes consistency.


Mind Map for Impossibilities in Fault Tolerance:

Impossibilities in Fault Tolerance  
├── FLP Impossibility  
│   ├── No consensus in asynchronous systems with faults  
│   └── Example: Network delays  
├── CAP Theorem  
│   ├── Trade-off between Consistency, Availability, Partition Tolerance  
│   └── Example: MongoDB (AP), PostgreSQL (CP)  
└── Real-World Example  
    └── Distributed Databases

Summary of Unit 5: Fault Tolerance and Consensus

  • Fault Tolerance ensures system reliability despite faults, using techniques like redundancy and recovery.

  • Distributed Commit Protocols (2PC, 3PC) ensure atomicity in distributed transactions.

  • Byzantine Fault Tolerance handles malicious behavior in distributed systems.

  • Impossibility Results (FLP, CAP) define theoretical limits in fault-tolerant systems.


Final Mind Map for Unit 5:

Unit 5: Fault Tolerance and Consensus  
├── Fault Tolerance  
│   ├── Definition, Key Concepts (Fault, Error, Failure)  
│   └── Techniques (Redundancy, Checkpointing, Recovery)  
├── Distributed Commit Protocols  
│   ├── Two-Phase Commit (2PC)  
│   ├── Three-Phase Commit (3PC)  
│   └── Comparison (Phases, Blocking, Use Case)  
├── Byzantine Fault Tolerance (BFT)  
│   ├── Definition, How It Works (Consensus, Redundancy)  
│   └── Real-World Example (Blockchain Networks)  
└── Impossibilities in Fault Tolerance  
    ├── FLP Impossibility (No consensus in asynchronous systems)  
    ├── CAP Theorem (Trade-off: Consistency, Availability, Partition Tolerance)  
    └── Real-World Example (Distributed Databases)

Last updated