Unit 5

Unit 5: Fault Tolerance and Consensus - Complete Notes


1. Introduction to Fault Tolerance

Definition:

  • Fault Tolerance is the ability of a system to continue functioning correctly even in the presence of faults (e.g., hardware failures, software bugs, network issues).

Why It’s Important:

  • Ensures system reliability and availability.

  • Example: A banking system must continue working even if a server fails.

Key Concepts:

  1. Fault:

    • A defect or failure in a system component.

    • Example: A crashed server or a broken network link.

  2. Error:

    • A deviation from correct behavior due to a fault.

    • Example: Incorrect data due to a faulty sensor.

  3. Failure:

    • The system’s inability to perform its intended function.

    • Example: A website going offline due to a server crash.

Techniques for Fault Tolerance:

  1. Redundancy:

    • Adding extra components to ensure system functionality.

    • Example: Replicating data across multiple servers.

  2. Checkpointing:

    • Periodically saving the system state to recover from failures.

    • Example: Database systems saving transaction logs.

  3. Recovery:

    • Restoring the system to a correct state after a failure.

    • Example: Rebooting a server after a crash.


Mind Map for Introduction to Fault Tolerance:


2. Distributed Commit Protocols

Definition:

  • Distributed Commit Protocols ensure that a distributed transaction is either fully committed or fully rolled back across all nodes.

Why It’s Important:

  • Ensures data consistency in distributed systems.

  • Example: A distributed database must ensure that all nodes agree on whether to commit or abort a transaction.

Types of Commit Protocols:

  1. Two-Phase Commit (2PC):

    • A coordinator ensures all participants agree to commit or abort.

    • Phases:

      1. Prepare Phase: Coordinator asks participants if they are ready to commit.

      2. Commit Phase: If all participants agree, the coordinator sends a commit request.

    • Example: Distributed databases like MySQL Cluster.

  2. Three-Phase Commit (3PC):

    • Adds a pre-commit phase to reduce blocking in case of failures.

    • Phases:

      1. Prepare Phase: Same as 2PC.

      2. Pre-Commit Phase: Coordinator ensures all participants are ready to commit.

      3. Commit Phase: Coordinator sends the final commit request.

    • Example: High-availability systems like Google Spanner.

Comparison:

Feature

Two-Phase Commit (2PC)

Three-Phase Commit (3PC)

Phases

2 (Prepare, Commit)

3 (Prepare, Pre-Commit, Commit)

Blocking

Can block if coordinator fails

Non-blocking due to pre-commit phase

Use Case

Simple distributed systems

High-availability systems


Mind Map for Distributed Commit Protocols:


3. Byzantine Fault Tolerance

Definition:

  • Byzantine Fault Tolerance (BFT) is the ability of a system to tolerate arbitrary (malicious) faults, where nodes may behave unpredictably or maliciously.

Why It’s Important:

  • Ensures system reliability even in the presence of malicious nodes.

  • Example: Blockchain systems must tolerate malicious nodes trying to disrupt the network.

How It Works:

  1. Consensus Algorithms:

    • Nodes agree on a single value despite malicious behavior.

    • Example: Practical Byzantine Fault Tolerance (PBFT).

  2. Redundancy:

    • Multiple nodes verify and agree on the correctness of data.

    • Example: Replicating transactions across multiple nodes in a blockchain.

Real-World Example:

  • Blockchain Networks: Bitcoin and Ethereum use BFT to ensure consensus among nodes, even if some nodes are malicious.


Mind Map for Byzantine Fault Tolerance:


4. Impossibilities in Fault Tolerance

Definition:

  • Impossibilities in Fault Tolerance refer to theoretical limits on what can be achieved in fault-tolerant systems.

Key Impossibility Results:

  1. FLP Impossibility:

    • It is impossible to achieve consensus in an asynchronous distributed system with even a single faulty process.

    • Implication: No algorithm can guarantee consensus in all cases.

    • Example: In a network with delays, consensus may not always be reached.

  2. CAP Theorem:

    • A distributed system cannot simultaneously provide Consistency, Availability, and Partition Tolerance.

    • Implication: Systems must trade off between these properties.

    • Example:

      • CP Systems: Prioritize Consistency and Partition Tolerance (e.g., distributed databases).

      • AP Systems: Prioritize Availability and Partition Tolerance (e.g., NoSQL databases).

Real-World Example:

  • Distributed Databases: MongoDB (AP system) prioritizes availability, while PostgreSQL (CP system) prioritizes consistency.


Mind Map for Impossibilities in Fault Tolerance:


Summary of Unit 5: Fault Tolerance and Consensus

  • Fault Tolerance ensures system reliability despite faults, using techniques like redundancy and recovery.

  • Distributed Commit Protocols (2PC, 3PC) ensure atomicity in distributed transactions.

  • Byzantine Fault Tolerance handles malicious behavior in distributed systems.

  • Impossibility Results (FLP, CAP) define theoretical limits in fault-tolerant systems.


Final Mind Map for Unit 5:


Last updated