Unit 5

Unit 5: Fault Tolerance and Consensus - Complete Notes

1. Introduction to Fault Tolerance

Definition:

Fault Tolerance is the ability of a system to continue functioning correctly even in the presence of faults (e.g., hardware failures, software bugs, network issues).

Why It’s Important:

Ensures system reliability and availability.
Example: A banking system must continue working even if a server fails.

Key Concepts:

Fault:
- A defect or failure in a system component.
- Example: A crashed server or a broken network link.
Error:
- A deviation from correct behavior due to a fault.
- Example: Incorrect data due to a faulty sensor.
Failure:
- The system’s inability to perform its intended function.
- Example: A website going offline due to a server crash.

Techniques for Fault Tolerance:

Redundancy:
- Adding extra components to ensure system functionality.
- Example: Replicating data across multiple servers.
Checkpointing:
- Periodically saving the system state to recover from failures.
- Example: Database systems saving transaction logs.
Recovery:
- Restoring the system to a correct state after a failure.
- Example: Rebooting a server after a crash.

Mind Map for Introduction to Fault Tolerance:

Fault Tolerance  
├── Definition  
│   └── System functioning despite faults  
├── Key Concepts  
│   ├── Fault (Example: Crashed server)  
│   ├── Error (Example: Incorrect data)  
│   └── Failure (Example: Website offline)  
└── Techniques  
    ├── Redundancy (Example: Data replication)  
    ├── Checkpointing (Example: Transaction logs)  
    └── Recovery (Example: Server reboot)

2. Distributed Commit Protocols

Definition:

Distributed Commit Protocols ensure that a distributed transaction is either fully committed or fully rolled back across all nodes.

Why It’s Important:

Ensures data consistency in distributed systems.
Example: A distributed database must ensure that all nodes agree on whether to commit or abort a transaction.

Types of Commit Protocols:

Two-Phase Commit (2PC):
- A coordinator ensures all participants agree to commit or abort.
- Phases:
  1. Prepare Phase: Coordinator asks participants if they are ready to commit.
  2. Commit Phase: If all participants agree, the coordinator sends a commit request.
- Example: Distributed databases like MySQL Cluster.
Three-Phase Commit (3PC):
- Adds a pre-commit phase to reduce blocking in case of failures.
- Phases:
  1. Prepare Phase: Same as 2PC.
  2. Pre-Commit Phase: Coordinator ensures all participants are ready to commit.
  3. Commit Phase: Coordinator sends the final commit request.
- Example: High-availability systems like Google Spanner.

Comparison:

Feature

Two-Phase Commit (2PC)

Three-Phase Commit (3PC)

Phases

2 (Prepare, Commit)

3 (Prepare, Pre-Commit, Commit)

Blocking

Can block if coordinator fails

Non-blocking due to pre-commit phase

Use Case

Simple distributed systems

High-availability systems

Mind Map for Distributed Commit Protocols:

Distributed Commit Protocols  
├── Two-Phase Commit (2PC)  
│   ├── Phases: Prepare, Commit  
│   └── Example: MySQL Cluster  
├── Three-Phase Commit (3PC)  
│   ├── Phases: Prepare, Pre-Commit, Commit  
│   └── Example: Google Spanner  
└── Comparison  
    ├── Phases, Blocking, Use Case

3. Byzantine Fault Tolerance

Definition:

Byzantine Fault Tolerance (BFT) is the ability of a system to tolerate arbitrary (malicious) faults, where nodes may behave unpredictably or maliciously.

Why It’s Important:

Ensures system reliability even in the presence of malicious nodes.
Example: Blockchain systems must tolerate malicious nodes trying to disrupt the network.

How It Works:

Consensus Algorithms:
- Nodes agree on a single value despite malicious behavior.
- Example: Practical Byzantine Fault Tolerance (PBFT).
Redundancy:
- Multiple nodes verify and agree on the correctness of data.
- Example: Replicating transactions across multiple nodes in a blockchain.

Real-World Example:

Blockchain Networks: Bitcoin and Ethereum use BFT to ensure consensus among nodes, even if some nodes are malicious.

Mind Map for Byzantine Fault Tolerance:

Byzantine Fault Tolerance (BFT)  
├── Definition  
│   └── Tolerates arbitrary/malicious faults  
├── How It Works  
│   ├── Consensus Algorithms (Example: PBFT)  
│   └── Redundancy (Example: Blockchain)  
└── Real-World Example  
    └── Blockchain Networks (Bitcoin, Ethereum)

4. Impossibilities in Fault Tolerance

Definition:

Impossibilities in Fault Tolerance refer to theoretical limits on what can be achieved in fault-tolerant systems.

Key Impossibility Results:

FLP Impossibility:
- It is impossible to achieve consensus in an asynchronous distributed system with even a single faulty process.
- Implication: No algorithm can guarantee consensus in all cases.
- Example: In a network with delays, consensus may not always be reached.
CAP Theorem:
- A distributed system cannot simultaneously provide Consistency, Availability, and Partition Tolerance.
- Implication: Systems must trade off between these properties.
- Example:
  - CP Systems: Prioritize Consistency and Partition Tolerance (e.g., distributed databases).
  - AP Systems: Prioritize Availability and Partition Tolerance (e.g., NoSQL databases).

Real-World Example:

Distributed Databases: MongoDB (AP system) prioritizes availability, while PostgreSQL (CP system) prioritizes consistency.

Mind Map for Impossibilities in Fault Tolerance:

Impossibilities in Fault Tolerance  
├── FLP Impossibility  
│   ├── No consensus in asynchronous systems with faults  
│   └── Example: Network delays  
├── CAP Theorem  
│   ├── Trade-off between Consistency, Availability, Partition Tolerance  
│   └── Example: MongoDB (AP), PostgreSQL (CP)  
└── Real-World Example  
    └── Distributed Databases

Summary of Unit 5: Fault Tolerance and Consensus

Fault Tolerance ensures system reliability despite faults, using techniques like redundancy and recovery.
Distributed Commit Protocols (2PC, 3PC) ensure atomicity in distributed transactions.
Byzantine Fault Tolerance handles malicious behavior in distributed systems.
Impossibility Results (FLP, CAP) define theoretical limits in fault-tolerant systems.

Final Mind Map for Unit 5:

Unit 5: Fault Tolerance and Consensus  
├── Fault Tolerance  
│   ├── Definition, Key Concepts (Fault, Error, Failure)  
│   └── Techniques (Redundancy, Checkpointing, Recovery)  
├── Distributed Commit Protocols  
│   ├── Two-Phase Commit (2PC)  
│   ├── Three-Phase Commit (3PC)  
│   └── Comparison (Phases, Blocking, Use Case)  
├── Byzantine Fault Tolerance (BFT)  
│   ├── Definition, How It Works (Consensus, Redundancy)  
│   └── Real-World Example (Blockchain Networks)  
└── Impossibilities in Fault Tolerance  
    ├── FLP Impossibility (No consensus in asynchronous systems)  
    ├── CAP Theorem (Trade-off: Consistency, Availability, Partition Tolerance)  
    └── Real-World Example (Distributed Databases)

PreviousUnit 4 Next1’s and 2’s Complement

Last updated 6 months ago