Unit 5
Unit 5: Fault Tolerance and Consensus - Complete Notes
1. Introduction to Fault Tolerance
Definition:
Fault Tolerance is the ability of a system to continue functioning correctly even in the presence of faults (e.g., hardware failures, software bugs, network issues).
Why It’s Important:
Ensures system reliability and availability.
Example: A banking system must continue working even if a server fails.
Key Concepts:
Fault:
A defect or failure in a system component.
Example: A crashed server or a broken network link.
Error:
A deviation from correct behavior due to a fault.
Example: Incorrect data due to a faulty sensor.
Failure:
The system’s inability to perform its intended function.
Example: A website going offline due to a server crash.
Techniques for Fault Tolerance:
Redundancy:
Adding extra components to ensure system functionality.
Example: Replicating data across multiple servers.
Checkpointing:
Periodically saving the system state to recover from failures.
Example: Database systems saving transaction logs.
Recovery:
Restoring the system to a correct state after a failure.
Example: Rebooting a server after a crash.
Mind Map for Introduction to Fault Tolerance:
2. Distributed Commit Protocols
Definition:
Distributed Commit Protocols ensure that a distributed transaction is either fully committed or fully rolled back across all nodes.
Why It’s Important:
Ensures data consistency in distributed systems.
Example: A distributed database must ensure that all nodes agree on whether to commit or abort a transaction.
Types of Commit Protocols:
Two-Phase Commit (2PC):
A coordinator ensures all participants agree to commit or abort.
Phases:
Prepare Phase: Coordinator asks participants if they are ready to commit.
Commit Phase: If all participants agree, the coordinator sends a commit request.
Example: Distributed databases like MySQL Cluster.
Three-Phase Commit (3PC):
Adds a pre-commit phase to reduce blocking in case of failures.
Phases:
Prepare Phase: Same as 2PC.
Pre-Commit Phase: Coordinator ensures all participants are ready to commit.
Commit Phase: Coordinator sends the final commit request.
Example: High-availability systems like Google Spanner.
Comparison:
Feature
Two-Phase Commit (2PC)
Three-Phase Commit (3PC)
Phases
2 (Prepare, Commit)
3 (Prepare, Pre-Commit, Commit)
Blocking
Can block if coordinator fails
Non-blocking due to pre-commit phase
Use Case
Simple distributed systems
High-availability systems
Mind Map for Distributed Commit Protocols:
3. Byzantine Fault Tolerance
Definition:
Byzantine Fault Tolerance (BFT) is the ability of a system to tolerate arbitrary (malicious) faults, where nodes may behave unpredictably or maliciously.
Why It’s Important:
Ensures system reliability even in the presence of malicious nodes.
Example: Blockchain systems must tolerate malicious nodes trying to disrupt the network.
How It Works:
Consensus Algorithms:
Nodes agree on a single value despite malicious behavior.
Example: Practical Byzantine Fault Tolerance (PBFT).
Redundancy:
Multiple nodes verify and agree on the correctness of data.
Example: Replicating transactions across multiple nodes in a blockchain.
Real-World Example:
Blockchain Networks: Bitcoin and Ethereum use BFT to ensure consensus among nodes, even if some nodes are malicious.
Mind Map for Byzantine Fault Tolerance:
4. Impossibilities in Fault Tolerance
Definition:
Impossibilities in Fault Tolerance refer to theoretical limits on what can be achieved in fault-tolerant systems.
Key Impossibility Results:
FLP Impossibility:
It is impossible to achieve consensus in an asynchronous distributed system with even a single faulty process.
Implication: No algorithm can guarantee consensus in all cases.
Example: In a network with delays, consensus may not always be reached.
CAP Theorem:
A distributed system cannot simultaneously provide Consistency, Availability, and Partition Tolerance.
Implication: Systems must trade off between these properties.
Example:
CP Systems: Prioritize Consistency and Partition Tolerance (e.g., distributed databases).
AP Systems: Prioritize Availability and Partition Tolerance (e.g., NoSQL databases).
Real-World Example:
Distributed Databases: MongoDB (AP system) prioritizes availability, while PostgreSQL (CP system) prioritizes consistency.
Mind Map for Impossibilities in Fault Tolerance:
Summary of Unit 5: Fault Tolerance and Consensus
Fault Tolerance ensures system reliability despite faults, using techniques like redundancy and recovery.
Distributed Commit Protocols (2PC, 3PC) ensure atomicity in distributed transactions.
Byzantine Fault Tolerance handles malicious behavior in distributed systems.
Impossibility Results (FLP, CAP) define theoretical limits in fault-tolerant systems.
Final Mind Map for Unit 5:
Last updated