
Distributed Systems Guide

This is a collection of notes covering distributed systems design, patterns, and architectural principles.

Fundamentals

What are Distributed Systems?

A distributed system is a collection of independent computers that appear to users as a single coherent system. The key characteristics include:

  • Multiple autonomous nodes working together
  • Communication over network (message passing)
  • Shared resources and coordinated actions
  • Fault tolerance and resilience
  • Scalability to handle increased load

The 8 Fallacies of Distributed Computing

These are common false assumptions developers make about networked systems. The first seven were formulated by L Peter Deutsch and colleagues at Sun Microsystems; James Gosling later added the eighth:

  1. The network is reliable - Networks fail and packets get lost
  2. Latency is zero - Network calls are orders of magnitude slower than local calls
  3. Bandwidth is infinite - Bandwidth is finite and expensive
  4. The network is secure - Networks are inherently insecure
  5. Topology doesn't change - Network topology changes frequently
  6. There is one administrator - Multiple administrators with different priorities
  7. Transport cost is zero - Moving data costs time and money
  8. The network is homogeneous - Different hardware, protocols, and vendors

CAP Theorem

The CAP theorem states that a distributed data store can guarantee at most two of the following three properties at the same time. In practice, network partitions cannot be ruled out, so the real trade-off is between consistency and availability while a partition is in progress:

  • Consistency: Every read sees the most recent write; all nodes agree on the data
  • Availability: Every request receives a non-error response, even if it does not reflect the latest write
  • Partition Tolerance: The system keeps operating when messages between nodes are lost or delayed

Trade-offs:

  • CP Systems: Consistent and Partition-tolerant; reject or delay requests during a partition (e.g., HBase, MongoDB with majority writes)
  • AP Systems: Available and Partition-tolerant; may serve stale data during a partition (e.g., Cassandra, DynamoDB)
  • CA Systems: Consistent and Available; only achievable when partitions cannot occur, such as a single-node RDBMS, so not a realistic choice for a distributed deployment

Distributed System Patterns

Leader Election

When you have a group of servers providing a service, they can elect a leader to coordinate activities.

Benefits:

  • Centralized decision making
  • Avoid split-brain scenarios
  • Coordinate distributed operations

Common Algorithms:

  • Bully Algorithm: Highest ID wins (see the sketch at the end of this section)
  • Ring Algorithm: Nodes arranged in logical ring
  • Raft: Modern consensus algorithm

L: Leader
F: Follower

       database
          |
    L ----------- F
    |             |
    F ----------- F

Implementation Considerations:

  • Leader failure detection
  • Re-election process
  • Preventing split-brain
  • Health checking mechanisms
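
As promised above, here is a minimal, single-process sketch of the Bully algorithm's core idea. The node ids and the `peers` structure are hypothetical; a real implementation would exchange election messages over the network and handle timeouts instead of calling peers directly.

class BullyNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # the other BullyNode instances
        self.leader_id = None

    def alive(self):
        return True                 # a real node would answer a health probe

    def start_election(self):
        # Find live nodes with a higher id than this one
        higher = [p for p in self.peers if p.node_id > self.node_id and p.alive()]
        if not higher:
            # No higher-id node is alive: this node wins and announces itself
            self.announce(self.node_id)
        else:
            # Defer to the highest-id live node, which continues the election
            max(higher, key=lambda p: p.node_id).start_election()

    def announce(self, leader_id):
        self.leader_id = leader_id
        for p in self.peers:
            p.leader_id = leader_id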

Map-Reduce

Map-Reduce is a programming model for processing large datasets in parallel across distributed clusters.

Two Main Steps:

  1. Map: Transform data into key/value pairs
  2. Reduce: Aggregate key/value pairs into final results

Input Data → Map Phase → Shuffle/Sort → Reduce Phase → Output

Map (Document A) → (word, count) pairs
Map (Document B) → (word, count) pairs
Map (Document C) → (word, count) pairs
                      ↓
                 Shuffle/Group
                      ↓
Reduce (word, [counts]) → (word, total_count)

Example: Word Count

# Map function: called once per input document by the framework;
# emit() is supplied by the MapReduce runtime
def map_function(document):
    for word in document.split():
        emit(word.lower(), 1)

# Reduce function: called once per unique key with all of its values
def reduce_function(word, counts):
    emit(word, sum(counts))
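
To make the shuffle step concrete, the same word count can be run locally in plain Python, with the grouping done by a dictionary instead of a framework (a sketch for illustration only):

from collections import defaultdict

def map_phase(documents):
    # Map: each document becomes a stream of (word, 1) pairs
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/Group: collect every count emitted for the same word
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: collapse each word's counts into a total
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}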

Use Cases:

  • Log analysis
  • Data transformation
  • Machine learning preprocessing
  • Large-scale data processing

Pub/Sub (Publish-Subscribe)

Pub/Sub enables asynchronous communication between services through message brokers.

Publisher A ─┐                        ┌─→ Subscriber 1
             ├─→ Message Broker ──────┤
Publisher B ─┘   (topics / queues)    └─→ Subscriber 2

Benefits:

  • Loose coupling between components
  • Asynchronous processing
  • Scalability through horizontal scaling
  • Fault tolerance through message persistence

Implementation Patterns:

from datetime import datetime

# Publisher: `broker` stands in for any pub/sub client (Kafka, RabbitMQ, ...)
def publish_event(topic, payload):
    broker.publish(topic, {
        'event_type': 'user_created',
        'user_id': payload['user_id'],
        'timestamp': datetime.now().isoformat(),
        'data': payload
    })

# Subscriber: reacts to the event without the publisher knowing or caring
def handle_user_created(message):
    user_id = message['user_id']
    send_welcome_email(user_id)
    update_analytics(user_id)

broker.subscribe('user_events', handle_user_created)

# e.g. publish_event('user_events', {'user_id': 12345})

Load Balancing

Distribute incoming requests across multiple servers to prevent overload.

Types of Load Balancing:

Round Robin

Request 1 → Server A
Request 2 → Server B  
Request 3 → Server C
Request 4 → Server A (cycle repeats)

Weighted Round Robin

Server A (weight: 3) gets 3 requests
Server B (weight: 2) gets 2 requests
Server C (weight: 1) gets 1 request
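
A weighted rotation can be sketched in a few lines by repeating each server according to its weight and cycling through the result (server names and weights match the example above; plain round robin is the special case where every weight is 1):

from itertools import cycle

weights = {'Server A': 3, 'Server B': 2, 'Server C': 1}

# Build the rotation: A, A, A, B, B, C, then repeat
rotation = cycle([server for server, weight in weights.items()
                  for _ in range(weight)])

def route(request):
    # Each call hands the request to the next server in the rotation
    return next(rotation)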

Least Connections

Route to server with fewest active connections

Health-based Routing

def route_request(request):
    healthy_servers = [s for s in servers if s.is_healthy()]
    if not healthy_servers:
        return error_response()
    
    return least_loaded_server(healthy_servers).handle(request)

Sharding

Distribute data across multiple databases to improve performance and scalability.

Sharding Strategies:

Hash-based Sharding

def get_shard(user_id, num_shards):
    # Note: Python's built-in hash() is stable for integers but not for
    # strings across processes; use a deterministic hash (e.g. hashlib.md5)
    # for string keys so placement survives restarts.
    return hash(user_id) % num_shards

# With num_shards = 4:
#   get_shard(12345, 4) -> shard 1
#   get_shard(67890, 4) -> shard 2

Range-based Sharding

Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
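
The range lookup itself is a binary search over each shard's upper bound; this sketch assumes the contiguous ranges shown above:

import bisect

# Upper bound of each shard's user_id range, in order
upper_bounds = [1000, 2000, 3000]
shards = ['shard_1', 'shard_2', 'shard_3']

def get_range_shard(user_id):
    index = bisect.bisect_left(upper_bounds, user_id)
    if index == len(shards):
        raise ValueError('user_id beyond the last range; add a new shard')
    return shards[index]

# get_range_shard(1000) -> 'shard_1', get_range_shard(1001) -> 'shard_2'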

Directory-based Sharding

shard_directory = {
    'users_1_1000': 'shard_1',
    'users_1001_2000': 'shard_2',
    'users_2001_3000': 'shard_3'
}

Challenges:

  • Cross-shard queries
  • Rebalancing data
  • Hotspots
  • Maintaining referential integrity

Consensus Algorithms

Raft Consensus

Raft is designed to be more understandable than Paxos while achieving the same goals.

Key Concepts:

  • Leader: Handles all client requests
  • Follower: Passive, responds to leader/candidate requests
  • Candidate: Seeks votes to become leader

Raft Process:

  1. Leader Election: When a follower stops hearing heartbeats from the leader, it times out and becomes a candidate
  2. Log Replication: The leader appends client commands to its log and replicates the entries to followers
  3. Safety: Election and commit rules ensure every node applies the same log entries in the same order

A simplified sketch of how a node handles a vote request (the real protocol also checks that the candidate's log is at least as up to date as the voter's):

class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = 'follower'      # follower, candidate, or leader
        self.current_term = 0
        self.voted_for = None
        self.log = []

    def request_vote(self, term, candidate_id):
        # Reject candidates from an older term
        if term < self.current_term:
            return False

        # A newer term forces this node back to follower and clears its vote
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = 'follower'

        # Grant at most one vote per term
        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True
        return False

Byzantine Fault Tolerance

Handle nodes that may behave maliciously or send conflicting information.

Byzantine Generals Problem:

  • Multiple generals must coordinate attack
  • Some generals may be traitors
  • Need consensus despite malicious actors

Practical Byzantine Fault Tolerance (PBFT):

  • With n nodes, tolerates up to f = ⌊(n-1)/3⌋ Byzantine failures
  • Equivalently, requires 3f+1 nodes to tolerate f failures (see the quick check below)
  • Used in blockchain and cryptocurrency systems
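
As a quick check of that arithmetic (plain Python, not tied to any PBFT implementation):

def max_byzantine_faults(n):
    # A cluster of n nodes tolerates f faulty nodes where n >= 3f + 1
    return (n - 1) // 3

print(max_byzantine_faults(4))   # 1
print(max_byzantine_faults(7))   # 2
print(max_byzantine_faults(10))  # 3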

Consistency Models

Strong Consistency

All nodes see the same data at the same time.

# Example: Banking system
def transfer_money(from_account, to_account, amount):
    with transaction():
        debit(from_account, amount)
        credit(to_account, amount)
    # All nodes see consistent state immediately

Eventual Consistency

System becomes consistent over time, but may be inconsistent temporarily.

# Example: Social media likes
def like_post(user_id, post_id):
    # Update local datacenter immediately
    local_db.increment_likes(post_id)
    
    # Propagate to other datacenters asynchronously
    async_replicate_to_all_datacenters(post_id, increment=1)

Causal Consistency

Maintains order of causally related operations.

User A posts message → User B replies to message
(This order must be preserved across all nodes)
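
One minimal way to enforce this on a receiving node is to buffer each operation until everything it causally depends on has been applied. The structure below is a hypothetical sketch; real systems typically track causality with vector clocks or session guarantees.

applied = set()    # ids of operations already applied on this node
pending = []       # operations still waiting on a causal dependency

def receive(op):
    # op is a dict: {'id': ..., 'depends_on': id or None, 'apply': callable}
    pending.append(op)
    deliver_ready()

def deliver_ready():
    progress = True
    while progress:
        progress = False
        for op in list(pending):
            if op['depends_on'] is None or op['depends_on'] in applied:
                op['apply']()            # its cause is already visible here
                applied.add(op['id'])
                pending.remove(op)
                progress = True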

Fault Tolerance Patterns

Circuit Breaker

Prevent cascading failures by detecting repeated faults in a downstream dependency and failing fast instead of waiting on every call to time out.

import time

class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
        
    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenException()
                
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise e
            
    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'
        
    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
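
Typical usage wraps a remote call and degrades gracefully while the breaker is open (call_payment_service and the fallback here are hypothetical):

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

try:
    response = breaker.call(call_payment_service, order_id=42)
except CircuitBreakerOpenException:
    response = cached_or_default_response()   # fall back instead of cascading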

Bulkhead

Isolate critical resources to prevent failure cascade.

# Separate connection pools per downstream service (ConnectionPool is a
# placeholder for whatever pooling your HTTP or database client provides)
class ServiceClients:
    def __init__(self):
        self.user_service_pool = ConnectionPool(
            max_connections=10,
            service='user-service'
        )
        self.order_service_pool = ConnectionPool(
            max_connections=5,
            service='order-service'
        )
        self.payment_service_pool = ConnectionPool(
            max_connections=3,
            service='payment-service'
        )

Timeout and Retry

Handle temporary failures with appropriate timeouts and retry logic.

import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TemporaryException as e:
            if attempt == max_retries - 1:
                raise e
            
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Monitoring and Observability

The Three Pillars

Metrics

Numerical measurements over time:

# Example metrics
request_count.increment()
response_time.record(duration)
error_rate.set(errors / total_requests)
active_connections.gauge(current_connections)

Logs

Discrete events with context:

# Structured log event; the `extra` fields end up on the log record
# (the logging framework adds its own timestamp)
logger.info("User login successful", extra={
    'user_id': user.id,
    'ip_address': request.ip,
    'session_id': session.id
})

Traces

Request paths through distributed systems:

@trace_function
def process_order(order_id):
    order = get_order(order_id)          # Span 1: load the order
    user = get_user(order.user_id)       # Span 2
    validate_payment(order.payment)      # Span 3
    update_inventory(order.items)        # Span 4
    send_confirmation(user.email)        # Span 5

Health Checks

Implement comprehensive health checks for all services:

from datetime import datetime

class HealthChecker:
    def __init__(self):
        self.checks = []
        
    def add_check(self, name, check_func):
        self.checks.append((name, check_func))
        
    def get_health_status(self):
        results = {}
        overall_healthy = True
        
        for name, check_func in self.checks:
            try:
                result = check_func()
                results[name] = {
                    'status': 'healthy' if result else 'unhealthy',
                    'timestamp': datetime.now()
                }
                if not result:
                    overall_healthy = False
            except Exception as e:
                results[name] = {
                    'status': 'error',
                    'error': str(e),
                    'timestamp': datetime.now()
                }
                overall_healthy = False
                
        return {
            'overall_status': 'healthy' if overall_healthy else 'unhealthy',
            'checks': results
        }

# Usage
health_checker = HealthChecker()
health_checker.add_check('database', lambda: db.ping())
health_checker.add_check('cache', lambda: redis.ping())
health_checker.add_check('external_api', lambda: api_client.health_check())

Best Practices

Design Principles

  1. Design for Failure: Assume components will fail
  2. Loose Coupling: Minimize dependencies between services
  3. High Cohesion: Keep related functionality together
  4. Idempotency: Operations should be safely retryable (see the sketch after this list)
  5. Graceful Degradation: Maintain core functionality during failures
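
For item 4, one minimal way to make an operation safely retryable is to deduplicate on an idempotency key supplied by the caller. The names here are hypothetical, and a real system would keep the processed keys in a durable, shared store.

processed_keys = set()   # in production: a durable, shared store

def apply_payment(idempotency_key, account_id, amount):
    if idempotency_key in processed_keys:
        return 'already applied'      # retrying the same request is a no-op
    debit(account_id, amount)
    processed_keys.add(idempotency_key)
    return 'applied'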

Performance Considerations

  1. Minimize Network Calls: Batch operations when possible
  2. Cache Frequently Used Data: Reduce database load
  3. Asynchronous Processing: Don’t block user operations
  4. Connection Pooling: Reuse expensive connections
  5. Data Locality: Keep related data close together

Security

  1. Defense in Depth: Multiple layers of security
  2. Principle of Least Privilege: Minimal necessary permissions
  3. Encrypt in Transit and at Rest: Protect sensitive data
  4. Input Validation: Sanitize all external inputs
  5. Audit Logging: Track all security-relevant events

Operational Excellence

  1. Infrastructure as Code: Version control your infrastructure
  2. Continuous Integration/Deployment: Automate deployments
  3. Chaos Engineering: Test failure scenarios regularly
  4. Capacity Planning: Plan for growth and traffic spikes
  5. Disaster Recovery: Have plans and test them regularly

Conclusion

Distributed systems are complex but essential for building scalable, resilient applications. Key takeaways:

  • Understand the trade-offs between consistency, availability, and partition tolerance
  • Design for failure from the beginning
  • Use proven patterns and algorithms
  • Monitor everything and plan for operational excellence
  • Start simple and add complexity only when needed

Success in distributed systems comes from understanding these fundamental concepts and applying them thoughtfully to solve real-world problems.