
Distributed Systems Guide

This is a collection of notes covering distributed systems design, patterns, and architectural principles.

Fundamentals

What are Distributed Systems?

A distributed system is a collection of independent computers that appear to users as a single coherent system. The key characteristics include:

  • Multiple autonomous nodes working together
  • Communication over network (message passing)
  • Shared resources and coordinated actions
  • Fault tolerance and resilience
  • Scalability to handle increased load

The 8 Fallacies of Distributed Computing

These are common false assumptions developers make about networked systems. The first seven were formulated by L Peter Deutsch and colleagues at Sun Microsystems; James Gosling later added the eighth:

  1. The network is reliable - Networks fail and packets get lost
  2. Latency is zero - Network calls are orders of magnitude slower than local calls
  3. Bandwidth is infinite - Bandwidth is finite and expensive
  4. The network is secure - Networks are inherently insecure
  5. Topology doesn't change - Network topology changes frequently
  6. There is one administrator - Multiple administrators with different priorities
  7. Transport cost is zero - Moving data costs time and money
  8. The network is homogeneous - Different hardware, protocols, and vendors

CAP Theorem

The CAP theorem states that a distributed data store can guarantee at most two of the following three properties at the same time. In practice, network partitions cannot be ruled out, so the real trade-off is between consistency and availability while a partition is in progress:

  • Consistency: Every read sees the most recent write; all nodes agree on the data
  • Availability: Every request receives a non-error response, even if it does not reflect the latest write
  • Partition Tolerance: The system keeps operating when messages between nodes are lost or delayed

Trade-offs:

  • CP Systems: Consistent and Partition-tolerant; reject or delay requests during a partition (e.g., HBase, MongoDB with majority writes)
  • AP Systems: Available and Partition-tolerant; may serve stale data during a partition (e.g., Cassandra, DynamoDB)
  • CA Systems: Consistent and Available; only achievable when partitions cannot occur, such as a single-node RDBMS, so not a realistic choice for a distributed deployment

Distributed System Patterns

Leader Election

When you have a group of servers providing a service, they can elect a leader to coordinate activities.

Benefits:

  • Centralized decision making
  • Avoid split-brain scenarios
  • Coordinate distributed operations

Common Algorithms:

  • Bully Algorithm: Highest ID wins (see the sketch at the end of this section)
  • Ring Algorithm: Nodes arranged in logical ring
  • Raft: Modern consensus algorithm

L: Leader
F: Follower

       database
          |
    L ----------- F
    |             |
    F ----------- F

Implementation Considerations:

  • Leader failure detection
  • Re-election process
  • Preventing split-brain
  • Health checking mechanisms
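
As promised above, here is a minimal, single-process sketch of the Bully algorithm's core idea. The node ids and the `peers` structure are hypothetical; a real implementation would exchange election messages over the network and handle timeouts instead of calling peers directly.

class BullyNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # the other BullyNode instances
        self.leader_id = None

    def alive(self):
        return True                 # a real node would answer a health probe

    def start_election(self):
        # Find live nodes with a higher id than this one
        higher = [p for p in self.peers if p.node_id > self.node_id and p.alive()]
        if not higher:
            # No higher-id node is alive: this node wins and announces itself
            self.announce(self.node_id)
        else:
            # Defer to the highest-id live node, which continues the election
            max(higher, key=lambda p: p.node_id).start_election()

    def announce(self, leader_id):
        self.leader_id = leader_id
        for p in self.peers:
            p.leader_id = leader_id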

Map-Reduce

Map-Reduce is a programming model for processing large datasets in parallel across distributed clusters.

Two Main Steps:

  1. Map: Transform data into key/value pairs
  2. Reduce: Aggregate key/value pairs into final results

Input Data → Map Phase → Shuffle/Sort → Reduce Phase → Output

Map (Document A) → (word, count) pairs
Map (Document B) → (word, count) pairs
Map (Document C) → (word, count) pairs
                      ↓
                 Shuffle/Group
                      ↓
Reduce (word, [counts]) → (word, total_count)

Example: Word Count

# Map function: called once per input document by the framework;
# emit() is supplied by the MapReduce runtime
def map_function(document):
    for word in document.split():
        emit(word.lower(), 1)

# Reduce function: called once per unique key with all of its values
def reduce_function(word, counts):
    emit(word, sum(counts))
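
To make the shuffle step concrete, the same word count can be run locally in plain Python, with the grouping done by a dictionary instead of a framework (a sketch for illustration only):

from collections import defaultdict

def map_phase(documents):
    # Map: each document becomes a stream of (word, 1) pairs
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/Group: collect every count emitted for the same word
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: collapse each word's counts into a total
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}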

Use Cases:

  • Log analysis
  • Data transformation
  • Machine learning preprocessing
  • Large-scale data processing

Pub/Sub (Publish-Subscribe)

Pub/Sub enables asynchronous communication between services through message brokers.

Publisher A ─┐                        ┌─→ Subscriber 1
             ├─→ Message Broker ──────┤
Publisher B ─┘   (topics / queues)    └─→ Subscriber 2

Benefits:

  • Loose coupling between components
  • Asynchronous processing
  • Scalability through horizontal scaling
  • Fault tolerance through message persistence

Implementation Patterns:

from datetime import datetime

# Publisher: `broker` stands in for any pub/sub client (Kafka, RabbitMQ, ...)
def publish_event(topic, payload):
    broker.publish(topic, {
        'event_type': 'user_created',
        'user_id': payload['user_id'],
        'timestamp': datetime.now().isoformat(),
        'data': payload
    })

# Subscriber: reacts to the event without the publisher knowing or caring
def handle_user_created(message):
    user_id = message['user_id']
    send_welcome_email(user_id)
    update_analytics(user_id)

broker.subscribe('user_events', handle_user_created)

# e.g. publish_event('user_events', {'user_id': 12345})

Load Balancing

Distribute incoming requests across multiple servers to prevent overload.

Types of Load Balancing:

Round Robin

Request 1 → Server A
Request 2 → Server B  
Request 3 → Server C
Request 4 → Server A (cycle repeats)

Weighted Round Robin

Server A (weight: 3) gets 3 requests
Server B (weight: 2) gets 2 requests
Server C (weight: 1) gets 1 request
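
A weighted rotation can be sketched in a few lines by repeating each server according to its weight and cycling through the result (server names and weights match the example above; plain round robin is the special case where every weight is 1):

from itertools import cycle

weights = {'Server A': 3, 'Server B': 2, 'Server C': 1}

# Build the rotation: A, A, A, B, B, C, then repeat
rotation = cycle([server for server, weight in weights.items()
                  for _ in range(weight)])

def route(request):
    # Each call hands the request to the next server in the rotation
    return next(rotation)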

Least Connections

Route to server with fewest active connections

Health-based Routing

def route_request(request):
    healthy_servers = [s for s in servers if s.is_healthy()]
    if not healthy_servers:
        return error_response()
    
    return least_loaded_server(healthy_servers).handle(request)

Sharding

Distribute data across multiple databases to improve performance and scalability.

Sharding Strategies:

Hash-based Sharding

def get_shard(user_id, num_shards):
    # Note: Python's built-in hash() is stable for integers but not for
    # strings across processes; use a deterministic hash (e.g. hashlib.md5)
    # for string keys so placement survives restarts.
    return hash(user_id) % num_shards

# With num_shards = 4:
#   get_shard(12345, 4) -> shard 1
#   get_shard(67890, 4) -> shard 2

Range-based Sharding

Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
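
The range lookup itself is a binary search over each shard's upper bound; this sketch assumes the contiguous ranges shown above:

import bisect

# Upper bound of each shard's user_id range, in order
upper_bounds = [1000, 2000, 3000]
shards = ['shard_1', 'shard_2', 'shard_3']

def get_range_shard(user_id):
    index = bisect.bisect_left(upper_bounds, user_id)
    if index == len(shards):
        raise ValueError('user_id beyond the last range; add a new shard')
    return shards[index]

# get_range_shard(1000) -> 'shard_1', get_range_shard(1001) -> 'shard_2'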

Directory-based Sharding

shard_directory = {
    'users_1_1000': 'shard_1',
    'users_1001_2000': 'shard_2',
    'users_2001_3000': 'shard_3'
}

Challenges:

  • Cross-shard queries
  • Rebalancing data
  • Hotspots
  • Maintaining referential integrity

Consensus Algorithms

Raft Consensus

Raft is designed to be more understandable than Paxos while achieving the same goals.

Key Concepts:

  • Leader: Handles all client requests
  • Follower: Passive, responds to leader/candidate requests
  • Candidate: Seeks votes to become leader

Raft Process:

  1. Leader Election: When a follower stops hearing heartbeats from the leader, it times out and becomes a candidate
  2. Log Replication: The leader appends client commands to its log and replicates the entries to followers
  3. Safety: Election and commit rules ensure every node applies the same log entries in the same order

A simplified sketch of how a node handles a vote request (the real protocol also checks that the candidate's log is at least as up to date as the voter's):

class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = 'follower'      # follower, candidate, or leader
        self.current_term = 0
        self.voted_for = None
        self.log = []

    def request_vote(self, term, candidate_id):
        # Reject candidates from an older term
        if term < self.current_term:
            return False

        # A newer term forces this node back to follower and clears its vote
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = 'follower'

        # Grant at most one vote per term
        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True
        return False

Byzantine Fault Tolerance

Handle nodes that may behave maliciously or send conflicting information.

Byzantine Generals Problem:

  • Multiple generals must coordinate attack
  • Some generals may be traitors
  • Need consensus despite malicious actors

Practical Byzantine Fault Tolerance (PBFT):

  • With n nodes, tolerates up to f = ⌊(n-1)/3⌋ Byzantine failures
  • Equivalently, requires 3f+1 nodes to tolerate f failures (see the quick check below)
  • Used in blockchain and cryptocurrency systems
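
As a quick check of that arithmetic (plain Python, not tied to any PBFT implementation):

def max_byzantine_faults(n):
    # A cluster of n nodes tolerates f faulty nodes where n >= 3f + 1
    return (n - 1) // 3

print(max_byzantine_faults(4))   # 1
print(max_byzantine_faults(7))   # 2
print(max_byzantine_faults(10))  # 3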

Consistency Models

Strong Consistency

All nodes see the same data at the same time.

# Example: Banking system
def transfer_money(from_account, to_account, amount):
    with transaction():
        debit(from_account, amount)
        credit(to_account, amount)
    # All nodes see consistent state immediately

Eventual Consistency

System becomes consistent over time, but may be inconsistent temporarily.

# Example: Social media likes
def like_post(user_id, post_id):
    # Update local datacenter immediately
    local_db.increment_likes(post_id)
    
    # Propagate to other datacenters asynchronously
    async_replicate_to_all_datacenters(post_id, increment=1)

Causal Consistency

Maintains order of causally related operations.

User A posts message → User B replies to message
(This order must be preserved across all nodes)
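
One minimal way to enforce this on a receiving node is to buffer each operation until everything it causally depends on has been applied. The structure below is a hypothetical sketch; real systems typically track causality with vector clocks or session guarantees.

applied = set()    # ids of operations already applied on this node
pending = []       # operations still waiting on a causal dependency

def receive(op):
    # op is a dict: {'id': ..., 'depends_on': id or None, 'apply': callable}
    pending.append(op)
    deliver_ready()

def deliver_ready():
    progress = True
    while progress:
        progress = False
        for op in list(pending):
            if op['depends_on'] is None or op['depends_on'] in applied:
                op['apply']()            # its cause is already visible here
                applied.add(op['id'])
                pending.remove(op)
                progress = True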

Fault Tolerance Patterns

Circuit Breaker

Prevent cascading failures by detecting repeated faults in a downstream dependency and failing fast instead of waiting on every call to time out.

import time

class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
        
    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenException()
                
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise e
            
    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'
        
    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
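
Typical usage wraps a remote call and degrades gracefully while the breaker is open (call_payment_service and the fallback here are hypothetical):

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

try:
    response = breaker.call(call_payment_service, order_id=42)
except CircuitBreakerOpenException:
    response = cached_or_default_response()   # fall back instead of cascading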

Bulkhead

Isolate critical resources to prevent failure cascade.

# Separate connection pools per downstream service (ConnectionPool is a
# placeholder for whatever pooling your HTTP or database client provides)
class ServiceClients:
    def __init__(self):
        self.user_service_pool = ConnectionPool(
            max_connections=10,
            service='user-service'
        )
        self.order_service_pool = ConnectionPool(
            max_connections=5,
            service='order-service'
        )
        self.payment_service_pool = ConnectionPool(
            max_connections=3,
            service='payment-service'
        )

Timeout and Retry

Handle temporary failures with appropriate timeouts and retry logic.

import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TemporaryException as e:
            if attempt == max_retries - 1:
                raise e
            
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Monitoring and Observability

The Three Pillars

Metrics

Numerical measurements over time:

# Example metrics
request_count.increment()
response_time.record(duration)
error_rate.set(errors / total_requests)
active_connections.gauge(current_connections)

Logs

Discrete events with context:

# Structured log event; the `extra` fields end up on the log record
# (the logging framework adds its own timestamp)
logger.info("User login successful", extra={
    'user_id': user.id,
    'ip_address': request.ip,
    'session_id': session.id
})

Traces

Request paths through distributed systems:

@trace_function
def process_order(order_id):
    order = get_order(order_id)          # Span 1: load the order
    user = get_user(order.user_id)       # Span 2
    validate_payment(order.payment)      # Span 3
    update_inventory(order.items)        # Span 4
    send_confirmation(user.email)        # Span 5

Health Checks

Implement comprehensive health checks for all services:

from datetime import datetime

class HealthChecker:
    def __init__(self):
        self.checks = []
        
    def add_check(self, name, check_func):
        self.checks.append((name, check_func))
        
    def get_health_status(self):
        results = {}
        overall_healthy = True
        
        for name, check_func in self.checks:
            try:
                result = check_func()
                results[name] = {
                    'status': 'healthy' if result else 'unhealthy',
                    'timestamp': datetime.now()
                }
                if not result:
                    overall_healthy = False
            except Exception as e:
                results[name] = {
                    'status': 'error',
                    'error': str(e),
                    'timestamp': datetime.now()
                }
                overall_healthy = False
                
        return {
            'overall_status': 'healthy' if overall_healthy else 'unhealthy',
            'checks': results
        }

# Usage
health_checker = HealthChecker()
health_checker.add_check('database', lambda: db.ping())
health_checker.add_check('cache', lambda: redis.ping())
health_checker.add_check('external_api', lambda: api_client.health_check())

Best Practices

Design Principles

  1. Design for Failure: Assume components will fail
  2. Loose Coupling: Minimize dependencies between services
  3. High Cohesion: Keep related functionality together
  4. Idempotency: Operations should be safely retryable (see the sketch after this list)
  5. Graceful Degradation: Maintain core functionality during failures
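
For item 4, one minimal way to make an operation safely retryable is to deduplicate on an idempotency key supplied by the caller. The names here are hypothetical, and a real system would keep the processed keys in a durable, shared store.

processed_keys = set()   # in production: a durable, shared store

def apply_payment(idempotency_key, account_id, amount):
    if idempotency_key in processed_keys:
        return 'already applied'      # retrying the same request is a no-op
    debit(account_id, amount)
    processed_keys.add(idempotency_key)
    return 'applied'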

Performance Considerations

  1. Minimize Network Calls: Batch operations when possible
  2. Cache Frequently Used Data: Reduce database load
  3. Asynchronous Processing: Don’t block user operations
  4. Connection Pooling: Reuse expensive connections
  5. Data Locality: Keep related data close together

Security

  1. Defense in Depth: Multiple layers of security
  2. Principle of Least Privilege: Minimal necessary permissions
  3. Encrypt in Transit and at Rest: Protect sensitive data
  4. Input Validation: Sanitize all external inputs
  5. Audit Logging: Track all security-relevant events

Operational Excellence

  1. Infrastructure as Code: Version control your infrastructure
  2. Continuous Integration/Deployment: Automate deployments
  3. Chaos Engineering: Test failure scenarios regularly
  4. Capacity Planning: Plan for growth and traffic spikes
  5. Disaster Recovery: Have plans and test them regularly

Conclusion

Distributed systems are complex but essential for building scalable, resilient applications. Key takeaways:

  • Understand the trade-offs between consistency, availability, and partition tolerance
  • Design for failure from the beginning
  • Use proven patterns and algorithms
  • Monitor everything and plan for operational excellence
  • Start simple and add complexity only when needed

Success in distributed systems comes from understanding these fundamental concepts and applying them thoughtfully to solve real-world problems.