This is a collection of notes covering distributed systems design, patterns, and architectural principles.
Fundamentals
What are Distributed Systems?
A distributed system is a collection of independent computers that appear to users as a single coherent system. The key characteristics include:
- Multiple autonomous nodes working together
- Communication over network (message passing)
- Shared resources and coordinated actions
- Fault tolerance and resilience
- Scalability to handle increased load
The 8 Fallacies of Distributed Computing
Commonly attributed to L. Peter Deutsch and colleagues at Sun Microsystems (with the eighth credited to James Gosling), these are false assumptions developers tend to make:
- The network is reliable - Networks fail and packets get lost
- Latency is zero - Network calls are orders of magnitude slower than local calls
- Bandwidth is infinite - In practice, bandwidth is finite and expensive
- The network is secure - Networks are inherently insecure
- Topology doesn't change - Network topology changes frequently
- There is one administrator - In reality there are multiple administrators with different priorities
- Transport cost is zero - Moving data costs time and money
- The network is homogeneous - Real networks mix different hardware, protocols, and vendors
CAP Theorem
The CAP theorem states that in a distributed system, you can only guarantee two of the following three properties:
- Consistency: Every read sees the most recent write, so all nodes appear to hold the same data at the same time
- Availability: Every request receives a response, even while some nodes are down
- Partition Tolerance: The system keeps operating even when network failures split it into isolated groups
Trade-offs:
- CP Systems: Consistent and Partition-tolerant (e.g., HBase, MongoDB)
- AP Systems: Available and Partition-tolerant (e.g., Cassandra, DynamoDB)
- CA Systems: Consistent and Available (traditional single-node RDBMS; impractical for distributed systems, since network partitions cannot be ruled out and must be tolerated)
Distributed System Patterns
Leader Election
When you have a group of servers providing a service, they can elect a leader to coordinate activities.
Benefits:
- Centralized decision making
- Avoid split-brain scenarios
- Coordinate distributed operations
Common Algorithms:
- Bully Algorithm: Highest ID wins (see the sketch after this list)
- Ring Algorithm: Nodes arranged in logical ring
- Raft: Modern consensus algorithm
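As a rough, in-process sketch of the bully algorithm's "highest ID wins" rule (peer discovery, timeouts, and real message passing are omitted, and the peer objects here are just other nodes in the same process):

class BullyNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.peers = []          # other BullyNode instances (in-process simulation)
        self.alive = True
        self.leader_id = None

    def is_alive(self):
        return self.alive

    def start_election(self):
        # Find every live node that outranks this one
        higher = [p for p in self.peers if p.node_id > self.node_id and p.is_alive()]
        if not higher:
            # Nobody outranks this node: it declares itself leader
            self.announce_leader()
        else:
            # Hand the election to the highest-ID live node
            max(higher, key=lambda p: p.node_id).start_election()

    def announce_leader(self):
        self.leader_id = self.node_id
        for p in self.peers:
            p.leader_id = self.node_id

In a real deployment the liveness checks and the leader announcement would be network messages with timeouts rather than direct method calls.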
L: Leader
F: Follower

        database
           |
    L ----------- F
    |             |
    F ----------- F
Implementation Considerations:
- Leader failure detection (see the heartbeat sketch after this list)
- Re-election process
- Preventing split-brain
- Health checking mechanisms
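A common way to cover leader failure detection and health checking is a heartbeat: the leader reports in periodically, and followers start a re-election if they have not heard from it within a timeout. A minimal sketch (the class and timeout value here are illustrative):

import time

class HeartbeatMonitor:
    def __init__(self, election_timeout=5.0):
        self.election_timeout = election_timeout
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat message arrives from the current leader
        self.last_heartbeat = time.monotonic()

    def leader_is_suspect(self):
        # If the leader has been silent longer than the timeout, trigger a re-election
        return time.monotonic() - self.last_heartbeat > self.election_timeout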
Map-Reduce
Map-Reduce is a programming model for processing large datasets in parallel across distributed clusters.
Two Main Steps:
- Map: Transform data into key/value pairs
- Reduce: Aggregate key/value pairs into final results
Input Data → Map Phase → Shuffle/Sort → Reduce Phase → Output
Map (Document A) → (word, count) pairs
Map (Document B) → (word, count) pairs
Map (Document C) → (word, count) pairs
↓
Shuffle/Group
↓
Reduce (word, [counts]) → (word, total_count)
Example: Word Count
# Map function: emit one (word, 1) pair per word in the document
def map_function(document):
    for word in document.split():
        emit(word.lower(), 1)   # emit() is supplied by the MapReduce framework

# Reduce function: sum all counts collected for a word
def reduce_function(word, counts):
    emit(word, sum(counts))
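The emit-based functions above are framework-style pseudocode; the snippet below simulates the same word-count pipeline end to end in plain Python so the shuffle/group step is visible.

from collections import defaultdict

def word_count(documents):
    # Map: produce (word, 1) pairs from every document
    mapped = [(word.lower(), 1) for doc in documents for word in doc.split()]

    # Shuffle/Group: gather all counts for the same word together
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}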
Use Cases:
- Log analysis
- Data transformation
- Machine learning preprocessing
- Large-scale data processing
Pub/Sub (Publish-Subscribe)
Pub/Sub enables asynchronous communication between services through message brokers.
Publisher → Topic/Queue → Subscriber
        ↘               ↙
         Message Broker
        ↗               ↘
Publisher → Topic/Queue → Subscriber
Benefits:
- Loose coupling between components
- Asynchronous processing
- Scalability through horizontal scaling
- Fault tolerance through message persistence
Implementation Patterns:
# Publisher: wrap the payload in an event envelope and hand it to the broker
# (broker, send_welcome_email, and update_analytics are placeholders for real clients)
def publish_event(topic, message):
    broker.publish(topic, {
        'event_type': 'user_created',
        'user_id': 12345,
        'timestamp': datetime.now(),
        'data': message
    })

# Subscriber: react to 'user_created' events
def handle_user_created(message):
    user_id = message['user_id']
    send_welcome_email(user_id)
    update_analytics(user_id)

broker.subscribe('user_events', handle_user_created)
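The broker object above is assumed to exist; an in-memory stand-in like the following is enough to see the pattern, while a real broker such as Kafka or RabbitMQ adds persistence, acknowledgements, and delivery guarantees.

from collections import defaultdict

class InMemoryBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of handler functions

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler registered for this topic
        for handler in self.subscribers[topic]:
            handler(message)

broker = InMemoryBroker()
broker.subscribe('user_events', lambda msg: print('got', msg))
broker.publish('user_events', {'event_type': 'user_created', 'user_id': 12345})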
Load Balancing
Distribute incoming requests across multiple servers to prevent overload.
Types of Load Balancing:
Round Robin
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
Weighted Round Robin
Server A (weight: 3) gets 3 of every 6 requests
Server B (weight: 2) gets 2 of every 6 requests
Server C (weight: 1) gets 1 of every 6 requests
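Both strategies reduce to a simple selection function. The sketch below treats weighted round robin as plain round robin over a pool in which each server appears as many times as its weight; this is one of several possible implementations.

from itertools import cycle

servers = ['A', 'B', 'C']
weights = {'A': 3, 'B': 2, 'C': 1}

# Round robin: cycle through the servers in order
round_robin = cycle(servers)

# Weighted round robin: repeat each server according to its weight
weighted_pool = [s for s in servers for _ in range(weights[s])]
weighted_round_robin = cycle(weighted_pool)

print([next(round_robin) for _ in range(4)])           # ['A', 'B', 'C', 'A']
print([next(weighted_round_robin) for _ in range(6)])  # ['A', 'A', 'A', 'B', 'B', 'C']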
Least Connections
Route to server with fewest active connections
Health-based Routing
def route_request(request):
    # Only consider servers that currently pass their health checks
    healthy_servers = [s for s in servers if s.is_healthy()]
    if not healthy_servers:
        return error_response()
    # Among the healthy servers, pick the least loaded one
    return least_loaded_server(healthy_servers).handle(request)
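The servers, error_response, and least_loaded_server names above are placeholders; a least-connections version of that helper, assuming each server object tracks its in-flight requests, might look like this:

def least_loaded_server(candidates):
    # Least connections: pick the server with the fewest active connections
    return min(candidates, key=lambda s: s.active_connections)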
Sharding
Distribute data across multiple databases to improve performance and scalability.
Sharding Strategies:
Hash-based Sharding
def get_shard(user_id, num_shards):
    # The same key always maps to the same shard (as long as num_shards is unchanged)
    return hash(user_id) % num_shards

# e.g., for a given num_shards, user 12345 might land on shard 1 and user 67890 on shard 2
Range-based Sharding
Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
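A hedged sketch of how a range-based lookup might work, using binary search over the range boundaries (the boundaries mirror the example above):

import bisect

# Upper bound of each range, in ascending order, and the shard that owns it
range_uppers = [1000, 2000, 3000]
range_shards = ['shard_1', 'shard_2', 'shard_3']

def get_shard_by_range(user_id):
    index = bisect.bisect_left(range_uppers, user_id)
    if index >= len(range_shards):
        raise ValueError(f"user_id {user_id} is outside all configured ranges")
    return range_shards[index]

print(get_shard_by_range(1500))   # shard_2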
Directory-based Sharding
# Lookup table mapping key ranges to physical shards
shard_directory = {
    'users_1_1000': 'shard_1',
    'users_1001_2000': 'shard_2',
    'users_2001_3000': 'shard_3'
}
Challenges:
- Cross-shard queries (see the scatter-gather sketch after this list)
- Rebalancing data
- Hotspots
- Maintaining referential integrity
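Cross-shard queries are expensive because any query that is not keyed by the shard key has to be scattered to every shard and the partial results merged. A simplified sketch, where shard.query_users is a placeholder for the per-shard query:

def find_users_by_country(shards, country):
    # Scatter: run the same query on every shard
    partial_results = [shard.query_users(country=country) for shard in shards]
    # Gather: merge the per-shard results into one list
    return [user for result in partial_results for user in result]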
Consensus Algorithms
Raft Consensus
Raft is designed to be more understandable than Paxos while achieving the same goals.
Key Concepts:
- Leader: Handles all client requests
- Follower: Passive, responds to leader/candidate requests
- Candidate: Seeks votes to become leader
Raft Process:
- Leader Election: When no leader exists, followers become candidates
- Log Replication: Leader replicates log entries to followers
- Safety: Ensures consistency across the cluster
class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = 'follower'  # follower, candidate, leader
        self.current_term = 0
        self.voted_for = None
        self.log = []

    def request_vote(self, term, candidate_id):
        # Reject candidates from a stale term
        if term < self.current_term:
            return False
        # A newer term demotes this node to follower and clears its vote
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = 'follower'
        # Grant the vote if we have not voted for anyone else this term
        # (full Raft also checks that the candidate's log is at least as up to date)
        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True
        return False
Byzantine Fault Tolerance
Handle nodes that may behave maliciously or send conflicting information.
Byzantine Generals Problem:
- Multiple generals must coordinate attack
- Some generals may be traitors
- Need consensus despite malicious actors
Practical Byzantine Fault Tolerance (PBFT):
- Tolerates up to f = (n − 1) / 3 Byzantine failures (rounded down)
- Requires n = 3f + 1 nodes to tolerate f failures (see the arithmetic below)
- Used in blockchain and cryptocurrency systems
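The relationship between cluster size and fault tolerance is simple arithmetic:

def max_byzantine_failures(n):
    # PBFT tolerates f Byzantine nodes when n >= 3f + 1
    return (n - 1) // 3

for n in (4, 7, 10):
    print(n, "nodes tolerate", max_byzantine_failures(n), "Byzantine failures")
# 4 nodes tolerate 1, 7 tolerate 2, 10 tolerate 3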
Consistency Models
Strong Consistency
All nodes see the same data at the same time.
# Example: banking transfer; both writes commit atomically or not at all
def transfer_money(from_account, to_account, amount):
    with transaction():
        debit(from_account, amount)
        credit(to_account, amount)
    # Every node sees the consistent state as soon as the transaction commits
Eventual Consistency
System becomes consistent over time, but may be inconsistent temporarily.
# Example: social media likes
def like_post(user_id, post_id):
    # Update the local datacenter immediately
    local_db.increment_likes(post_id)
    # Propagate to other datacenters asynchronously
    async_replicate_to_all_datacenters(post_id, increment=1)
Causal Consistency
Maintains order of causally related operations.
User A posts message → User B replies to message
(This order must be preserved across all nodes)
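One way such ordering can be enforced is to attach the IDs of causally prior operations to each write and delay applying a write until its dependencies have been applied locally. A deliberately simplified sketch (real systems typically track causality with vector clocks or similar metadata):

class CausalStore:
    def __init__(self):
        self.applied = set()   # IDs of operations already applied locally
        self.pending = []      # operations still waiting on their dependencies

    def receive(self, op_id, depends_on, apply_fn):
        self.pending.append((op_id, depends_on, apply_fn))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for item in list(self.pending):
                op_id, depends_on, apply_fn = item
                # Apply only once every causally prior operation has been applied
                if all(dep in self.applied for dep in depends_on):
                    apply_fn()
                    self.applied.add(op_id)
                    self.pending.remove(item)
                    progress = True

store = CausalStore()
store.receive('reply_from_B', depends_on=['post_from_A'], apply_fn=lambda: print("B's reply"))
store.receive('post_from_A', depends_on=[], apply_fn=lambda: print("A's post"))
# A's post is applied first, then B's reply, even though they arrived out of order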
Fault Tolerance Patterns
Circuit Breaker
Prevent cascading failures by detecting faults and encapsulating the logic for handling them.
import time

class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the breaker is open."""
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # After the timeout, allow a single trial call through
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenException()
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
Bulkhead
Isolate critical resources to prevent failure cascade.
# Separate connection pools for different services
# (ConnectionPool stands in for whatever pooling library you use)
class ServiceClients:
    def __init__(self):
        self.user_service_pool = ConnectionPool(
            max_connections=10,
            service='user-service'
        )
        self.order_service_pool = ConnectionPool(
            max_connections=5,
            service='order-service'
        )
        self.payment_service_pool = ConnectionPool(
            max_connections=3,
            service='payment-service'
        )
Timeout and Retry
Handle temporary failures with appropriate timeouts and retry logic.
import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TemporaryException:  # placeholder for whichever exceptions are safely retryable
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
Monitoring and Observability
The Three Pillars
Metrics
Numerical measurements over time:
# Example metrics
request_count.increment()
response_time.record(duration)
error_rate.set(errors / total_requests)
active_connections.gauge(current_connections)
Logs
Discrete events with context:
# Structured log event (assumes a structured logger that accepts a context dict)
logger.info("User login successful", {
    'user_id': user.id,
    'ip_address': request.ip,
    'timestamp': datetime.now(),
    'session_id': session.id
})
Traces
Request paths through distributed systems:
@trace_function
def process_order(order_id):
    order = get_order(order_id)        # Span 1: load the order
    user = get_user(order.user_id)     # Span 2
    validate_payment(order.payment)    # Span 3
    update_inventory(order.items)      # Span 4
    send_confirmation(user.email)      # Span 5
Health Checks
Implement comprehensive health checks for all services:
class HealthChecker:
    def __init__(self):
        self.checks = []

    def add_check(self, name, check_func):
        # check_func should return True when the dependency is healthy
        self.checks.append((name, check_func))

    def get_health_status(self):
        results = {}
        overall_healthy = True
        for name, check_func in self.checks:
            try:
                result = check_func()
                results[name] = {
                    'status': 'healthy' if result else 'unhealthy',
                    'timestamp': datetime.now()
                }
                if not result:
                    overall_healthy = False
            except Exception as e:
                results[name] = {
                    'status': 'error',
                    'error': str(e),
                    'timestamp': datetime.now()
                }
                overall_healthy = False
        return {
            'overall_status': 'healthy' if overall_healthy else 'unhealthy',
            'checks': results
        }

# Usage (db, redis, and api_client are whatever clients the service already uses)
health_checker = HealthChecker()
health_checker.add_check('database', lambda: db.ping())
health_checker.add_check('cache', lambda: redis.ping())
health_checker.add_check('external_api', lambda: api_client.health_check())
Best Practices
Design Principles
- Design for Failure: Assume components will fail
- Loose Coupling: Minimize dependencies between services
- High Cohesion: Keep related functionality together
- Idempotency: Operations should be safely retryable (see the sketch after this list)
- Graceful Degradation: Maintain core functionality during failures
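Idempotency is easy to state and easy to get wrong. One common technique is to have clients attach an idempotency key so a retried request is detected and not reprocessed. A minimal sketch (the in-memory dict and do_charge are placeholders; a real system would use shared, durable storage):

processed = {}   # idempotency_key -> previously returned result

def charge_card(idempotency_key, account_id, amount):
    # A retry with the same key returns the original result instead of charging twice
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = do_charge(account_id, amount)   # placeholder for the real side effect
    processed[idempotency_key] = result
    return result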
Performance Considerations
- Minimize Network Calls: Batch operations when possible
- Cache Frequently Used Data: Reduce database load
- Asynchronous Processing: Don’t block user operations
- Connection Pooling: Reuse expensive connections
- Data Locality: Keep related data close together
Security
- Defense in Depth: Multiple layers of security
- Principle of Least Privilege: Minimal necessary permissions
- Encrypt in Transit and at Rest: Protect sensitive data
- Input Validation: Sanitize all external inputs
- Audit Logging: Track all security-relevant events
Operational Excellence
- Infrastructure as Code: Version control your infrastructure
- Continuous Integration/Deployment: Automate deployments
- Chaos Engineering: Test failure scenarios regularly
- Capacity Planning: Plan for growth and traffic spikes
- Disaster Recovery: Have plans and test them regularly
Conclusion
Distributed systems are complex but essential for building scalable, resilient applications. Key takeaways:
- Understand the trade-offs between consistency, availability, and partition tolerance
- Design for failure from the beginning
- Use proven patterns and algorithms
- Monitor everything and plan for operational excellence
- Start simple and add complexity only when needed
Success in distributed systems comes from understanding these fundamental concepts and applying them thoughtfully to solve real-world problems.