~/src/www.mokhan.ca/xlgmokha [main]
cat database-systems-guide.md
database-systems-guide.md 55434 bytes | 2020-08-01 12:00
symlink: /dev/eng/database-systems-guide.md

Database Systems Guide

This is a collection of notes covering database fundamentals, specific database systems, and design principles.

Database Fundamentals

Types of Databases

  1. Relational Databases (RDBMS)
    • Tables (relations) with rows (records)
    • Structured Query Language (SQL)
    • ACID properties
    • Examples: PostgreSQL, MySQL, Oracle, SQL Server
  2. NoSQL Databases
    • Document stores (MongoDB, CouchDB)
    • Key-value stores (Redis, DynamoDB)
    • Column family (Cassandra, HBase)
    • Graph databases (Neo4j, Amazon Neptune)
  3. NewSQL Databases
    • Combine SQL with NoSQL scalability
    • Examples: Google Spanner, CockroachDB

ACID Properties

Relational databases guarantee ACID properties:

  • Atomic: Transactions are all-or-nothing operations
  • Consistency: Database remains in a consistent state
  • Isolation: Multiple transactions can run concurrently but execute as if run sequentially
  • Durability: Committed changes are permanent on disk

Database Design Principles

Normalization

  • 1NF: Eliminate repeating groups
  • 2NF: Eliminate partial dependencies
  • 3NF: Eliminate transitive dependencies
  • BCNF: Boyce-Codd Normal Form

Indexing

Indexes are auxiliary data structures that act like a table of contents, ordered for efficient searching and sorting.

Types of Indexes:

  • Primary indexes (clustered)
  • Secondary indexes (non-clustered)
  • Composite indexes
  • Partial indexes
  • Expression indexes

Trade-offs:

  • Faster queries vs. slower writes
  • Storage overhead
  • Maintenance cost

PostgreSQL

PostgreSQL is an advanced open-source relational database with extensive features.

Core Concepts

  • Table: Named collection of rows
  • Row: Record with the same set of named columns
  • Column: Specific data type with fixed order
  • Database: Collection of tables
  • Cluster: Collection of databases managed by a single server instance

Data Types

Standard SQL Types

-- Numeric types
int, smallint, bigint
real, double precision
numeric(precision, scale)

-- Character types
char(N)
varchar(N)
text

-- Date/Time types
date
time
timestamp
timestamptz
interval

PostgreSQL-Specific Types

-- Arrays
integer[]
text[]

-- JSON
json
jsonb

-- Network types
inet
cidr
macaddr

-- UUID
uuid

-- Full-text search
tsvector
tsquery

Database Operations

Database Management

# Create database
createdb mydb

# Drop database
dropdb mydb

# Connect to database
psql -U username -d database_name

Table Operations

-- Create table
CREATE TABLE weather (
    city varchar(80),
    temp_lo int,
    temp_hi int,
    prcp real,
    date date
);

-- Drop table
DROP TABLE weather;

-- Alter table
ALTER TABLE weather ADD COLUMN humidity int;
ALTER TABLE weather DROP COLUMN humidity;

Data Manipulation

-- Insert data
INSERT INTO weather (city, temp_lo, temp_hi, prcp, date)
VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');

-- Bulk load from file
COPY weather FROM '/path/to/weather.txt';

-- Update data
UPDATE weather 
SET temp_hi = 55 
WHERE city = 'San Francisco' AND date = '1994-11-27';

-- Delete data
DELETE FROM weather WHERE city = 'San Francisco';

Querying Data

Basic Queries

-- Select all columns
SELECT * FROM weather;

-- Select specific columns
SELECT city, temp_lo, temp_hi, prcp, date FROM weather;

-- Calculated columns
SELECT city, (temp_hi + temp_lo) / 2 AS temp_avg, date FROM weather;

-- Filtering
SELECT * FROM weather 
WHERE city = 'San Francisco' AND prcp > 0.0;

-- Sorting
SELECT * FROM weather ORDER BY city;
SELECT * FROM weather ORDER BY city, temp_lo;

-- Distinct values
SELECT DISTINCT city FROM weather ORDER BY city;

Joins

-- Inner join
SELECT * 
FROM weather 
INNER JOIN cities ON (weather.city = cities.name);

-- Left join
SELECT w.city, w.temp_lo, c.location
FROM weather w
LEFT JOIN cities c ON (w.city = c.name);

-- Self join
SELECT w1.city, w1.date, w2.date
FROM weather w1, weather w2
WHERE w1.temp_lo < w2.temp_lo AND w1.city = w2.city;

Aggregate Functions

-- Basic aggregates
SELECT max(temp_lo) FROM weather;
SELECT min(temp_hi) FROM weather;
SELECT avg(temp_lo) FROM weather;
SELECT count(*) FROM weather;

-- Group by
SELECT city, max(temp_lo)
FROM weather
GROUP BY city;

-- Having clause
SELECT city, max(temp_lo)
FROM weather
GROUP BY city
HAVING max(temp_lo) < 40;

Advanced PostgreSQL Features

Window Functions

SELECT city, temp_lo,
       avg(temp_lo) OVER (PARTITION BY city) as city_avg,
       row_number() OVER (ORDER BY temp_lo DESC) as rank
FROM weather;

Common Table Expressions (CTEs)

WITH regional_sales AS (
    SELECT region, SUM(sales_amount) as total_sales
    FROM sales_data
    GROUP BY region
)
SELECT region, total_sales
FROM regional_sales
WHERE total_sales > 1000000;

JSON Operations

-- JSON column operations
SELECT data->>'name' as name, data->'age' as age
FROM users
WHERE data->>'city' = 'San Francisco';

-- JSON aggregation
SELECT json_agg(json_build_object('name', name, 'age', age))
FROM users;

PostgreSQL Administration

psql Commands

# Connect to database
psql -U postgres

# List databases
\l

# List roles/users
\du

# Connect to database
\c database_name

# List tables
\dt

# Describe table
\d table_name

# Show table data
\x  # Toggle expanded display

Performance Tuning

-- Analyze query performance
EXPLAIN ANALYZE SELECT * FROM large_table WHERE indexed_column = 'value';

-- Create indexes
CREATE INDEX idx_weather_city ON weather(city);
CREATE INDEX idx_weather_date_temp ON weather(date, temp_lo);

-- Vacuum and analyze
VACUUM ANALYZE weather;

MongoDB

MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents.

Core Concepts

  • Document: JSON-like record (BSON format)
  • Collection: Group of related documents (like a table)
  • Database: Container for collections
  • Field: Key-value pair within a document

Connection and Basic Operations

Connection String

mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]

# Examples
mongodb://localhost/people
mongodb://user:pass@cluster.mongodb.net/myapp

Database Operations

// Show current database
db

// Switch database
use mydb

// Show collections
show collections

// Show databases
show dbs

Document Operations (CRUD)

Create (Insert)

// Insert single document
mo = { name: 'mo', age: 30, city: 'Calgary' }
db.people.insert(mo)

// Insert multiple documents
db.people.insertMany([
  { name: 'alice', age: 25, city: 'Vancouver' },
  { name: 'bob', age: 35, city: 'Toronto' }
])

// Insert with specific _id
db.people.insert({ _id: 'user001', name: 'charlie', age: 28 })

Read (Query)

// Find all documents
db.people.find()

// Find with criteria
db.people.find({ name: 'mo' })
db.people.find({ age: { $gt: 25 } })

// Find one document
db.people.findOne({ name: 'alice' })

// Projection (select specific fields)
db.people.find({}, { name: 1, age: 1, _id: 0 })

// Limit results
db.people.find().limit(5)

// Sort results
db.people.find().sort({ age: 1 })  // 1 = ascending, -1 = descending

// Complex query
db.people.find({ 
  age: { $gte: 25, $lt: 40 },
  city: { $in: ['Calgary', 'Vancouver'] }
}).sort({ age: -1 })

Update

// Update single document
db.people.updateOne(
  { name: 'mo' },
  { $set: { age: 31, city: 'Edmonton' } }
)

// Update multiple documents
db.people.updateMany(
  { city: 'Calgary' },
  { $inc: { age: 1 } }
)

// Upsert (insert if not exists)
db.people.updateOne(
  { name: 'david' },
  { $set: { age: 29, city: 'Montreal' } },
  { upsert: true }
)

// Replace document
db.people.replaceOne(
  { name: 'mo' },
  { name: 'mo', age: 32, city: 'Calgary', occupation: 'developer' }
)

Delete

// Delete single document
db.people.deleteOne({ name: 'bob' })

// Delete multiple documents
db.people.deleteMany({ age: { $lt: 25 } })

// Delete all documents in collection
db.people.deleteMany({})

Advanced MongoDB Operations

Aggregation Pipeline

// Group and calculate
db.sales.aggregate([
  { $match: { status: 'completed' } },
  { $group: {
      _id: '$product',
      total: { $sum: '$amount' },
      count: { $sum: 1 }
  }},
  { $sort: { total: -1 } }
])

// Unwind arrays
db.orders.aggregate([
  { $unwind: '$items' },
  { $group: {
      _id: '$items.product',
      totalQuantity: { $sum: '$items.quantity' }
  }}
])

Indexing

// Create index
db.people.createIndex({ name: 1 })
db.people.createIndex({ age: 1, city: 1 })  // Compound index

// List indexes
db.people.getIndexes()

// Drop index
db.people.dropIndex({ name: 1 })

// Text index for search
db.articles.createIndex({ title: 'text', content: 'text' })
db.articles.find({ $text: { $search: 'mongodb tutorial' } })

MongoDB Configuration

.mongorc.js

Create a .mongorc.js file for custom functions:

// Custom helper functions
function findByAge(minAge, maxAge) {
  return db.people.find({ 
    age: { $gte: minAge, $lte: maxAge } 
  }).pretty()
}

function countByCity() {
  return db.people.aggregate([
    { $group: { _id: '$city', count: { $sum: 1 } } },
    { $sort: { count: -1 } }
  ])
}

Database Design Patterns

Relational Patterns

One-to-Many Relationship

-- Users table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    email VARCHAR(100)
);

-- Posts table
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    title VARCHAR(200),
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Many-to-Many Relationship

-- Junction table for users and roles
CREATE TABLE user_roles (
    user_id INTEGER REFERENCES users(id),
    role_id INTEGER REFERENCES roles(id),
    PRIMARY KEY (user_id, role_id)
);

NoSQL Patterns

Embedded Documents

// User with embedded address
{
  _id: ObjectId("..."),
  name: "John Doe",
  email: "john@example.com",
  address: {
    street: "123 Main St",
    city: "Calgary",
    province: "AB",
    postal_code: "T2P 1A1"
  }
}

Referenced Documents

// Separate collections with references
// Users collection
{
  _id: ObjectId("user123"),
  name: "John Doe",
  email: "john@example.com"
}

// Posts collection
{
  _id: ObjectId("post456"),
  user_id: ObjectId("user123"),
  title: "My Blog Post",
  content: "...",
  created_at: ISODate("2023-01-01")
}

Database Performance Optimization

Query Optimization

PostgreSQL

-- Use EXPLAIN to analyze queries
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2023-01-01';

-- Create appropriate indexes
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_orders_customer ON orders(customer_id);

MongoDB

// Use explain() for query analysis
db.orders.find({ order_date: { $gt: ISODate("2023-01-01") } })
         .explain("executionStats")

// Create supporting indexes
db.orders.createIndex({ order_date: 1 })
db.orders.createIndex({ customer_id: 1, order_date: 1 })

Database Monitoring

Key Metrics

  • Query execution time
  • Index usage
  • Connection pool status
  • Disk I/O
  • Memory usage
  • Lock contention

PostgreSQL Monitoring

-- Active queries
SELECT pid, query, state, query_start 
FROM pg_stat_activity 
WHERE state = 'active';

-- Index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Table statistics
SELECT schemaname, tablename, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables;

Best Practices

Security

  1. Authentication and Authorization
    • Use strong passwords
    • Implement role-based access control
    • Principle of least privilege
  2. Data Protection
    • Encrypt data at rest and in transit
    • Regular backups
    • Input validation and parameterized queries
  3. Network Security
    • Firewall configuration
    • VPN or private networks
    • SSL/TLS connections

Performance

  1. Query Optimization
    • Use appropriate indexes
    • Avoid N+1 queries
    • Optimize JOIN operations
    • Use query analysis tools
  2. Schema Design
    • Normalize appropriately
    • Consider denormalization for read-heavy workloads
    • Use appropriate data types
  3. Connection Management
    • Use connection pooling
    • Monitor connection limits
    • Implement proper timeout settings

Backup and Recovery

  1. Backup Strategies
    • Regular automated backups
    • Test backup restoration
    • Multiple backup locations
    • Point-in-time recovery capability
  2. High Availability
    • Master-slave replication
    • Cluster configuration
    • Failover procedures
    • Load balancing

This guide provides a comprehensive foundation for working with database systems, covering both relational and NoSQL approaches with practical examples and best practices.