Run this notebook: Open in Colab Open in Kaggle

SQL Query Optimization: From Slow to Fast in 10 Patterns¶

A query that returns the right answer in 30 seconds is wrong. This notebook covers EXPLAIN plans, indexing strategies, join optimization, and the most common anti-patterns that kill database performance.

1. Setup: Larger Synthetic Tables (10K Rows)¶

We generate realistic data using pandas and load it into SQLite. With 10K rows, the difference between indexed and unindexed queries becomes measurable.

import sqlite3
import pandas as pd
import numpy as np
import timeit
import time

np.random.seed(42)

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

def sql(query):
    return pd.read_sql_query(query, conn)

def explain(query):
    """Run EXPLAIN QUERY PLAN and return as DataFrame."""
    return pd.read_sql_query(f'EXPLAIN QUERY PLAN {query}', conn)

print('SQLite version:', sqlite3.sqlite_version)

# Generate 10K orders, 1K customers, 500 products
N_CUSTOMERS = 1_000
N_PRODUCTS  = 500
N_ORDERS    = 10_000

regions   = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
statuses  = ['completed', 'completed', 'completed', 'returned', 'pending']

customers_df = pd.DataFrame({
    'customer_id': range(1, N_CUSTOMERS + 1),
    'name':        [f'Customer_{i}' for i in range(1, N_CUSTOMERS + 1)],
    'region':      np.random.choice(regions, N_CUSTOMERS),
    'email':       [f'user{i}@example.com' for i in range(1, N_CUSTOMERS + 1)],
    'signup_date': pd.date_range('2020-01-01', periods=N_CUSTOMERS, freq='8h').strftime('%Y-%m-%d')
})

products_df = pd.DataFrame({
    'product_id': range(1, N_PRODUCTS + 1),
    'name':       [f'Product_{i}' for i in range(1, N_PRODUCTS + 1)],
    'category':   np.random.choice(categories, N_PRODUCTS),
    'price':      np.round(np.random.uniform(5, 500, N_PRODUCTS), 2),
    'sku':        [f'SKU-{i:05d}' for i in range(1, N_PRODUCTS + 1)]
})

order_dates = pd.date_range('2023-01-01', '2023-12-31', periods=N_ORDERS)
orders_df = pd.DataFrame({
    'order_id':    range(1, N_ORDERS + 1),
    'customer_id': np.random.randint(1, N_CUSTOMERS + 1, N_ORDERS),
    'product_id':  np.random.randint(1, N_PRODUCTS + 1, N_ORDERS),
    'quantity':    np.random.randint(1, 6, N_ORDERS),
    'order_date':  order_dates.strftime('%Y-%m-%d'),
    'status':      np.random.choice(statuses, N_ORDERS)
})

# Load into SQLite
customers_df.to_sql('customers', conn, if_exists='replace', index=False)
products_df.to_sql('products',   conn, if_exists='replace', index=False)
orders_df.to_sql('orders',       conn, if_exists='replace', index=False)

print(f'Loaded: {len(customers_df)} customers, {len(products_df)} products, {len(orders_df)} orders')

2. EXPLAIN QUERY PLAN: How to Read It¶

Before optimizing, you need to understand what the database is doing. EXPLAIN QUERY PLAN shows how SQLite plans to execute a query.

Key terms:

Term	Meaning
`SCAN table`	Full table scan — reads every row. O(n). Slow on large tables.
`SEARCH table USING INDEX`	Index lookup. O(log n). Fast.
`SEARCH table USING COVERING INDEX`	All needed columns are in the index. Zero table access.
`USE TEMP B-TREE FOR ORDER BY`	Sorting in memory — may indicate missing index
`AUTOMATIC PARTIAL COVERING INDEX`	SQLite created a temporary index at query time

# Full table scan — no index
query_scan = "SELECT * FROM orders WHERE customer_id = 42"

print('EXPLAIN QUERY PLAN — no index on customer_id:')
print('Look for: SCAN orders')
explain(query_scan)

# Create index and compare
cur.execute('CREATE INDEX idx_orders_customer_id ON orders (customer_id)')
conn.commit()

print('EXPLAIN QUERY PLAN — after creating index on customer_id:')
print('Look for: SEARCH orders USING INDEX')
explain(query_scan)

3. Index Types: Single, Composite, Covering¶

Single column index: index on one column. Good for equality and range filters.

Composite index: index on multiple columns. Column order matters — the index is useful for queries that filter on the leftmost prefix.

Covering index: all columns needed by the query are in the index. The database never touches the table — reads only from the index.

Composite index (a, b, c):
  WHERE a = 1               ✅ uses index
  WHERE a = 1 AND b = 2     ✅ uses index
  WHERE b = 2               ❌ skips first column — no index benefit
  WHERE a = 1 AND c = 3     ⚠️  uses index for a, scans for c

# Before composite index
query_composite = """
SELECT order_id, quantity, order_date
FROM orders
WHERE customer_id = 42 AND status = 'completed'
"""

print('Plan BEFORE composite index:')
display(explain(query_composite))

# Create composite index
cur.execute('CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)')
conn.commit()

print('Plan AFTER composite index (customer_id, status):')
display(explain(query_composite))

print('\nNote: column ORDER in composite index matches WHERE clause order')

# Covering index — includes all SELECT columns
query_covering = """
SELECT customer_id, status, order_date
FROM orders
WHERE customer_id = 42
"""

print('Plan with single-column index (partial cover):')
display(explain(query_covering))

# Create covering index
cur.execute('CREATE INDEX idx_orders_cover ON orders (customer_id, status, order_date)')
conn.commit()

print('Plan with COVERING index (customer_id, status, order_date):')
print('Look for: USING COVERING INDEX')
display(explain(query_covering))

4. Join Optimization¶

Join performance depends on:

Filter early — WHERE before joining reduces rows processed
Join order — join the most-filtered (smallest result) table first
Index the join key — foreign keys should always be indexed
**Avoid SELECT *** — only pull columns you need

# Bad: filter AFTER join
query_bad_join = """
SELECT c.name, o.order_date, p.name AS product
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.status = 'completed'
  AND c.region = 'North'
"""

# Better: pre-filter with subquery/CTE before joining
query_good_join = """
WITH north_customers AS (
    SELECT customer_id, name FROM customers WHERE region = 'North'
),
completed_orders AS (
    SELECT order_id, customer_id, product_id, order_date
    FROM orders
    WHERE status = 'completed'
)
SELECT nc.name, co.order_date, p.name AS product
FROM completed_orders co
JOIN north_customers nc USING (customer_id)
JOIN products p ON co.product_id = p.product_id
"""

print('Join plan — filter after join (less optimal):')
display(explain(query_bad_join))

print('Join plan — pre-filter in CTEs (more explicit):')
display(explain(query_good_join))

5. Common Anti-Patterns and Fixes¶

These five patterns collectively account for the majority of slow queries in production.

# --- Anti-pattern 1: Function in WHERE clause defeats index ---

# BAD: wrapping the column in a function prevents index use
query_func_bad = """
SELECT * FROM orders
WHERE strftime('%Y', order_date) = '2023'
"""

# GOOD: use range comparison on the raw column
query_func_good = """
SELECT * FROM orders
WHERE order_date >= '2023-01-01'
  AND order_date <  '2024-01-01'
"""

cur.execute('CREATE INDEX idx_orders_date ON orders (order_date)')
conn.commit()

print('BAD  — function wraps column (cannot use index):')
display(explain(query_func_bad))

print('GOOD — range comparison on raw column (uses index):')
display(explain(query_func_good))

# --- Anti-pattern 2: SELECT * vs SELECT specific columns ---

# Measurement: time SELECT * vs targeted columns
def time_query(query, iterations=100):
    t = timeit.timeit(
        lambda: conn.execute(query).fetchall(),
        number=iterations
    )
    return round(t / iterations * 1000, 3)  # ms per query

q_star  = 'SELECT * FROM orders WHERE customer_id = 42'
q_cols  = 'SELECT order_id, order_date, status FROM orders WHERE customer_id = 42'

t_star = time_query(q_star)
t_cols = time_query(q_cols)

print(f'SELECT *          : {t_star} ms/query')
print(f'SELECT 3 columns  : {t_cols} ms/query')
print(f'Speedup           : {round(t_star/t_cols, 2)}x')
print()
print('SELECT * issues:')
print('  1. Transfers unused columns (network overhead in real databases)')
print('  2. Breaks covering indexes — forces table access')
print('  3. Schema changes break downstream code silently')

# --- Anti-pattern 3: N+1 Query Problem ---

# BAD: loop in Python calling SQL per customer
def n_plus_1_bad(customer_ids):
    results = []
    for cid in customer_ids:
        row = conn.execute(
            'SELECT COUNT(*) FROM orders WHERE customer_id = ?', (cid,)
        ).fetchone()[0]
        results.append((cid, row))
    return results

# GOOD: single JOIN query
def n_plus_1_good(customer_ids):
    placeholders = ','.join('?' * len(customer_ids))
    return conn.execute(
        f'SELECT customer_id, COUNT(*) AS order_count FROM orders '
        f'WHERE customer_id IN ({placeholders}) GROUP BY customer_id',
        customer_ids
    ).fetchall()

sample_ids = list(range(1, 51))  # 50 customers

t_bad  = timeit.timeit(lambda: n_plus_1_bad(sample_ids),  number=20)
t_good = timeit.timeit(lambda: n_plus_1_good(sample_ids), number=20)

print(f'N+1 (50 queries) : {round(t_bad/20*1000, 2)} ms')
print(f'Single JOIN query : {round(t_good/20*1000, 2)} ms')
print(f'Speedup           : {round(t_bad/t_good, 1)}x')
print()
print('N+1 is catastrophic at scale: 10K customers = 10K+1 round trips')

# --- Anti-pattern 4: COUNT(*) vs COUNT(column) ---

# COUNT(*) counts all rows including NULLs
# COUNT(column) counts only non-NULL values — different semantics

# Add a nullable column to demonstrate
cur.execute('ALTER TABLE orders ADD COLUMN discount REAL')
cur.execute('UPDATE orders SET discount = 0.1 WHERE order_id % 3 = 0')
conn.commit()

result = sql("""
SELECT
    COUNT(*)        AS count_all_rows,
    COUNT(discount) AS count_non_null_discount,
    COUNT(1)        AS count_one  -- same as COUNT(*), slightly explicit
FROM orders
""")

print('COUNT(*) vs COUNT(column):')
print('  COUNT(*):          counts ALL rows (including NULLs in any column)')
print('  COUNT(col):        counts rows where col IS NOT NULL')
print('  COUNT(DISTINCT c): counts unique non-NULL values')
print()
display(result)

# --- Anti-pattern 5: LIKE with leading wildcard ---

cur.execute('CREATE INDEX idx_customers_name ON customers (name)')
conn.commit()

# BAD: leading wildcard — full scan regardless of index
q_like_bad  = "SELECT * FROM customers WHERE name LIKE '%_42'"
# GOOD: suffix wildcard — can use index for prefix scan
q_like_good = "SELECT * FROM customers WHERE name LIKE 'Customer_4%'"

print('LIKE with leading wildcard (%) — defeats index:')
display(explain(q_like_bad))

print('LIKE with trailing wildcard — uses index:')
display(explain(q_like_good))

print('Fix: use full-text search (FTS5 in SQLite) for contains searches')
print('Or: store reversed strings and use suffix LIKE as prefix LIKE on reversed')

6. Timing Comparisons: With vs Without Indexes¶

# Create a second connection WITHOUT indexes for fair comparison
conn_no_idx = sqlite3.connect(':memory:')
customers_df.to_sql('customers', conn_no_idx, if_exists='replace', index=False)
products_df.to_sql('products',   conn_no_idx, if_exists='replace', index=False)
orders_df.to_sql('orders',       conn_no_idx, if_exists='replace', index=False)

# Benchmark queries
benchmarks = [
    (
        'Lookup by customer_id',
        'SELECT order_id, status FROM orders WHERE customer_id = 500'
    ),
    (
        'Filter by status + date range',
        "SELECT COUNT(*) FROM orders WHERE status='completed' AND order_date >= '2023-06-01'"
    ),
    (
        'JOIN orders + customers',
        "SELECT c.name, COUNT(*) FROM orders o JOIN customers c USING(customer_id) GROUP BY c.customer_id LIMIT 100"
    ),
]

N = 50
print(f'Timing {N} executions each. Units: milliseconds per query.\n')
print(f'{"Query":<35} {"No Index":>12} {"Indexed":>12} {"Speedup":>10}')
print('-' * 72)

for name, q in benchmarks:
    t_no_idx = timeit.timeit(lambda: conn_no_idx.execute(q).fetchall(), number=N) / N * 1000
    t_idx    = timeit.timeit(lambda: conn.execute(q).fetchall(),        number=N) / N * 1000
    speedup  = t_no_idx / t_idx if t_idx > 0 else float('inf')
    print(f'{name:<35} {t_no_idx:>11.3f}ms {t_idx:>11.3f}ms {speedup:>9.1f}x')

conn_no_idx.close()

# Summary: all current indexes in our database
print('All indexes on our database:')
sql("SELECT name, tbl_name FROM sqlite_master WHERE type='index' ORDER BY tbl_name, name")

Cheat Sheet¶

-- Check query plan
EXPLAIN QUERY PLAN SELECT ...;
-- SCAN = full table scan (slow) | SEARCH USING INDEX = fast

-- Create indexes
CREATE INDEX idx_name     ON table (col);               -- single
CREATE INDEX idx_name     ON table (col1, col2);        -- composite
CREATE INDEX idx_name     ON table (col1, col2, col3);  -- covering
CREATE UNIQUE INDEX ...;                                 -- enforce uniqueness
DROP INDEX idx_name;

-- Composite index left-prefix rule:
-- Index (a, b, c) helps: WHERE a=1 | WHERE a=1 AND b=2 | ORDER BY a, b
-- Index (a, b, c) doesn't help: WHERE b=1 | WHERE c=1

-- Anti-patterns to avoid:
-- 1. WHERE YEAR(col) = 2023       →  WHERE col >= '2023-01-01' AND col < '2024-01-01'
-- 2. SELECT *                     →  SELECT only needed columns
-- 3. Loop with one query per row  →  Single JOIN/GROUP BY
-- 4. LIKE '%keyword'              →  LIKE 'keyword%' or use FTS
-- 5. COUNT(nullable_col) != COUNT(*)  — know the difference

-- Index checklist:
-- ✅ Index every foreign key
-- ✅ Index high-cardinality columns used in WHERE
-- ✅ Composite index column order = most selective first
-- ⚠️  Too many indexes slow down INSERT/UPDATE/DELETE

7. Exercises¶

EXPLAIN analysis: Write a query that joins all three tables and filters by status and region. Run EXPLAIN QUERY PLAN on it, identify all SCAN operations, then add appropriate indexes to eliminate them. Show the before/after plans.
Composite index design: You have a query: WHERE region = 'North' AND status = 'completed' ORDER BY signup_date DESC. Design the optimal composite index. Explain why column order matters for this specific query.
Benchmark: Create a query that aggregates total revenue by category and month. Benchmark it with timeit against 1K, 5K, and 10K rows. Plot the results. Does adding an index on (product_id, order_date, status) help? Why or why not?

Fix the anti-patterns: The following query has 3 anti-patterns. Identify and fix each one:

SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE strftime('%m', o.order_date) = '06'
AND c.email LIKE '%@example.com'

Partial index: SQLite supports partial indexes (CREATE INDEX ... WHERE status = 'completed'). Create one for completed orders and show with EXPLAIN QUERY PLAN that it’s used. Compare its size (via sqlite_master) to a full index on the same column.