Run this notebook: Open in Colab Open in Kaggle

Advanced SQL: Window Functions, CTEs, and Query Patterns That Actually Matter¶

Beyond SELECT * FROM table — window functions, CTEs, recursive queries, and lateral joins are what separate junior from senior SQL. This notebook covers the patterns that appear in every data engineering interview and production pipeline.

1. Setup: Synthetic E-Commerce Database¶

We use Python’s built-in sqlite3 module — no installation needed. We’ll create three tables that mirror a real e-commerce schema:

customers — who bought
products — what was bought
orders — the transaction record

All SQL in this notebook runs against a real SQLite database in memory.

import sqlite3
import pandas as pd
from datetime import date, timedelta
import random
random.seed(42)

# In-memory SQLite database
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row  # named columns
cur = conn.cursor()

# Helper: run SQL and return a DataFrame
def sql(query, params=None):
    if params:
        return pd.read_sql_query(query, conn, params=params)
    return pd.read_sql_query(query, conn)

print('SQLite version:', sqlite3.sqlite_version)

# ---------- DDL ----------
cur.executescript("""
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS employees;

CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    region        TEXT NOT NULL,
    signup_date   TEXT NOT NULL
);

CREATE TABLE products (
    product_id    INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    category      TEXT NOT NULL,
    price         REAL NOT NULL
);

CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customers(customer_id),
    product_id    INTEGER REFERENCES products(product_id),
    quantity      INTEGER NOT NULL,
    order_date    TEXT NOT NULL,
    status        TEXT NOT NULL
);

CREATE TABLE employees (
    emp_id        INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    manager_id    INTEGER,
    department    TEXT,
    salary        REAL
);
""")

# ---------- Seed Data ----------
regions = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']

customers_data = [
    (i, f'Customer_{i}', random.choice(regions),
     str(date(2022, 1, 1) + timedelta(days=random.randint(0, 365))))
    for i in range(1, 51)
]

products_data = [
    (i, f'Product_{i}', random.choice(categories), round(random.uniform(5, 500), 2))
    for i in range(1, 21)
]

start = date(2023, 1, 1)
statuses = ['completed', 'completed', 'completed', 'returned', 'pending']
orders_data = [
    (i, random.randint(1, 50), random.randint(1, 20),
     random.randint(1, 5),
     str(start + timedelta(days=random.randint(0, 364))),
     random.choice(statuses))
    for i in range(1, 301)
]

# Employee org-chart (for recursive CTE)
employees_data = [
    (1, 'Alice (CEO)',    None, 'Executive', 250000),
    (2, 'Bob (VP Eng)',   1,    'Engineering', 180000),
    (3, 'Carol (VP Mkt)', 1,    'Marketing',   170000),
    (4, 'Dave (EM)',      2,    'Engineering', 140000),
    (5, 'Eve (EM)',       2,    'Engineering', 138000),
    (6, 'Frank (IC)',     4,    'Engineering', 120000),
    (7, 'Grace (IC)',     4,    'Engineering', 118000),
    (8, 'Hank (IC)',      5,    'Engineering', 115000),
    (9, 'Ivy (Mkt Mgr)',  3,    'Marketing',   130000),
    (10,'Jack (Mkt IC)',  9,    'Marketing',   105000),
]

cur.executemany('INSERT INTO customers VALUES (?,?,?,?)', customers_data)
cur.executemany('INSERT INTO products VALUES (?,?,?,?)', products_data)
cur.executemany('INSERT INTO orders VALUES (?,?,?,?,?,?)', orders_data)
cur.executemany('INSERT INTO employees VALUES (?,?,?,?,?)', employees_data)
conn.commit()

print(f'customers: {cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]}')
print(f'products:  {cur.execute("SELECT COUNT(*) FROM products").fetchone()[0]}')
print(f'orders:    {cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]}')
print(f'employees: {cur.execute("SELECT COUNT(*) FROM employees").fetchone()[0]}')

# Preview the tables
print('--- orders (first 5) ---')
display(sql('SELECT * FROM orders LIMIT 5'))
print('--- customers (first 5) ---')
display(sql('SELECT * FROM customers LIMIT 5'))
print('--- products (first 5) ---')
display(sql('SELECT * FROM products LIMIT 5'))

2. Window Functions¶

Window functions compute a value across a window (set of rows related to the current row) without collapsing rows the way GROUP BY does.

FUNCTION() OVER (
    PARTITION BY <column>   -- optional: reset window per group
    ORDER BY <column>       -- optional: order within window
    ROWS/RANGE BETWEEN ...  -- optional: frame definition
)

Category of window functions:

Category	Functions
Ranking	ROW_NUMBER, RANK, DENSE_RANK, NTILE
Offset	LAG, LEAD, FIRST_VALUE, LAST_VALUE
Aggregate	SUM, AVG, MIN, MAX, COUNT (with OVER)

# --- ROW_NUMBER, RANK, DENSE_RANK ---
# Business question: Rank customers by total spend. Handle ties correctly.

query_ranking = """
WITH customer_spend AS (
    SELECT
        c.customer_id,
        c.name,
        c.region,
        ROUND(SUM(o.quantity * p.price), 2) AS total_spend
    FROM customers c
    JOIN orders o USING (customer_id)
    JOIN products p USING (product_id)
    WHERE o.status = 'completed'
    GROUP BY c.customer_id, c.name, c.region
)
SELECT
    name,
    region,
    total_spend,
    ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS row_num,
    RANK()       OVER (ORDER BY total_spend DESC) AS rank_val,
    DENSE_RANK() OVER (ORDER BY total_spend DESC) AS dense_rank_val
FROM customer_spend
ORDER BY total_spend DESC
LIMIT 10
"""

print('ROW_NUMBER vs RANK vs DENSE_RANK:')
print('  ROW_NUMBER: always unique (1,2,3,4...)')
print('  RANK:       skips after tie (1,2,2,4...)')
print('  DENSE_RANK: no gaps after tie (1,2,2,3...)')
sql(query_ranking)

# --- ROW_NUMBER with PARTITION BY ---
# Business question: Top-spending customer per region

query_top_per_region = """
WITH customer_spend AS (
    SELECT
        c.customer_id,
        c.name,
        c.region,
        ROUND(SUM(o.quantity * p.price), 2) AS total_spend
    FROM customers c
    JOIN orders o USING (customer_id)
    JOIN products p USING (product_id)
    WHERE o.status = 'completed'
    GROUP BY c.customer_id, c.name, c.region
),
ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_spend DESC) AS rn
    FROM customer_spend
)
SELECT region, name, total_spend
FROM ranked
WHERE rn = 1
ORDER BY region
"""

print('Top-spending customer per region (PARTITION BY trick):')
sql(query_top_per_region)

# --- NTILE ---
# Business question: Bucket customers into spending quartiles

query_ntile = """
WITH customer_spend AS (
    SELECT
        c.customer_id,
        c.name,
        ROUND(SUM(o.quantity * p.price), 2) AS total_spend
    FROM customers c
    JOIN orders o USING (customer_id)
    JOIN products p USING (product_id)
    WHERE o.status = 'completed'
    GROUP BY c.customer_id, c.name
)
SELECT
    name,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_spend
ORDER BY total_spend DESC
LIMIT 12
"""

print('NTILE(4): splits rows into 4 equal buckets (Q1=highest spenders)')
sql(query_ntile)

# --- LAG / LEAD ---
# Business question: Month-over-month revenue change

query_lag_lead = """
WITH monthly_revenue AS (
    SELECT
        SUBSTR(order_date, 1, 7) AS month,
        ROUND(SUM(o.quantity * p.price), 2) AS revenue
    FROM orders o
    JOIN products p USING (product_id)
    WHERE o.status = 'completed'
    GROUP BY month
    ORDER BY month
)
SELECT
    month,
    revenue,
    LAG(revenue, 1)  OVER (ORDER BY month) AS prev_month_rev,
    LEAD(revenue, 1) OVER (ORDER BY month) AS next_month_rev,
    ROUND(
        (revenue - LAG(revenue,1) OVER (ORDER BY month))
        / LAG(revenue,1) OVER (ORDER BY month) * 100, 1
    ) AS mom_pct_change
FROM monthly_revenue
"""

print('LAG(col, n): looks back n rows | LEAD(col, n): looks forward n rows')
sql(query_lag_lead)

# --- Running Total and Moving Average (ROWS BETWEEN frame) ---
# Business question: Cumulative revenue and 3-month moving average

query_running = """
WITH monthly_revenue AS (
    SELECT
        SUBSTR(order_date, 1, 7) AS month,
        ROUND(SUM(o.quantity * p.price), 2) AS revenue
    FROM orders o
    JOIN products p USING (product_id)
    WHERE o.status = 'completed'
    GROUP BY month
    ORDER BY month
)
SELECT
    month,
    revenue,
    ROUND(SUM(revenue)  OVER (ORDER BY month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 2) AS cumulative_rev,
    ROUND(AVG(revenue)  OVER (ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2)         AS moving_avg_3m
FROM monthly_revenue
"""

print('Frame options:')
print('  UNBOUNDED PRECEDING → from first row')
print('  N PRECEDING         → N rows back')
print('  CURRENT ROW         → the current row')
print('  N FOLLOWING         → N rows ahead')
print('  UNBOUNDED FOLLOWING → to last row')
print()
sql(query_running)

3. Common Table Expressions (CTEs)¶

CTEs (introduced with WITH) give intermediate result sets a name. They improve readability and enable recursive queries.

Syntax:

WITH cte_name AS (
    SELECT ...
),
another_cte AS (
    SELECT ... FROM cte_name ...
)
SELECT * FROM another_cte;

CTEs vs subqueries: CTEs are named (reusable), readable (top-to-bottom flow), and required for recursive queries.

# --- Basic CTE ---
# Find products with above-average revenue in their category

query_basic_cte = """
WITH product_revenue AS (
    SELECT
        p.product_id,
        p.name        AS product_name,
        p.category,
        p.price,
        SUM(o.quantity) AS units_sold,
        ROUND(SUM(o.quantity * p.price), 2) AS revenue
    FROM products p
    JOIN orders o USING (product_id)
    WHERE o.status = 'completed'
    GROUP BY p.product_id, p.name, p.category, p.price
),
category_avg AS (
    SELECT category, ROUND(AVG(revenue), 2) AS avg_category_revenue
    FROM product_revenue
    GROUP BY category
)
SELECT
    pr.product_name,
    pr.category,
    pr.revenue,
    ca.avg_category_revenue,
    ROUND(pr.revenue - ca.avg_category_revenue, 2) AS above_avg_by
FROM product_revenue pr
JOIN category_avg ca USING (category)
WHERE pr.revenue > ca.avg_category_revenue
ORDER BY above_avg_by DESC
"""

print('Chained CTEs: product_revenue → category_avg → final join')
sql(query_basic_cte)

# --- Recursive CTE: Employee Org-Chart Hierarchy ---
# Find all reports under each manager at every level

query_recursive = """
WITH RECURSIVE org_chart AS (
    -- Anchor: start with the CEO (no manager)
    SELECT
        emp_id,
        name,
        manager_id,
        0              AS level,
        name           AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive step: each employee's direct reports
    SELECT
        e.emp_id,
        e.name,
        e.manager_id,
        oc.level + 1,
        oc.path || ' → ' || e.name
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.emp_id
)
SELECT
    REPEAT('  ', level) || name AS indented_name,
    level,
    path
FROM org_chart
ORDER BY path
"""

print('Recursive CTE structure:')
print('  1. Anchor member  — base case (WHERE manager_id IS NULL)')
print('  2. UNION ALL')
print('  3. Recursive member — joins back to the CTE itself')
print()
result = sql(query_recursive)
for _, row in result.iterrows():
    print(row['indented_name'])

4. Advanced Aggregations: ROLLUP, CUBE, GROUPING SETS, FILTER¶

Standard GROUP BY produces one row per unique combination of grouped columns. These extensions produce multiple levels of aggregation in one query.

Clause	What it produces
`ROLLUP(a,b)`	(a,b), (a), () — hierarchical subtotals
`CUBE(a,b)`	all combinations: (a,b), (a), (b), ()
`GROUPING SETS`	explicit list of groupings
`FILTER (WHERE ...)`	conditional aggregation in one pass

SQLite supports GROUPING SETS via workaround and FILTER natively.

# --- FILTER clause (conditional aggregation) ---
# Business question: pivot order counts by status in one query

query_filter = """
SELECT
    p.category,
    COUNT(*)                                    AS total_orders,
    COUNT(*) FILTER (WHERE o.status='completed') AS completed,
    COUNT(*) FILTER (WHERE o.status='returned')  AS returned,
    COUNT(*) FILTER (WHERE o.status='pending')   AS pending,
    ROUND(
        100.0 * COUNT(*) FILTER (WHERE o.status='returned')
        / COUNT(*), 1
    ) AS return_rate_pct
FROM orders o
JOIN products p USING (product_id)
GROUP BY p.category
ORDER BY return_rate_pct DESC
"""

print('FILTER clause: conditional aggregate without CASE WHEN clutter')
sql(query_filter)

# --- GROUPING SETS emulation in SQLite via UNION ALL ---
# (SQLite doesn't have native ROLLUP/CUBE but the concept is identical)
# Business question: revenue by (region, category), (region), (category), (grand total)

query_grouping_sets = """
SELECT region, category, ROUND(SUM(o.quantity * p.price),2) AS revenue, 'region+category' AS grouping_level
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
GROUP BY region, category

UNION ALL

SELECT region, 'ALL', ROUND(SUM(o.quantity * p.price),2), 'region_subtotal'
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
GROUP BY region

UNION ALL

SELECT 'ALL', category, ROUND(SUM(o.quantity * p.price),2), 'category_subtotal'
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
GROUP BY category

UNION ALL

SELECT 'ALL', 'ALL', ROUND(SUM(o.quantity * p.price),2), 'grand_total'
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'

ORDER BY grouping_level, region, category
"""

print('GROUPING SETS equivalent (ROLLUP simulation via UNION ALL):')
print('In PostgreSQL/MySQL you would write: GROUP BY ROLLUP(region, category)')
sql(query_grouping_sets)

5. Lateral Joins (Correlated Subqueries)¶

A lateral join lets each row of the outer query be passed into a subquery, so the subquery can reference outer columns. In SQLite, this is done with correlated subqueries in the SELECT clause or CROSS JOIN with a subquery that references the outer table.

PostgreSQL syntax: JOIN LATERAL (...) AS sub ON TRUE

Use case: Get the N most recent orders per customer without a CTE + ROW_NUMBER.

# --- Lateral join pattern (correlated subquery) ---
# Business question: For each customer, show their most recent order date and amount

query_lateral = """
SELECT
    c.customer_id,
    c.name,
    c.region,
    -- Correlated subquery: runs once per customer row (lateral equivalent)
    (
        SELECT MAX(o.order_date)
        FROM orders o
        WHERE o.customer_id = c.customer_id
          AND o.status = 'completed'
    ) AS last_order_date,
    (
        SELECT ROUND(SUM(o.quantity * p.price), 2)
        FROM orders o
        JOIN products p USING (product_id)
        WHERE o.customer_id = c.customer_id
          AND o.status = 'completed'
          AND o.order_date = (
              SELECT MAX(o2.order_date)
              FROM orders o2
              WHERE o2.customer_id = c.customer_id
                AND o2.status = 'completed'
          )
    ) AS last_order_value
FROM customers c
ORDER BY last_order_date DESC NULLS LAST
LIMIT 10
"""

print('Correlated subquery (lateral equivalent in SQLite):')
print('Each subquery runs once per outer row, referencing c.customer_id')
sql(query_lateral)

6. Practical Patterns¶

6.1 Deduplication with ROW_NUMBER¶

6.2 Gaps and Islands¶

6.3 Running Percentiles¶

# --- Pattern 1: Deduplication with ROW_NUMBER ---
# Keep only the most recent order per customer

query_dedup = """
WITH ranked_orders AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC, order_id DESC
        ) AS rn
    FROM orders
    WHERE status = 'completed'
)
SELECT order_id, customer_id, product_id, quantity, order_date, status
FROM ranked_orders
WHERE rn = 1
ORDER BY customer_id
LIMIT 10
"""

print('Deduplication: keep latest completed order per customer')
print('Key: PARTITION BY the group, ORDER BY the tiebreaker, filter WHERE rn = 1')
sql(query_dedup)

# --- Pattern 2: Gaps and Islands ---
# Find consecutive months with orders (islands) and gaps between them

# First, create a simple monthly sequence to detect gaps
query_gaps_islands = """
WITH monthly_orders AS (
    SELECT DISTINCT SUBSTR(order_date, 1, 7) AS month
    FROM orders
    ORDER BY month
),
with_lag AS (
    SELECT
        month,
        LAG(month) OVER (ORDER BY month) AS prev_month
    FROM monthly_orders
),
-- Island detection: assign a group number that increments on each gap
islands AS (
    SELECT
        month,
        prev_month,
        CASE
            WHEN prev_month IS NULL THEN 'start'
            WHEN CAST(SUBSTR(month,6,2) AS INT) -
                 CAST(SUBSTR(prev_month,6,2) AS INT) > 1
              OR CAST(SUBSTR(month,1,4) AS INT) >
                 CAST(SUBSTR(prev_month,1,4) AS INT)
            THEN 'gap'
            ELSE 'continuous'
        END AS sequence_type
    FROM with_lag
)
SELECT month, prev_month, sequence_type
FROM islands
ORDER BY month
"""

print('Gaps and Islands: detect breaks in a sequence')
sql(query_gaps_islands)

# --- Pattern 3: Running Percentile (approximate) ---
# Assign each order to a percentile bucket based on order value

query_percentile = """
WITH order_values AS (
    SELECT
        o.order_id,
        o.customer_id,
        ROUND(o.quantity * p.price, 2) AS order_value
    FROM orders o
    JOIN products p USING (product_id)
    WHERE o.status = 'completed'
),
percentiled AS (
    SELECT *,
        NTILE(100) OVER (ORDER BY order_value) AS percentile,
        NTILE(10)  OVER (ORDER BY order_value) AS decile
    FROM order_values
)
SELECT
    decile,
    COUNT(*)                      AS orders_in_decile,
    ROUND(MIN(order_value), 2)    AS min_value,
    ROUND(AVG(order_value), 2)    AS avg_value,
    ROUND(MAX(order_value), 2)    AS max_value
FROM percentiled
GROUP BY decile
ORDER BY decile
"""

print('Percentile distribution of order values by decile:')
sql(query_percentile)

Cheat Sheet¶

-- Window function template
FUNC() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN N PRECEDING AND CURRENT ROW)

-- Ranking: ROW_NUMBER (unique) | RANK (gaps) | DENSE_RANK (no gaps) | NTILE(n)
-- Offset:  LAG(col, n, default) | LEAD(col, n, default)
-- Frames:  UNBOUNDED PRECEDING | N PRECEDING | CURRENT ROW | N FOLLOWING | UNBOUNDED FOLLOWING

-- CTE
WITH cte AS (SELECT ...), cte2 AS (SELECT ... FROM cte)
SELECT * FROM cte2;

-- Recursive CTE
WITH RECURSIVE tree AS (
    SELECT ... WHERE parent_id IS NULL           -- anchor
    UNION ALL
    SELECT ... JOIN tree ON parent_id = tree.id  -- recursive
)

-- Conditional aggregate
COUNT(*) FILTER (WHERE status = 'completed')
SUM(amount) FILTER (WHERE region = 'West')

-- Deduplication
WITH r AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) AS rn FROM t)
SELECT * FROM r WHERE rn = 1;

-- Top-N per group: same pattern, WHERE rn <= N

7. Exercises¶

Window ranking: Write a query that returns the top-3 products by revenue for each category. Use DENSE_RANK() and filter in a CTE.
Moving average: Compute a 7-day moving average of order count (by order_date). Handle days with no orders (you may need a calendar CTE or just show the available dates).
Recursive CTE: Modify the org-chart query to also return the total salary budget under each manager (sum of all direct and indirect reports’ salaries).
Cohort analysis: Group customers by the month they signed up (signup_date). For each cohort, calculate the total revenue they generated in their first month vs. second month. Use LAG to compute the retention change.
Gaps and Islands (advanced): Find all customers who placed orders in 3 or more consecutive months. Return their customer_id, the start month, and the length of the consecutive streak.