Advanced SQL: Window Functions, CTEs, and Query Patterns That Actually MatterΒΆ
Beyond SELECT * FROM table β window functions, CTEs, recursive queries, and lateral joins are what separate junior from senior SQL. This notebook covers the patterns that appear in every data engineering interview and production pipeline.
1. Setup: Synthetic E-Commerce DatabaseΒΆ
We use Pythonβs built-in sqlite3 module β no installation needed. Weβll create three tables that mirror a real e-commerce schema:
customers β who bought
products β what was bought
orders β the transaction record
All SQL in this notebook runs against a real SQLite database in memory.
import sqlite3
import pandas as pd
from datetime import date, timedelta
import random
random.seed(42)
# In-memory SQLite database
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row # named columns
cur = conn.cursor()
# Helper: run SQL and return a DataFrame
def sql(query, params=None):
if params:
return pd.read_sql_query(query, conn, params=params)
return pd.read_sql_query(query, conn)
print('SQLite version:', sqlite3.sqlite_version)
# ---------- DDL ----------
cur.executescript("""
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS employees;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
region TEXT NOT NULL,
signup_date TEXT NOT NULL
);
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
category TEXT NOT NULL,
price REAL NOT NULL
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL,
order_date TEXT NOT NULL,
status TEXT NOT NULL
);
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
manager_id INTEGER,
department TEXT,
salary REAL
);
""")
# ---------- Seed Data ----------
regions = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
customers_data = [
(i, f'Customer_{i}', random.choice(regions),
str(date(2022, 1, 1) + timedelta(days=random.randint(0, 365))))
for i in range(1, 51)
]
products_data = [
(i, f'Product_{i}', random.choice(categories), round(random.uniform(5, 500), 2))
for i in range(1, 21)
]
start = date(2023, 1, 1)
statuses = ['completed', 'completed', 'completed', 'returned', 'pending']
orders_data = [
(i, random.randint(1, 50), random.randint(1, 20),
random.randint(1, 5),
str(start + timedelta(days=random.randint(0, 364))),
random.choice(statuses))
for i in range(1, 301)
]
# Employee org-chart (for recursive CTE)
employees_data = [
(1, 'Alice (CEO)', None, 'Executive', 250000),
(2, 'Bob (VP Eng)', 1, 'Engineering', 180000),
(3, 'Carol (VP Mkt)', 1, 'Marketing', 170000),
(4, 'Dave (EM)', 2, 'Engineering', 140000),
(5, 'Eve (EM)', 2, 'Engineering', 138000),
(6, 'Frank (IC)', 4, 'Engineering', 120000),
(7, 'Grace (IC)', 4, 'Engineering', 118000),
(8, 'Hank (IC)', 5, 'Engineering', 115000),
(9, 'Ivy (Mkt Mgr)', 3, 'Marketing', 130000),
(10,'Jack (Mkt IC)', 9, 'Marketing', 105000),
]
cur.executemany('INSERT INTO customers VALUES (?,?,?,?)', customers_data)
cur.executemany('INSERT INTO products VALUES (?,?,?,?)', products_data)
cur.executemany('INSERT INTO orders VALUES (?,?,?,?,?,?)', orders_data)
cur.executemany('INSERT INTO employees VALUES (?,?,?,?,?)', employees_data)
conn.commit()
print(f'customers: {cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]}')
print(f'products: {cur.execute("SELECT COUNT(*) FROM products").fetchone()[0]}')
print(f'orders: {cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]}')
print(f'employees: {cur.execute("SELECT COUNT(*) FROM employees").fetchone()[0]}')
# Preview the tables
print('--- orders (first 5) ---')
display(sql('SELECT * FROM orders LIMIT 5'))
print('--- customers (first 5) ---')
display(sql('SELECT * FROM customers LIMIT 5'))
print('--- products (first 5) ---')
display(sql('SELECT * FROM products LIMIT 5'))
2. Window FunctionsΒΆ
Window functions compute a value across a window (set of rows related to the current row) without collapsing rows the way GROUP BY does.
FUNCTION() OVER (
PARTITION BY <column> -- optional: reset window per group
ORDER BY <column> -- optional: order within window
ROWS/RANGE BETWEEN ... -- optional: frame definition
)
Category of window functions:
Category |
Functions |
|---|---|
Ranking |
ROW_NUMBER, RANK, DENSE_RANK, NTILE |
Offset |
LAG, LEAD, FIRST_VALUE, LAST_VALUE |
Aggregate |
SUM, AVG, MIN, MAX, COUNT (with OVER) |
# --- ROW_NUMBER, RANK, DENSE_RANK ---
# Business question: Rank customers by total spend. Handle ties correctly.
query_ranking = """
WITH customer_spend AS (
SELECT
c.customer_id,
c.name,
c.region,
ROUND(SUM(o.quantity * p.price), 2) AS total_spend
FROM customers c
JOIN orders o USING (customer_id)
JOIN products p USING (product_id)
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name, c.region
)
SELECT
name,
region,
total_spend,
ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS row_num,
RANK() OVER (ORDER BY total_spend DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY total_spend DESC) AS dense_rank_val
FROM customer_spend
ORDER BY total_spend DESC
LIMIT 10
"""
print('ROW_NUMBER vs RANK vs DENSE_RANK:')
print(' ROW_NUMBER: always unique (1,2,3,4...)')
print(' RANK: skips after tie (1,2,2,4...)')
print(' DENSE_RANK: no gaps after tie (1,2,2,3...)')
sql(query_ranking)
# --- ROW_NUMBER with PARTITION BY ---
# Business question: Top-spending customer per region
query_top_per_region = """
WITH customer_spend AS (
SELECT
c.customer_id,
c.name,
c.region,
ROUND(SUM(o.quantity * p.price), 2) AS total_spend
FROM customers c
JOIN orders o USING (customer_id)
JOIN products p USING (product_id)
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name, c.region
),
ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_spend DESC) AS rn
FROM customer_spend
)
SELECT region, name, total_spend
FROM ranked
WHERE rn = 1
ORDER BY region
"""
print('Top-spending customer per region (PARTITION BY trick):')
sql(query_top_per_region)
# --- NTILE ---
# Business question: Bucket customers into spending quartiles
query_ntile = """
WITH customer_spend AS (
SELECT
c.customer_id,
c.name,
ROUND(SUM(o.quantity * p.price), 2) AS total_spend
FROM customers c
JOIN orders o USING (customer_id)
JOIN products p USING (product_id)
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name
)
SELECT
name,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_spend
ORDER BY total_spend DESC
LIMIT 12
"""
print('NTILE(4): splits rows into 4 equal buckets (Q1=highest spenders)')
sql(query_ntile)
# --- LAG / LEAD ---
# Business question: Month-over-month revenue change
query_lag_lead = """
WITH monthly_revenue AS (
SELECT
SUBSTR(order_date, 1, 7) AS month,
ROUND(SUM(o.quantity * p.price), 2) AS revenue
FROM orders o
JOIN products p USING (product_id)
WHERE o.status = 'completed'
GROUP BY month
ORDER BY month
)
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_rev,
LEAD(revenue, 1) OVER (ORDER BY month) AS next_month_rev,
ROUND(
(revenue - LAG(revenue,1) OVER (ORDER BY month))
/ LAG(revenue,1) OVER (ORDER BY month) * 100, 1
) AS mom_pct_change
FROM monthly_revenue
"""
print('LAG(col, n): looks back n rows | LEAD(col, n): looks forward n rows')
sql(query_lag_lead)
# --- Running Total and Moving Average (ROWS BETWEEN frame) ---
# Business question: Cumulative revenue and 3-month moving average
query_running = """
WITH monthly_revenue AS (
SELECT
SUBSTR(order_date, 1, 7) AS month,
ROUND(SUM(o.quantity * p.price), 2) AS revenue
FROM orders o
JOIN products p USING (product_id)
WHERE o.status = 'completed'
GROUP BY month
ORDER BY month
)
SELECT
month,
revenue,
ROUND(SUM(revenue) OVER (ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 2) AS cumulative_rev,
ROUND(AVG(revenue) OVER (ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS moving_avg_3m
FROM monthly_revenue
"""
print('Frame options:')
print(' UNBOUNDED PRECEDING β from first row')
print(' N PRECEDING β N rows back')
print(' CURRENT ROW β the current row')
print(' N FOLLOWING β N rows ahead')
print(' UNBOUNDED FOLLOWING β to last row')
print()
sql(query_running)
3. Common Table Expressions (CTEs)ΒΆ
CTEs (introduced with WITH) give intermediate result sets a name. They improve readability and enable recursive queries.
Syntax:
WITH cte_name AS (
SELECT ...
),
another_cte AS (
SELECT ... FROM cte_name ...
)
SELECT * FROM another_cte;
CTEs vs subqueries: CTEs are named (reusable), readable (top-to-bottom flow), and required for recursive queries.
# --- Basic CTE ---
# Find products with above-average revenue in their category
query_basic_cte = """
WITH product_revenue AS (
SELECT
p.product_id,
p.name AS product_name,
p.category,
p.price,
SUM(o.quantity) AS units_sold,
ROUND(SUM(o.quantity * p.price), 2) AS revenue
FROM products p
JOIN orders o USING (product_id)
WHERE o.status = 'completed'
GROUP BY p.product_id, p.name, p.category, p.price
),
category_avg AS (
SELECT category, ROUND(AVG(revenue), 2) AS avg_category_revenue
FROM product_revenue
GROUP BY category
)
SELECT
pr.product_name,
pr.category,
pr.revenue,
ca.avg_category_revenue,
ROUND(pr.revenue - ca.avg_category_revenue, 2) AS above_avg_by
FROM product_revenue pr
JOIN category_avg ca USING (category)
WHERE pr.revenue > ca.avg_category_revenue
ORDER BY above_avg_by DESC
"""
print('Chained CTEs: product_revenue β category_avg β final join')
sql(query_basic_cte)
# --- Recursive CTE: Employee Org-Chart Hierarchy ---
# Find all reports under each manager at every level
query_recursive = """
WITH RECURSIVE org_chart AS (
-- Anchor: start with the CEO (no manager)
SELECT
emp_id,
name,
manager_id,
0 AS level,
name AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive step: each employee's direct reports
SELECT
e.emp_id,
e.name,
e.manager_id,
oc.level + 1,
oc.path || ' β ' || e.name
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.emp_id
)
SELECT
REPEAT(' ', level) || name AS indented_name,
level,
path
FROM org_chart
ORDER BY path
"""
print('Recursive CTE structure:')
print(' 1. Anchor member β base case (WHERE manager_id IS NULL)')
print(' 2. UNION ALL')
print(' 3. Recursive member β joins back to the CTE itself')
print()
result = sql(query_recursive)
for _, row in result.iterrows():
print(row['indented_name'])
4. Advanced Aggregations: ROLLUP, CUBE, GROUPING SETS, FILTERΒΆ
Standard GROUP BY produces one row per unique combination of grouped columns. These extensions produce multiple levels of aggregation in one query.
Clause |
What it produces |
|---|---|
|
(a,b), (a), () β hierarchical subtotals |
|
all combinations: (a,b), (a), (b), () |
|
explicit list of groupings |
|
conditional aggregation in one pass |
SQLite supports
GROUPING SETSvia workaround andFILTERnatively.
# --- FILTER clause (conditional aggregation) ---
# Business question: pivot order counts by status in one query
query_filter = """
SELECT
p.category,
COUNT(*) AS total_orders,
COUNT(*) FILTER (WHERE o.status='completed') AS completed,
COUNT(*) FILTER (WHERE o.status='returned') AS returned,
COUNT(*) FILTER (WHERE o.status='pending') AS pending,
ROUND(
100.0 * COUNT(*) FILTER (WHERE o.status='returned')
/ COUNT(*), 1
) AS return_rate_pct
FROM orders o
JOIN products p USING (product_id)
GROUP BY p.category
ORDER BY return_rate_pct DESC
"""
print('FILTER clause: conditional aggregate without CASE WHEN clutter')
sql(query_filter)
# --- GROUPING SETS emulation in SQLite via UNION ALL ---
# (SQLite doesn't have native ROLLUP/CUBE but the concept is identical)
# Business question: revenue by (region, category), (region), (category), (grand total)
query_grouping_sets = """
SELECT region, category, ROUND(SUM(o.quantity * p.price),2) AS revenue, 'region+category' AS grouping_level
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
GROUP BY region, category
UNION ALL
SELECT region, 'ALL', ROUND(SUM(o.quantity * p.price),2), 'region_subtotal'
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
GROUP BY region
UNION ALL
SELECT 'ALL', category, ROUND(SUM(o.quantity * p.price),2), 'category_subtotal'
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
GROUP BY category
UNION ALL
SELECT 'ALL', 'ALL', ROUND(SUM(o.quantity * p.price),2), 'grand_total'
FROM customers c JOIN orders o USING(customer_id) JOIN products p USING(product_id)
WHERE o.status='completed'
ORDER BY grouping_level, region, category
"""
print('GROUPING SETS equivalent (ROLLUP simulation via UNION ALL):')
print('In PostgreSQL/MySQL you would write: GROUP BY ROLLUP(region, category)')
sql(query_grouping_sets)
6. Practical PatternsΒΆ
6.1 Deduplication with ROW_NUMBERΒΆ
6.2 Gaps and IslandsΒΆ
6.3 Running PercentilesΒΆ
# --- Pattern 1: Deduplication with ROW_NUMBER ---
# Keep only the most recent order per customer
query_dedup = """
WITH ranked_orders AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC, order_id DESC
) AS rn
FROM orders
WHERE status = 'completed'
)
SELECT order_id, customer_id, product_id, quantity, order_date, status
FROM ranked_orders
WHERE rn = 1
ORDER BY customer_id
LIMIT 10
"""
print('Deduplication: keep latest completed order per customer')
print('Key: PARTITION BY the group, ORDER BY the tiebreaker, filter WHERE rn = 1')
sql(query_dedup)
# --- Pattern 2: Gaps and Islands ---
# Find consecutive months with orders (islands) and gaps between them
# First, create a simple monthly sequence to detect gaps
query_gaps_islands = """
WITH monthly_orders AS (
SELECT DISTINCT SUBSTR(order_date, 1, 7) AS month
FROM orders
ORDER BY month
),
with_lag AS (
SELECT
month,
LAG(month) OVER (ORDER BY month) AS prev_month
FROM monthly_orders
),
-- Island detection: assign a group number that increments on each gap
islands AS (
SELECT
month,
prev_month,
CASE
WHEN prev_month IS NULL THEN 'start'
WHEN CAST(SUBSTR(month,6,2) AS INT) -
CAST(SUBSTR(prev_month,6,2) AS INT) > 1
OR CAST(SUBSTR(month,1,4) AS INT) >
CAST(SUBSTR(prev_month,1,4) AS INT)
THEN 'gap'
ELSE 'continuous'
END AS sequence_type
FROM with_lag
)
SELECT month, prev_month, sequence_type
FROM islands
ORDER BY month
"""
print('Gaps and Islands: detect breaks in a sequence')
sql(query_gaps_islands)
# --- Pattern 3: Running Percentile (approximate) ---
# Assign each order to a percentile bucket based on order value
query_percentile = """
WITH order_values AS (
SELECT
o.order_id,
o.customer_id,
ROUND(o.quantity * p.price, 2) AS order_value
FROM orders o
JOIN products p USING (product_id)
WHERE o.status = 'completed'
),
percentiled AS (
SELECT *,
NTILE(100) OVER (ORDER BY order_value) AS percentile,
NTILE(10) OVER (ORDER BY order_value) AS decile
FROM order_values
)
SELECT
decile,
COUNT(*) AS orders_in_decile,
ROUND(MIN(order_value), 2) AS min_value,
ROUND(AVG(order_value), 2) AS avg_value,
ROUND(MAX(order_value), 2) AS max_value
FROM percentiled
GROUP BY decile
ORDER BY decile
"""
print('Percentile distribution of order values by decile:')
sql(query_percentile)
Cheat SheetΒΆ
-- Window function template
FUNC() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN N PRECEDING AND CURRENT ROW)
-- Ranking: ROW_NUMBER (unique) | RANK (gaps) | DENSE_RANK (no gaps) | NTILE(n)
-- Offset: LAG(col, n, default) | LEAD(col, n, default)
-- Frames: UNBOUNDED PRECEDING | N PRECEDING | CURRENT ROW | N FOLLOWING | UNBOUNDED FOLLOWING
-- CTE
WITH cte AS (SELECT ...), cte2 AS (SELECT ... FROM cte)
SELECT * FROM cte2;
-- Recursive CTE
WITH RECURSIVE tree AS (
SELECT ... WHERE parent_id IS NULL -- anchor
UNION ALL
SELECT ... JOIN tree ON parent_id = tree.id -- recursive
)
-- Conditional aggregate
COUNT(*) FILTER (WHERE status = 'completed')
SUM(amount) FILTER (WHERE region = 'West')
-- Deduplication
WITH r AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) AS rn FROM t)
SELECT * FROM r WHERE rn = 1;
-- Top-N per group: same pattern, WHERE rn <= N
7. ExercisesΒΆ
Window ranking: Write a query that returns the top-3 products by revenue for each category. Use
DENSE_RANK()and filter in a CTE.Moving average: Compute a 7-day moving average of order count (by order_date). Handle days with no orders (you may need a calendar CTE or just show the available dates).
Recursive CTE: Modify the org-chart query to also return the total salary budget under each manager (sum of all direct and indirect reportsβ salaries).
Cohort analysis: Group customers by the month they signed up (
signup_date). For each cohort, calculate the total revenue they generated in their first month vs. second month. Use LAG to compute the retention change.Gaps and Islands (advanced): Find all customers who placed orders in 3 or more consecutive months. Return their customer_id, the start month, and the length of the consecutive streak.