AI Foundations: Symbolic vs Non-Symbolic AI & Control Theory
Source: The Math Behind Artificial Intelligence, Chapter 3
Overview
Before modern deep learning, AI took two radically different paths, and control theory was doing AI before "AI" was even a field.
This notebook covers:
What is Artificial Intelligence? - a precise definition
Symbolic AI (GOFAI) - rule-based reasoning
Non-Symbolic AI - statistical learning & neural networks
Control Theory as the "First AI" - feedback systems
PID controllers - the math of classical control
From control theory → reinforcement learning
1. What is Artificial Intelligence?
A precise definition and the landscape of AI approaches
AI is the development of systems that perform tasks which, when done by humans, would require intelligence. This deceptively simple definition encompasses everything from a thermostat (which makes decisions based on sensory input) to a large language model (which generates coherent text from a prompt). The field has historically organized itself around three levels of capability:
| Level | Name | Description | Status |
|---|---|---|---|
| 1 | Narrow AI (ANI) | Solves one specific task (chess, image classification) | Achieved |
| 2 | General AI (AGI) | Performs any intellectual task a human can | In progress |
| 3 | Super AI (ASI) | Surpasses human intelligence in all domains | Theoretical |
Two fundamental paradigms have competed to achieve AI, each with its own mathematical foundations:
Symbolic AI: encode intelligence as rules and logic (rooted in formal logic and discrete mathematics)
Non-Symbolic AI: learn intelligence from data (rooted in statistics, optimization, and linear algebra)
Understanding both paradigms, and how modern systems increasingly hybridize them, is essential for appreciating why the mathematical foundations covered in this curriculum matter.
2. Symbolic AI (GOFAI: Good Old-Fashioned AI)
Symbolic AI represents knowledge as symbols, rules, and logic, the way humans explicitly describe reasoning.
Key idea: If you can write down the rules, you can build an intelligent system.
Examples:
Expert systems (1970s-80s): MYCIN (medical diagnosis), DENDRAL (chemistry)
Prolog programs
Decision trees with hand-crafted features
Chess engines (Deep Blue combined hand-crafted symbolic evaluation with brute-force search)
Strength: Interpretable, provably correct for defined domains
Weakness: The knowledge-acquisition bottleneck (you can't write rules for everything)
# --- Symbolic AI: A simple expert system for loan approval ---
def expert_loan_system(income, credit_score, debt_ratio, employment_years):
    """
    Hand-crafted rule-based expert system.
    Represents symbolic AI: explicit if-then rules.
    """
    reasons = []
    # Rule 1: Minimum income
    if income < 30000:
        reasons.append("Income below minimum ($30k)")
    # Rule 2: Credit score threshold
    if credit_score < 620:
        reasons.append(f"Credit score {credit_score} below minimum (620)")
    # Rule 3: Debt-to-income ratio
    if debt_ratio > 0.43:
        reasons.append(f"Debt ratio {debt_ratio:.0%} exceeds limit (43%)")
    # Rule 4: Employment stability
    if employment_years < 2:
        reasons.append(f"Employment history {employment_years}y below minimum (2y)")
    # Composite rule: exceptional credit overrides income rule
    if credit_score >= 750 and income >= 25000:
        reasons = [r for r in reasons if "Income" not in r]
    approved = len(reasons) == 0
    return approved, reasons
# Test applicants
applicants = [
    {"name": "Alice", "income": 75000, "credit_score": 720, "debt_ratio": 0.30, "employment_years": 5},
    {"name": "Bob", "income": 28000, "credit_score": 580, "debt_ratio": 0.50, "employment_years": 1},
    {"name": "Charlie", "income": 26000, "credit_score": 760, "debt_ratio": 0.35, "employment_years": 3},
]

print("Expert System - Loan Approval Decisions")
print("=" * 50)
for a in applicants:
    approved, reasons = expert_loan_system(
        a["income"], a["credit_score"], a["debt_ratio"], a["employment_years"]
    )
    status = "✅ APPROVED" if approved else "❌ DENIED"
    print(f"\n{a['name']}: {status}")
    if reasons:
        for r in reasons:
            print(f"  - {r}")
3. Non-Symbolic AI: Statistical & Neural Learning
Non-symbolic AI learns rules from data rather than having them hand-coded.
Key idea: Given enough examples, a statistical model discovers its own patterns.
Three generations:
Statistical ML (1980s-2000s): SVMs, decision trees, random forests
Deep Learning (2012-present): CNNs, RNNs, Transformers
Foundation Models (2020-present): GPT, BERT, CLIP, Gemini
Strength: Handles perceptual problems (vision, speech, language) that are infeasible to encode as explicit rules
Weakness: Requires large amounts of data, is a black box, and can fail unpredictably
# --- Non-Symbolic AI: Same loan problem, learned from data ---
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Generate synthetic training data
np.random.seed(42)
n = 1000
income = np.random.normal(55000, 20000, n).clip(15000, 150000)
credit_score = np.random.normal(680, 80, n).clip(400, 850)
debt_ratio = np.random.uniform(0.1, 0.6, n)
emp_years = np.random.exponential(4, n).clip(0, 20)

# True outcome (based on similar logic to expert system, plus noise)
score = (income/100000)*0.3 + (credit_score/850)*0.4 + (1-debt_ratio)*0.2 + (emp_years/20)*0.1
approved = (score + np.random.normal(0, 0.05, n)) > 0.5
X = np.column_stack([income, credit_score, debt_ratio, emp_years])
y = approved.astype(int)

# Train a logistic regression model
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression()
model.fit(X_scaled, y)

print("Non-Symbolic AI - Learned Model Coefficients")
print("(model discovered these from data, no rules were hard-coded)")
print()
features = ['Income', 'Credit Score', 'Debt Ratio', 'Employment Years']
for feat, coef in zip(features, model.coef_[0]):
    direction = "↑ more → approved" if coef > 0 else "↑ more → denied"
    print(f"  {feat:20s}: coef={coef:+.3f} ({direction})")

# Predict the same applicants
print("\nPredictions on same applicants:")
for a in applicants:
    x = scaler.transform([[a['income'], a['credit_score'], a['debt_ratio'], a['employment_years']]])
    prob = model.predict_proba(x)[0][1]
    status = "✅ APPROVED" if prob > 0.5 else "❌ DENIED"
    print(f"  {a['name']}: {status} (probability={prob:.2%})")
4. Control Theory: The "First AI"
Control theory studies how to make a system behave the way you want using feedback loops.
Developed in the 1940s-50s (Norbert Wiener's Cybernetics, 1948), control theory solved intelligent adaptive behavior before AI was formalized.
Goal (setpoint)
       |
       v
  +--> [Controller] --> [System/Plant] --> Output
  |                                          |
  |  Error                                   |
  +------- [Sensor] <------------------------+
                 (Feedback)
Examples: Cruise control, thermostat, autopilot, insulin pump, rocket guidance
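The thermostat in this list uses the simplest feedback strategy of all: bang-bang (on/off) control. A minimal sketch of that idea, as a contrast to the PID controller below; the function name and the heating/cooling constants are illustrative choices, not values from the chapter:

```python
def bang_bang_thermostat(setpoint=70.0, hysteresis=1.0, steps=200):
    """On/off feedback control: the heater is either fully on or fully off.
    The hysteresis band around the setpoint prevents rapid switching."""
    temp = 20.0
    heater_on = False
    temps = [temp]
    for _ in range(steps):
        # Feedback: compare the measurement to the setpoint, with hysteresis
        if temp < setpoint - hysteresis:
            heater_on = True
        elif temp > setpoint + hysteresis:
            heater_on = False
        heating = 2.0 if heater_on else 0.0      # illustrative heating rate
        temp += heating - (temp - 20.0) * 0.02   # ambient cooling toward 20 °C
        temps.append(temp)
    return temps

temps = bang_bang_thermostat()
print(f"Final temperature: {temps[-1]:.1f} °C (oscillates inside the hysteresis band)")
```

Unlike PID, bang-bang control never settles: the output perpetually oscillates around the setpoint, which is why smoother control laws were developed.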
Connection to ML: Gradient descent is a feedback control loop:
Setpoint = 0 loss
Error = current loss
Controller = optimizer (SGD, Adam)
System = neural network
# --- PID Controller: The classic control theory algorithm ---
import matplotlib.pyplot as plt

class PIDController:
    """
    Proportional-Integral-Derivative (PID) controller.
    The foundational algorithm of control theory.
    Output = Kp*error + Ki*integral(error) + Kd*derivative(error)
    """
    def __init__(self, Kp, Ki, Kd, dt=0.1):
        self.Kp = Kp  # Proportional gain: react to current error
        self.Ki = Ki  # Integral gain: react to accumulated past error
        self.Kd = Kd  # Derivative gain: react to rate of error change
        self.dt = dt
        self.integral = 0
        self.prev_error = 0

    def step(self, setpoint, measured_value):
        error = setpoint - measured_value
        # P: proportional to current error
        P = self.Kp * error
        # I: integral of past errors (eliminates steady-state error)
        self.integral += error * self.dt
        I = self.Ki * self.integral
        # D: derivative (predicts future error, dampens oscillation)
        derivative = (error - self.prev_error) / self.dt
        D = self.Kd * derivative
        self.prev_error = error
        return P + I + D
# Simulate a temperature control system
# Target: reach 70°C from 20°C; the system has thermal inertia
def simulate_temperature_control(Kp, Ki, Kd, setpoint=70.0, steps=200):
    pid = PIDController(Kp, Ki, Kd, dt=0.1)
    temp = 20.0
    temps = [temp]
    for _ in range(steps):
        control = pid.step(setpoint, temp)
        # System dynamics: temperature changes proportionally to control input
        # but with inertia (slow response) and cooling (drag)
        temp += 0.1 * (control * 0.5 - (temp - 20) * 0.02)
        temps.append(temp)
    return temps
# Compare different PID tunings
configs = [
    ("P only (Kp=1.0)", 1.0, 0.0, 0.0, 'blue'),
    ("PD (Kp=1.0, Kd=0.5)", 1.0, 0.0, 0.5, 'green'),
    ("PID (Kp=1.0, Ki=0.1, Kd=0.5)", 1.0, 0.1, 0.5, 'red'),
]

t = np.arange(0, 20.1, 0.1)
fig, ax = plt.subplots(figsize=(12, 5))
ax.axhline(y=70, color='black', linestyle='--', linewidth=2, label='Setpoint (70°C)', alpha=0.7)
for label, Kp, Ki, Kd, color in configs:
    temps = simulate_temperature_control(Kp, Ki, Kd)
    ax.plot(t, temps, color=color, linewidth=2, label=label)
ax.set_xlabel('Time (seconds)', fontsize=12)
ax.set_ylabel('Temperature (°C)', fontsize=12)
ax.set_title('PID Controller: Temperature Control', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(15, 85)
plt.tight_layout()
plt.show()

print("\nPID components:")
print("  P (Proportional): Reacts to current error → fast but may overshoot")
print("  I (Integral): Eliminates steady-state error → fixes offset")
print("  D (Derivative): Predicts future error → dampens oscillations")
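One classic refinement of the controller above: when the setpoint jumps abruptly (say, 70°C to 100°C), the error derivative spikes for a single step and kicks the actuator ("derivative kick"). A common remedy is to differentiate the measurement instead of the error. A hedged sketch of that variant; the class name is ours, not from the chapter:

```python
class PIDNoKick:
    """PID variant that differentiates the measurement rather than the error,
    so a setpoint step does not produce a one-step derivative spike."""
    def __init__(self, Kp, Ki, Kd, dt=0.1):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.integral = 0.0
        self.prev_measured = None

    def step(self, setpoint, measured):
        error = setpoint - measured
        self.integral += error * self.dt
        # d(error)/dt = -d(measured)/dt while the setpoint is constant,
        # so using the measurement changes nothing in steady operation
        # but stays flat when the setpoint jumps
        if self.prev_measured is None:
            d_measured = 0.0
        else:
            d_measured = (measured - self.prev_measured) / self.dt
        self.prev_measured = measured
        return self.Kp * error + self.Ki * self.integral - self.Kd * d_measured

pid = PIDNoKick(Kp=1.0, Ki=0.0, Kd=0.5)
print(pid.step(70, 20))   # normal operation
print(pid.step(100, 20))  # setpoint jumps 70 → 100: no derivative spike
```

With the standard formulation, the second call would see d(error)/dt = 30/0.1 = 300 and add a 150-unit kick; here the output simply tracks the new, larger proportional term.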
5. From Control Theory to Machine Learning
The mathematical bridge between feedback systems and neural network training
The parallels between control theory and machine learning are not mere analogy; they reflect a deep mathematical kinship. Both fields solve the same fundamental problem: given a system with tunable parameters, how do you adjust those parameters to minimize the discrepancy between desired and actual behavior? The terminology differs, but the equations are strikingly similar.
| Control Theory | Machine Learning |
|---|---|
| Setpoint / Reference | Target label / desired output |
| Error signal | Loss function |
| Controller | Optimizer (SGD, Adam) |
| Plant / System | Neural network |
| Feedback loop | Backpropagation |
| Stability analysis | Convergence proofs |
| PID gains (Kp, Ki, Kd) | Learning rate, momentum, \(\beta_1\), \(\beta_2\) |
| Integral windup | Gradient accumulation (Adam) |
| Derivative kick | Momentum in optimizers |
The PID controller's integral term accumulates past errors to eliminate steady-state offset, much like Adam's first moment \(m_t\) accumulates gradient history. The derivative term anticipates future error by measuring the rate of change, analogous to how momentum dampens oscillations during training. Even modern concepts like learning rate schedules (warm-up, cosine annealing) mirror the gain scheduling techniques that control engineers have used for decades. Reinforcement learning makes this connection explicit: the Bellman equation in RL is a discrete-time version of control theory's Hamilton-Jacobi-Bellman equation.
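The integral-like role of Adam's first moment \(m_t\) can be made concrete with a minimal, self-contained 1D Adam update. This is a simplified sketch, not a drop-in for any library's implementation; the hyperparameter values are the conventional defaults, and the toy objective is ours:

```python
from math import sqrt

def adam_1d(grad, w0=10.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Minimal one-dimensional Adam. The first moment m is an exponentially
    forgetting integral of past gradients (the PID 'I' term analogue);
    the second moment v rescales the step size."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # accumulate gradient history
        v = beta2 * v + (1 - beta2) * g * g   # accumulate squared magnitude
        m_hat = m / (1 - beta1 ** t)          # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (sqrt(v_hat) + eps)
    return w

# Toy objective: L(w) = (w - 3)^2, gradient 2(w - 3), minimum at w = 3
w_final = adam_1d(lambda w: 2 * (w - 3.0))
print(f"Adam settled near w = {w_final:.3f}")
```

Just as an integral term keeps pushing while any steady error remains, \(m_t\) keeps pushing in the average gradient direction even when individual gradients are noisy.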
# --- Gradient Descent as a Control Loop ---
# Training a simple model IS running a PID-like feedback loop

# Simple 1D loss landscape: L(w) = (w - 3)^2
# True minimum at w = 3
def loss(w):
    return (w - 3.0) ** 2

def grad_loss(w):
    return 2 * (w - 3.0)

# Compare: plain gradient descent vs momentum (more like PID)
def gradient_descent(w0=10.0, lr=0.1, steps=30):
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * grad_loss(w)
        history.append(w)
    return history

def gradient_descent_momentum(w0=10.0, lr=0.1, beta=0.9, steps=30):
    w = w0
    v = 0  # velocity (integral term)
    history = [w]
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_loss(w)  # exponential moving average
        w = w - lr * v
        history.append(w)
    return history
gd = gradient_descent()
gd_mom = gradient_descent_momentum()
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# Parameter trajectory
axes[0].plot(gd, 'bo-', markersize=4, label='Plain GD')
axes[0].plot(gd_mom, 'rs-', markersize=4, label='GD + Momentum')
axes[0].axhline(y=3, color='green', linestyle='--', label='Optimal w=3')
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Parameter w')
axes[0].set_title('Parameter Trajectory')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Loss trajectory
axes[1].semilogy([loss(w) for w in gd], 'bo-', markersize=4, label='Plain GD')
axes[1].semilogy([loss(w) for w in gd_mom], 'rs-', markersize=4, label='GD + Momentum')
axes[1].set_xlabel('Step')
axes[1].set_ylabel('Loss (log scale)')
axes[1].set_title('Loss Convergence')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Gradient descent = feedback control:")
print(f" Setpoint: w* = 3.0")
print(f" Final GD parameter: w = {gd[-1]:.6f}")
print(f" Final Momentum parameter: w = {gd_mom[-1]:.6f}")
6. Symbolic vs Non-Symbolic: A Modern Perspective
Todayβs most powerful AI systems are hybrids:
| System | Symbolic component | Non-symbolic component |
|---|---|---|
| AlphaGo | Tree search (MCTS) | Value/policy networks |
| GPT + tools | Tool-use reasoning (code execution) | Transformer next-token prediction |
| AlphaCode | Test-based filtering of candidate programs | LLM code generation |
| Theorem provers (Lean+AI) | Formal proof system | Neural proof search |
The trend: Non-symbolic AI (LLMs) is increasingly incorporating symbolic reasoning through chain-of-thought, tool use, and formal verification.
Summary
Symbolic AI: explicit rules, interpretable, brittle for perception tasks
Non-symbolic AI: learned from data, powerful for perception, less interpretable
Control theory pioneered feedback learning before ML: same math, different framing
PID controller: P=react to error, I=fix accumulated error, D=predict future error
Gradient descent is a control feedback loop: error = loss, controller = optimizer
Exercises
Modify the expert loan system to add a new rule: if both income > 100k AND credit_score > 750, auto-approve regardless of other factors.
Which kind of AI (symbolic or non-symbolic) would you use for: (a) spam detection, (b) medical diagnosis requiring explanation, (c) real-time language translation? Justify.
In the PID simulation, what happens when you increase Kd (derivative gain) too much? Run the code and observe.
Map gradient descent to the PID framework: what corresponds to P, I, and D in a standard SGD update with momentum?
Reinforcement learning (RL) is often described as control theory + ML. Research: what is the "Bellman equation" and how does it relate to control theory's Hamilton-Jacobi-Bellman equation?