AI Foundations: Symbolic vs Non-Symbolic AI & Control Theory
Source: The Math Behind Artificial Intelligence, Chapter 3
Overview
Before modern deep learning, AI took two radically different paths, and control theory was doing AI before "AI" was even a field.
This notebook covers:
What is Artificial Intelligence? - a precise definition
Symbolic AI (GOFAI) - rule-based reasoning
Non-Symbolic AI - statistical learning & neural networks
Control Theory as the "First AI" - feedback systems
PID controllers - the math of classical control
From control theory → reinforcement learning
1. What is Artificial Intelligence?
A precise definition and the landscape of AI approaches
AI is the development of systems that perform tasks which, when done by humans, would require intelligence. This deceptively simple definition encompasses everything from a thermostat (which makes decisions based on sensory input) to a large language model (which generates coherent text from a prompt). The field has historically organized itself around three levels of capability:
| Level | Name | Description | Status |
|---|---|---|---|
| 1 | Narrow AI (ANI) | Solves one specific task (chess, image classification) | Achieved |
| 2 | General AI (AGI) | Performs any intellectual task a human can | In progress |
| 3 | Super AI (ASI) | Surpasses human intelligence in all domains | Theoretical |
Two fundamental paradigms have competed to achieve AI, each with its own mathematical foundations:
Symbolic AI: encode intelligence as rules and logic (rooted in formal logic and discrete mathematics)
Non-Symbolic AI: learn intelligence from data (rooted in statistics, optimization, and linear algebra)
Understanding both paradigms, and how modern systems increasingly hybridize them, is essential for appreciating why the mathematical foundations covered in this curriculum matter.
2. Symbolic AI (GOFAI: Good Old-Fashioned AI)
Symbolic AI represents knowledge as symbols, rules, and logic, the way humans explicitly describe reasoning.
Key idea: If you can write down the rules, you can build an intelligent system.
Examples:
Expert systems (1970s-80s): MYCIN (medical diagnosis), DENDRAL (chemistry)
Prolog programs
Decision trees with hand-crafted features
Chess engines (Deep Blue combined hand-crafted symbolic evaluation with brute-force search)
Strength: Interpretable, provably correct for defined domains
Weakness: The knowledge-acquisition bottleneck (you can't write rules for everything)
# --- Symbolic AI: A simple expert system for loan approval ---
def expert_loan_system(income, credit_score, debt_ratio, employment_years):
    """
    Hand-crafted rule-based expert system.
    Represents symbolic AI: explicit if-then rules.
    """
    reasons = []
    # Rule 1: Minimum income
    if income < 30000:
        reasons.append("Income below minimum ($30k)")
    # Rule 2: Credit score threshold
    if credit_score < 620:
        reasons.append(f"Credit score {credit_score} below minimum (620)")
    # Rule 3: Debt-to-income ratio
    if debt_ratio > 0.43:
        reasons.append(f"Debt ratio {debt_ratio:.0%} exceeds limit (43%)")
    # Rule 4: Employment stability
    if employment_years < 2:
        reasons.append(f"Employment history {employment_years}y below minimum (2y)")
    # Composite rule: exceptional credit overrides income rule
    if credit_score >= 750 and income >= 25000:
        reasons = [r for r in reasons if "Income" not in r]
    approved = len(reasons) == 0
    return approved, reasons
# Test applicants
applicants = [
    {"name": "Alice", "income": 75000, "credit_score": 720, "debt_ratio": 0.30, "employment_years": 5},
    {"name": "Bob", "income": 28000, "credit_score": 580, "debt_ratio": 0.50, "employment_years": 1},
    {"name": "Charlie", "income": 26000, "credit_score": 760, "debt_ratio": 0.35, "employment_years": 3},
]

print("Expert System - Loan Approval Decisions")
print("=" * 50)
for a in applicants:
    approved, reasons = expert_loan_system(
        a["income"], a["credit_score"], a["debt_ratio"], a["employment_years"]
    )
    status = "✅ APPROVED" if approved else "❌ DENIED"
    print(f"\n{a['name']}: {status}")
    if reasons:
        for r in reasons:
            print(f"  - {r}")
3. Non-Symbolic AI: Statistical & Neural Learning
Non-symbolic AI learns rules from data rather than having them hand-coded.
Key idea: Given enough examples, a statistical model discovers its own patterns.
Three generations:
Statistical ML (1980s-2000s): SVMs, decision trees, random forests
Deep Learning (2012-present): CNNs, RNNs, Transformers
Foundation Models (2020-present): GPT, BERT, CLIP, Gemini
Strength: Handles perceptual problems (vision, speech, language) that are infeasible to encode as explicit rules
Weakness: Requires large amounts of data, is a black box, and can fail unpredictably
# --- Non-Symbolic AI: Same loan problem, learned from data ---
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Generate synthetic training data
np.random.seed(42)
n = 1000
income = np.random.normal(55000, 20000, n).clip(15000, 150000)
credit_score = np.random.normal(680, 80, n).clip(400, 850)
debt_ratio = np.random.uniform(0.1, 0.6, n)
emp_years = np.random.exponential(4, n).clip(0, 20)

# True outcome (based on similar logic to expert system, plus noise)
score = (income/100000)*0.3 + (credit_score/850)*0.4 + (1-debt_ratio)*0.2 + (emp_years/20)*0.1
approved = (score + np.random.normal(0, 0.05, n)) > 0.5
X = np.column_stack([income, credit_score, debt_ratio, emp_years])
y = approved.astype(int)

# Train a logistic regression model
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression()
model.fit(X_scaled, y)

print("Non-Symbolic AI - Learned Model Coefficients")
print("(model discovered these from data, no rules were hard-coded)")
print()
features = ['Income', 'Credit Score', 'Debt Ratio', 'Employment Years']
for feat, coef in zip(features, model.coef_[0]):
    direction = "↑ more → approved" if coef > 0 else "↑ more → denied"
    print(f"  {feat:20s}: coef={coef:+.3f} ({direction})")

# Predict the same applicants
print("\nPredictions on same applicants:")
for a in applicants:
    x = scaler.transform([[a['income'], a['credit_score'], a['debt_ratio'], a['employment_years']]])
    prob = model.predict_proba(x)[0][1]
    status = "✅ APPROVED" if prob > 0.5 else "❌ DENIED"
    print(f"  {a['name']}: {status} (probability={prob:.2%})")
4. Control Theory: The "First AI"
Control theory studies how to make a system behave the way you want using feedback loops.
Developed in the 1940s-50s (Norbert Wiener's Cybernetics, 1948), control theory solved intelligent adaptive behavior before AI was formalized.
Goal (setpoint)
       |
       v
  +--> [Controller] --> [System/Plant] --> Output
  |                                          |
  |  Error                                   |
  +------- [Sensor] <------------------------+
                 (Feedback)
Examples: Cruise control, thermostat, autopilot, insulin pump, rocket guidance
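The thermostat in this list uses the simplest feedback strategy of all: bang-bang (on/off) control. A minimal sketch of that idea, as a contrast to the PID controller below; the function name and the heating/cooling constants are illustrative choices, not values from the chapter:

```python
def bang_bang_thermostat(setpoint=70.0, hysteresis=1.0, steps=200):
    """On/off feedback control: the heater is either fully on or fully off.
    The hysteresis band around the setpoint prevents rapid switching."""
    temp = 20.0
    heater_on = False
    temps = [temp]
    for _ in range(steps):
        # Feedback: compare the measurement to the setpoint, with hysteresis
        if temp < setpoint - hysteresis:
            heater_on = True
        elif temp > setpoint + hysteresis:
            heater_on = False
        heating = 2.0 if heater_on else 0.0      # illustrative heating rate
        temp += heating - (temp - 20.0) * 0.02   # ambient cooling toward 20 °C
        temps.append(temp)
    return temps

temps = bang_bang_thermostat()
print(f"Final temperature: {temps[-1]:.1f} °C (oscillates inside the hysteresis band)")
```

Unlike PID, bang-bang control never settles: the output perpetually oscillates around the setpoint, which is why smoother control laws were developed.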
Connection to ML: Gradient descent is a feedback control loop:
Setpoint = 0 loss
Error = current loss
Controller = optimizer (SGD, Adam)
System = neural network
# --- PID Controller: The classic control theory algorithm ---
import matplotlib.pyplot as plt

class PIDController:
    """
    Proportional-Integral-Derivative (PID) controller.
    The foundational algorithm of control theory.
    Output = Kp*error + Ki*integral(error) + Kd*derivative(error)
    """
    def __init__(self, Kp, Ki, Kd, dt=0.1):
        self.Kp = Kp  # Proportional gain: react to current error
        self.Ki = Ki  # Integral gain: react to accumulated past error
        self.Kd = Kd  # Derivative gain: react to rate of error change
        self.dt = dt
        self.integral = 0
        self.prev_error = 0

    def step(self, setpoint, measured_value):
        error = setpoint - measured_value
        # P: proportional to current error
        P = self.Kp * error
        # I: integral of past errors (eliminates steady-state error)
        self.integral += error * self.dt
        I = self.Ki * self.integral
        # D: derivative (predicts future error, dampens oscillation)
        derivative = (error - self.prev_error) / self.dt
        D = self.Kd * derivative
        self.prev_error = error
        return P + I + D
# Simulate a temperature control system
# Target: reach 70°C from 20°C; the system has thermal inertia
def simulate_temperature_control(Kp, Ki, Kd, setpoint=70.0, steps=200):
    pid = PIDController(Kp, Ki, Kd, dt=0.1)
    temp = 20.0
    temps = [temp]
    for _ in range(steps):
        control = pid.step(setpoint, temp)
        # System dynamics: temperature changes proportionally to control input
        # but with inertia (slow response) and cooling (drag)
        temp += 0.1 * (control * 0.5 - (temp - 20) * 0.02)
        temps.append(temp)
    return temps
# Compare different PID tunings
configs = [
    ("P only (Kp=1.0)", 1.0, 0.0, 0.0, 'blue'),
    ("PD (Kp=1.0, Kd=0.5)", 1.0, 0.0, 0.5, 'green'),
    ("PID (Kp=1.0, Ki=0.1, Kd=0.5)", 1.0, 0.1, 0.5, 'red'),
]

t = np.arange(0, 20.1, 0.1)
fig, ax = plt.subplots(figsize=(12, 5))
ax.axhline(y=70, color='black', linestyle='--', linewidth=2, label='Setpoint (70°C)', alpha=0.7)
for label, Kp, Ki, Kd, color in configs:
    temps = simulate_temperature_control(Kp, Ki, Kd)
    ax.plot(t, temps, color=color, linewidth=2, label=label)
ax.set_xlabel('Time (seconds)', fontsize=12)
ax.set_ylabel('Temperature (°C)', fontsize=12)
ax.set_title('PID Controller: Temperature Control', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(15, 85)
plt.tight_layout()
plt.show()

print("\nPID components:")
print("  P (Proportional): Reacts to current error → fast but may overshoot")
print("  I (Integral): Eliminates steady-state error → fixes offset")
print("  D (Derivative): Predicts future error → dampens oscillations")
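One classic refinement of the controller above: when the setpoint jumps abruptly (say, 70°C to 100°C), the error derivative spikes for a single step and kicks the actuator ("derivative kick"). A common remedy is to differentiate the measurement instead of the error. A hedged sketch of that variant; the class name is ours, not from the chapter:

```python
class PIDNoKick:
    """PID variant that differentiates the measurement rather than the error,
    so a setpoint step does not produce a one-step derivative spike."""
    def __init__(self, Kp, Ki, Kd, dt=0.1):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.integral = 0.0
        self.prev_measured = None

    def step(self, setpoint, measured):
        error = setpoint - measured
        self.integral += error * self.dt
        # d(error)/dt = -d(measured)/dt while the setpoint is constant,
        # so using the measurement changes nothing in steady operation
        # but stays flat when the setpoint jumps
        if self.prev_measured is None:
            d_measured = 0.0
        else:
            d_measured = (measured - self.prev_measured) / self.dt
        self.prev_measured = measured
        return self.Kp * error + self.Ki * self.integral - self.Kd * d_measured

pid = PIDNoKick(Kp=1.0, Ki=0.0, Kd=0.5)
print(pid.step(70, 20))   # normal operation
print(pid.step(100, 20))  # setpoint jumps 70 → 100: no derivative spike
```

With the standard formulation, the second call would see d(error)/dt = 30/0.1 = 300 and add a 150-unit kick; here the output simply tracks the new, larger proportional term.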
5. From Control Theory to Machine Learning
The mathematical bridge between feedback systems and neural network training
The parallels between control theory and machine learning are not mere analogy; they reflect a deep mathematical kinship. Both fields solve the same fundamental problem: given a system with tunable parameters, how do you adjust those parameters to minimize the discrepancy between desired and actual behavior? The terminology differs, but the equations are strikingly similar.
| Control Theory | Machine Learning |
|---|---|
| Setpoint / Reference | Target label / desired output |
| Error signal | Loss function |
| Controller | Optimizer (SGD, Adam) |
| Plant / System | Neural network |
| Feedback loop | Backpropagation |
| Stability analysis | Convergence proofs |
| PID gains (Kp, Ki, Kd) | Learning rate, momentum, \(\beta_1\), \(\beta_2\) |
| Integral windup | Gradient accumulation (Adam) |
| Derivative kick | Momentum in optimizers |
The PID controller's integral term accumulates past errors to eliminate steady-state offset, much like Adam's first moment \(m_t\) accumulates gradient history. The derivative term anticipates future error by measuring the rate of change, analogous to how momentum dampens oscillations during training. Even modern concepts like learning rate schedules (warm-up, cosine annealing) mirror the gain scheduling techniques that control engineers have used for decades. Reinforcement learning makes this connection explicit: the Bellman equation in RL is a discrete-time version of control theory's Hamilton-Jacobi-Bellman equation.
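The integral-like role of Adam's first moment \(m_t\) can be made concrete with a minimal, self-contained 1D Adam update. This is a simplified sketch, not a drop-in for any library's implementation; the hyperparameter values are the conventional defaults, and the toy objective is ours:

```python
from math import sqrt

def adam_1d(grad, w0=10.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Minimal one-dimensional Adam. The first moment m is an exponentially
    forgetting integral of past gradients (the PID 'I' term analogue);
    the second moment v rescales the step size."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # accumulate gradient history
        v = beta2 * v + (1 - beta2) * g * g   # accumulate squared magnitude
        m_hat = m / (1 - beta1 ** t)          # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (sqrt(v_hat) + eps)
    return w

# Toy objective: L(w) = (w - 3)^2, gradient 2(w - 3), minimum at w = 3
w_final = adam_1d(lambda w: 2 * (w - 3.0))
print(f"Adam settled near w = {w_final:.3f}")
```

Just as an integral term keeps pushing while any steady error remains, \(m_t\) keeps pushing in the average gradient direction even when individual gradients are noisy.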
# --- Gradient Descent as a Control Loop ---
# Training a simple model IS running a PID-like feedback loop

# Simple 1D loss landscape: L(w) = (w - 3)^2
# True minimum at w = 3
def loss(w):
    return (w - 3.0) ** 2

def grad_loss(w):
    return 2 * (w - 3.0)

# Compare: plain gradient descent vs momentum (more like PID)
def gradient_descent(w0=10.0, lr=0.1, steps=30):
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * grad_loss(w)
        history.append(w)
    return history

def gradient_descent_momentum(w0=10.0, lr=0.1, beta=0.9, steps=30):
    w = w0
    v = 0  # velocity (integral term)
    history = [w]
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_loss(w)  # exponential moving average
        w = w - lr * v
        history.append(w)
    return history
gd = gradient_descent()
gd_mom = gradient_descent_momentum()
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# Parameter trajectory
axes[0].plot(gd, 'bo-', markersize=4, label='Plain GD')
axes[0].plot(gd_mom, 'rs-', markersize=4, label='GD + Momentum')
axes[0].axhline(y=3, color='green', linestyle='--', label='Optimal w=3')
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Parameter w')
axes[0].set_title('Parameter Trajectory')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Loss trajectory
axes[1].semilogy([loss(w) for w in gd], 'bo-', markersize=4, label='Plain GD')
axes[1].semilogy([loss(w) for w in gd_mom], 'rs-', markersize=4, label='GD + Momentum')
axes[1].set_xlabel('Step')
axes[1].set_ylabel('Loss (log scale)')
axes[1].set_title('Loss Convergence')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Gradient descent = feedback control:")
print(f" Setpoint: w* = 3.0")
print(f" Final GD parameter: w = {gd[-1]:.6f}")
print(f" Final Momentum parameter: w = {gd_mom[-1]:.6f}")
6. Symbolic vs Non-Symbolic: A Modern Perspective
Todayβs most powerful AI systems are hybrids:
| System | Symbolic component | Non-symbolic component |
|---|---|---|
| AlphaGo | Tree search (MCTS) | Value/policy networks |
| GPT + tools | Tool-use reasoning (code execution) | Transformer next-token prediction |
| AlphaCode | Test-based filtering of candidate programs | LLM code generation |
| Theorem provers (Lean+AI) | Formal proof system | Neural proof search |
The trend: Non-symbolic AI (LLMs) is increasingly incorporating symbolic reasoning through chain-of-thought, tool use, and formal verification.
Summary
Symbolic AI: explicit rules, interpretable, brittle for perception tasks
Non-symbolic AI: learned from data, powerful for perception, less interpretable
Control theory pioneered feedback learning before ML: same math, different framing
PID controller: P=react to error, I=fix accumulated error, D=predict future error
Gradient descent is a control feedback loop: error = loss, controller = optimizer
Exercises
Modify the expert loan system to add a new rule: if both income > 100k AND credit_score > 750, auto-approve regardless of other factors.
Which kind of AI (symbolic or non-symbolic) would you use for: (a) spam detection, (b) medical diagnosis requiring explanation, (c) real-time language translation? Justify.
In the PID simulation, what happens when you increase Kd (derivative gain) too much? Run the code and observe.
Map gradient descent to the PID framework: what corresponds to P, I, and D in a standard SGD update with momentum?
Reinforcement learning (RL) is often described as control theory + ML. Research: what is the "Bellman equation" and how does it relate to control theory's Hamilton-Jacobi-Bellman equation?