import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
Advanced Curriculum Learning Theory
1. Foundations and Motivation
Definition: Curriculum Learning is a training strategy where a model is trained on progressively more complex data, analogous to human learning from simple to difficult concepts.
Historical Context:
Bengio et al. (2009): Introduced curriculum learning inspired by human education
Key insight: Easier examples provide better gradient signal early in training
Biological motivation: Humans learn better with structured curricula
Fundamental Hypothesis:
Training with curriculum leads to better local minima than random sampling.
2. Theoretical Foundations
2.1 Convergence Analysis
Theorem (Bengio et al., 2009): Under certain conditions, curriculum learning provides:
Faster convergence to local minima
Escape from poor local minima
Better generalization through regularization
Loss landscape perspective:
Early training with easy examples shapes the loss landscape:
Where \(R(\theta)\) acts as implicit regularization.
Convergence rate:
With curriculum: \(O\left(\frac{1}{\sqrt{T_{\text{easy}} + T_{\text{hard}}}}\right)\)
Random: \(O\left(\frac{1}{\sqrt{T}}\right)\)
The key is that the \(T_{\text{easy}}\) phase uses easier gradients → faster initial progress.
2.2 Information-Theoretic View
Entropy-based difficulty:
Low entropy (high confidence) → Easy sample
High entropy (uncertain) → Hard sample
Curriculum as progressive entropy increase:
Gradually expose model to higher-entropy (more uncertain) examples.
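The entropy-based ranking above can be sketched directly; a minimal example assuming softmax outputs `probs` of shape (N, C):

```python
import numpy as np

def entropy_difficulty(probs, eps=1e-12):
    """Predictive entropy as a difficulty score: higher entropy = harder."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=1)

# A confident prediction is "easy"; a uniform one is maximally "hard"
confident = np.array([[0.98, 0.01, 0.01]])
uniform = np.array([[1/3, 1/3, 1/3]])
```

A curriculum would then sort samples by this score and expose low-entropy examples first.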
2.3 PAC Learning Framework
Sample complexity with curriculum:
Compared to standard:
Curriculum reduces dependence on dimension \(d\) through structured sampling.
3. Difficulty Metrics Taxonomy
3.1 Model-Based Metrics
Prediction Confidence:
Low confidence → High difficulty
Loss-Based:
Directly use training loss as difficulty proxy.
Prediction Variance (Ensembles):
Use ensemble disagreement as uncertainty measure.
Teacher-Student:
Use divergence from pre-trained teacher as difficulty.
3.2 Data-Based Metrics
Prototype Distance:
Distance to nearest class centroid \(\mu_c = \frac{1}{|C_c|}\sum_{x_i \in C_c} x_i\)
Manifold Density:
Low density → Outlier → Difficult
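A minimal numpy sketch of the prototype-distance score, assuming a feature matrix `X` and labels `y` (distance to the sample's own class centroid):

```python
import numpy as np

def prototype_difficulty(X, y):
    """Difficulty = Euclidean distance to the sample's class centroid mu_c."""
    scores = np.empty(len(X))
    for c in np.unique(y):
        mask = y == c
        mu_c = X[mask].mean(axis=0)                 # class centroid
        scores[mask] = np.linalg.norm(X[mask] - mu_c, axis=1)
    return scores
```

Samples far from their centroid (outliers within their class) receive the highest difficulty.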
3.3 Domain-Specific Metrics
For Images:
Image complexity: Edge density, color variance
Occlusion level
Object size
Multi-object scenes
For NLP:
Sentence length
Syntactic complexity (parse tree depth)
Vocabulary rarity
Semantic ambiguity
For RL:
Episode length
Reward sparsity
Action space complexity
4. Curriculum Scheduling Strategies
4.1 Predefined Schedules
Linear Schedule:
Simplest: Linearly increase difficulty threshold.
Exponential Schedule:
Rapid early growth, then saturation. Good for quick convergence.
Root Schedule:
Conservative growth, gradual difficulty increase.
Step Schedule:
Discrete stages, clear phase transitions.
Cosine Schedule:
Smooth S-curve, used in cosine annealing variants.
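The five schedules above can be collected into a single pacing function. This is a sketch: the exponential form matches the `1 - exp(-3p)` used by `TransferTeacherCurriculum` later in this notebook, while the other constants are illustrative choices.

```python
import numpy as np

def difficulty_threshold(t, T, schedule='linear'):
    """Fraction of the difficulty-sorted dataset available at epoch t of T."""
    p = t / T
    if schedule == 'linear':
        return p
    if schedule == 'exponential':              # rapid early growth, saturates
        return 1 - np.exp(-3 * p)
    if schedule == 'root':                     # conservative, gradual growth
        return np.sqrt(p)
    if schedule == 'step':                     # four discrete stages
        return min(1.0, 0.25 * (1 + np.floor(4 * p)))
    if schedule == 'cosine':                   # smooth S-curve
        return 0.5 * (1 - np.cos(np.pi * p))
    raise ValueError(f"unknown schedule: {schedule}")
```

Each epoch, the training pool is the easiest `difficulty_threshold(t, T)` fraction of the sorted dataset.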
4.2 Adaptive Schedules
Performance-Based:
Increase difficulty only when performance threshold met.
Loss-Based:
Accelerate when loss decreases rapidly (good progress).
Gradient-Based:
Faster pacing when gradients are large (strong signal).
5. Self-Paced Learning
5.1 Formulation
Joint Optimization (Kumar et al., 2010):
\(\min_{\theta, v} \sum_i v_i L_i(\theta) - \frac{\lambda}{2}\|v\|^2\)
Subject to: \(v_i \in [0, 1]\)
Where:
\(v_i\): Weight for sample \(i\) (0 = exclude, 1 = include)
\(\lambda\): Pacing parameter (controls curriculum speed)
Alternating optimization:
Fix \(v\), update \(\theta\) (train model)
Fix \(\theta\), update \(v\) (select samples)
Sample selection rule:
In the hard-weighting case, \(v_i^* = 1\) if \(L_i(\theta) < \lambda\) and \(0\) otherwise: the model selects samples with loss below the threshold.
5.2 Self-Paced Regularization
Smooth variant:
Soft weights instead of hard 0/1.
Diversity regularization:
Encourage selecting diverse samples, not just easy ones.
6. Transfer Teacher Curriculum
6.1 Knowledge Distillation-Based
Teacher provides curriculum:
Samples where teacher is confident are easier.
Progressive distillation:
Epoch \(t\): Train on samples where teacher accuracy > \(\tau(t)\)
6.2 Cross-Domain Transfer
Pre-train on easy domain, fine-tune on hard domain:
Example: ImageNet (easy) → Medical imaging (hard)
Gradual domain shift:
where \(\alpha(t)\) decreases from 1 to 0.
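A batch-level sketch of this gradual shift, assuming `source_idx`/`target_idx` are index arrays for the two domains and taking the linear choice \(\alpha(t) = 1 - t/T\):

```python
import numpy as np

def mixed_domain_batch(source_idx, target_idx, t, T, batch_size, rng):
    """Draw a batch whose source-domain share alpha(t) decays from 1 to 0."""
    alpha = 1 - t / T
    n_src = int(round(alpha * batch_size))
    src = rng.choice(source_idx, size=n_src, replace=False)
    tgt = rng.choice(target_idx, size=batch_size - n_src, replace=False)
    return np.concatenate([src, tgt])
```

At \(t = 0\) the batch is pure source domain; at \(t = T\) it is pure target domain.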
7. Anti-Curriculum Learning
7.1 Hard-to-Easy Training
Motivation: Train on hard examples first to learn robust features
Schedule:
Decreasing difficulty threshold (opposite of curriculum).
When to use:
Adversarial robustness
Domain adaptation (hard target domain first)
Debiasing (prioritize minority/hard classes)
7.2 Hard Example Mining
Hard Negative Mining (Object Detection):
Train on misclassified/hard negatives:
Focal Loss (Lin et al., 2017):
Automatically down-weights easy examples, focuses on hard ones.
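A numpy sketch of the focal loss, assuming predicted class probabilities `probs` of shape (N, C); setting γ=0 recovers ordinary cross-entropy:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """FL = -(1 - p_t)^gamma * log(p_t), averaged over the batch.
    Larger gamma down-weights easy (high-confidence) examples."""
    p_t = np.clip(probs[np.arange(len(targets)), targets], eps, 1.0)
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))
```

Because \((1 - p_t)^\gamma\) is small when the model is already confident, easy examples contribute little, concentrating the gradient on hard ones.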
8. Multi-Task Curriculum
8.1 Task-Level Curriculum
Task difficulty ordering:
Example: POS tagging → Parsing → Semantic role labeling
Joint formulation:
where \(w_k(t)\) increases for harder tasks over time.
8.2 Auxiliary Task Curriculum
Progressive dropping:
Early training: Use multiple auxiliary tasks for regularization
Late training: Focus on main task only
Task weights:
Exponentially decay auxiliary task weight.
9. Curriculum for Different Architectures
9.1 Vision Transformers
Patch size curriculum:
Early: Large patches (low resolution, easier)
Late: Small patches (high resolution, harder)
9.2 Language Models
Sequence length curriculum:
Start with short sequences, gradually increase.
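A minimal sketch of a length-based curriculum sampler. Assumptions: `lengths` is an array of sequence lengths, and the available pool grows linearly from `min_frac` of the data to all of it.

```python
import numpy as np

def length_curriculum_batches(lengths, epoch, total_epochs,
                              batch_size=4, min_frac=0.2):
    """Yield index batches restricted to the shortest sequences early on."""
    order = np.argsort(lengths)                       # short -> long
    frac = min_frac + (1 - min_frac) * epoch / total_epochs
    pool = order[:max(batch_size, int(frac * len(lengths)))]
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```

Early epochs see only short sequences; by the final epoch every sequence is eligible.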
9.3 Reinforcement Learning
Environment complexity:
Simple mazes → Complex mazes
Few obstacles → Many obstacles
Deterministic → Stochastic
Reward shaping curriculum:
Early: Dense rewards (easier) Late: Sparse rewards (harder, more realistic)
10. Theoretical Guarantees and Analysis
10.1 Generalization Bounds
Theorem (Hacohen & Weinshall, 2019):
With appropriate curriculum, test error satisfies:
where \(\beta\) is curriculum benefit factor (0 < \(\beta\) < 1).
10.2 Sample Complexity
Curriculum reduces required samples:
Logarithmic improvement in dimension dependency.
10.3 Convergence Rate
SGD with curriculum:
where \(\beta_{\text{curriculum}} < 0\) indicates acceleration.
11. Practical Considerations
11.1 Hyperparameter Tuning
Key hyperparameters:
Schedule type (linear, exponential, adaptive)
Pacing speed (how fast to increase difficulty)
Initial difficulty threshold
Batch composition (mixed vs pure difficulty levels)
Guidelines:
Conservative pacing: Slower curriculum for complex tasks
Aggressive pacing: Faster for well-structured data
Adaptive: Monitor validation loss, adjust dynamically
11.2 Computational Overhead
Difficulty scoring cost:
Model-based: Requires forward pass → \(O(n)\) per epoch
Data-based: Pre-compute once → \(O(1)\) per epoch
Amortization strategies:
Compute difficulty every K epochs, not every epoch
Use cheaper proxy metrics (e.g., image complexity for vision)
Cache difficulty scores and update periodically
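The amortization strategies above can be sketched as a tiny cache wrapper (the `score_fn` callable, which computes the full difficulty array, is a hypothetical placeholder):

```python
class CachedDifficulty:
    """Recompute difficulty scores only every `refresh_every` epochs,
    amortizing the O(n) scoring cost across epochs."""
    def __init__(self, score_fn, refresh_every=5):
        self.score_fn = score_fn
        self.refresh_every = refresh_every
        self._cache = None

    def get(self, epoch):
        if self._cache is None or epoch % self.refresh_every == 0:
            self._cache = self.score_fn()      # expensive: full forward passes
        return self._cache
```

With `refresh_every=5`, scoring runs on epochs 0, 5, 10, ...; all other epochs reuse the cached array.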
11.3 Curriculum Design Workflow
Define difficulty metric: Choose appropriate measure for domain
Sort/score dataset: Assign difficulty to each sample
Choose schedule: Select pacing strategy (linear, adaptive, etc.)
Monitor performance: Track convergence on validation set
Adjust if needed: Modify schedule if curriculum too fast/slow
12. Advanced Techniques
12.1 Mixture of Difficulties
Instead of pure batches, mix difficulties:
Benefits:
Prevents overfitting to easy examples
Maintains gradient diversity
Smoother transition
12.2 Dynamic Difficulty Adjustment
Per-sample pacing:
Increase difficulty for samples model handles well.
12.3 Curriculum Dropout
Randomly drop curriculum constraint with probability \(p(t)\):
where \(p(t)\) increases over time β gradual transition to random.
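A sketch of curriculum dropout at the batch level, assuming `sorted_idx` is the dataset sorted easy-to-hard and `threshold_frac` is the current curriculum threshold:

```python
import numpy as np

def curriculum_dropout_batch(sorted_idx, threshold_frac, batch_size,
                             p_drop, rng):
    """With probability p_drop, ignore the curriculum and sample uniformly;
    otherwise sample from the easiest threshold_frac of the data."""
    if rng.random() < p_drop:
        pool = sorted_idx                          # unconstrained sampling
    else:
        k = max(batch_size, int(threshold_frac * len(sorted_idx)))
        pool = sorted_idx[:k]
    return rng.choice(pool, size=batch_size, replace=False)
```

Growing `p_drop` over training smoothly interpolates from curriculum sampling to standard random sampling.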
13. Failure Modes and Mitigations
13.1 Common Pitfalls
Premature convergence:
Curriculum too slow → Model overfits easy examples
Mitigation: Monitor validation performance, accelerate if needed
Catastrophic forgetting:
Hard examples introduced too late → Forget easy patterns
Mitigation: Mixed batches, periodic review of easy examples
Poor difficulty metric:
Metric doesn't align with true difficulty
Mitigation: Validate metric with human annotation, try multiple metrics
13.2 Debugging Strategies
Difficulty distribution analysis: Plot difficulty scores, check for reasonable spread
Learning curves: Compare curriculum vs. random sampling convergence
Sample inspection: Manually verify easy/medium/hard samples make sense
14. State-of-the-Art Methods
14.1 Competence-Based Curriculum (Graves et al., 2017)
Signal-to-noise ratio:
High gradient magnitude + low variance → Good for learning
14.2 Curriculum by Smoothing (Spitkovsky et al., 2010)
Progressively reduce data augmentation/noise:
Start with smoothed (easier) data, converge to original.
14.3 Automatic Curriculum Learning (ACL)
Meta-learning for curriculum:
Learn curriculum policy \(\phi\) that generates sequences \(\tau\) optimizing validation loss.
15. Empirical Results and Benchmarks
15.1 Vision Tasks
CIFAR-10/100:
Curriculum: 2-5% accuracy improvement
Faster convergence: 20-30% fewer epochs
ImageNet:
Curriculum: 1-2% top-1 accuracy gain
Reduced training time: 15-20%
15.2 NLP Tasks
Machine Translation:
Curriculum (short→long sentences): 2-4 BLEU improvement
Faster convergence: 25% fewer steps
Language Modeling:
Perplexity reduction: 5-10%
Especially effective for low-resource languages
15.3 Reinforcement Learning
Atari Games:
Curriculum (easy→hard levels): 20-40% higher scores
More stable training
Robotics:
Sim-to-real transfer: Curriculum reduces reality gap
Progressive environment complexity improves generalization
16. Key Papers and Timeline
Foundational (2009-2012):
Bengio et al. 2009: Curriculum Learning - Original concept and motivation
Kumar et al. 2010: Self-Paced Learning - Model selects easy samples
Lee & Grauman 2011: Learning the Easy Things First - Attributes curriculum
Methods (2013-2017):
Jiang et al. 2015: Self-Paced Learning with Diversity - Diverse sample selection
Graves et al. 2017: Automated Curriculum Learning - Meta-learning curriculum
Hacohen & Weinshall 2019: On The Power of Curriculum Learning - Theoretical analysis
Applications (2018-2024):
Soviany et al. 2021: Curriculum Learning Survey - Comprehensive review
Xu et al. 2020: Curriculum Learning for NLP - Text-specific strategies
Narvekar et al. 2020: Curriculum Learning for RL - RL-specific methods
Computational Complexity Analysis
Difficulty computation:
Model-based: \(O(n \cdot C_{\text{forward}})\) where \(C_{\text{forward}}\) is forward pass cost
Data-based: \(O(n \cdot d)\) where d is feature dimension
Curriculum overhead:
Sorting: \(O(n \log n)\) (one-time or periodic)
Sample selection per epoch: \(O(n)\) (filtering by threshold)
Total complexity:
Typically \(C_{\text{difficulty}} + C_{\text{selection}} \ll C_{\text{training}}\), so overhead is negligible.
"""
Advanced Curriculum Learning Implementations
This cell provides production-ready implementations of:
1. Self-Paced Learning (SPL) with soft sample weighting
2. Transfer Teacher Curriculum (knowledge distillation-based)
3. Competence-Based Curriculum (gradient-based difficulty)
4. Multi-Task Curriculum Learning
5. Adaptive Curriculum Scheduler
6. Mixture Curriculum (mixed difficulty batches)
7. Curriculum evaluation and visualization tools
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader, Subset, WeightedRandomSampler
from sklearn.metrics import accuracy_score
import copy
from collections import defaultdict
# ============================================================================
# Self-Paced Learning (SPL)
# ============================================================================
class SelfPacedLearning:
"""
Self-Paced Learning (Kumar et al., 2010)
Theory:
- Jointly optimize model parameters θ and sample weights v
- min_{θ,v} Σ v_i·L_i - λ/2·||v||²
- v_i ∈ [0,1] controls sample inclusion
- λ: pacing parameter (increases over time)
"""
def __init__(self, model, lambda_init=1.0, lambda_growth=1.1,
soft=True, temperature=1.0):
"""
Args:
model: Neural network to train
lambda_init: Initial pacing parameter
lambda_growth: Growth rate per epoch (λ_t = λ_{t-1} * growth)
soft: Use soft weights (True) or hard 0/1 (False)
temperature: Temperature for soft weighting
"""
self.model = model
self.lambda_param = lambda_init
self.lambda_growth = lambda_growth
self.soft = soft
self.temperature = temperature
self.sample_weights = None
def compute_sample_weights(self, losses):
"""
Compute sample weights based on current losses
Args:
losses: (N,) array of per-sample losses
Returns:
weights: (N,) array of sample weights in [0,1]
"""
if self.soft:
# Soft weighting: v_i = λ / (λ + L_i / T)
weights = self.lambda_param / (self.lambda_param + losses / self.temperature)
else:
# Hard weighting: v_i = 1 if L_i < λ else 0
weights = (losses < self.lambda_param).astype(np.float32)
return weights
def train_epoch(self, train_loader, optimizer, device='cpu'):
"""
Train for one epoch with self-paced sample selection
Returns:
avg_loss: Average weighted loss
inclusion_rate: Fraction of samples with weight > 0.5
"""
self.model.train()
# First pass: compute losses for all samples
all_losses = []
all_data = []
with torch.no_grad():
for x, y in train_loader:
x, y = x.to(device), y.to(device)
output = self.model(x)
loss = F.cross_entropy(output, y, reduction='none')
all_losses.extend(loss.cpu().numpy())
all_data.append((x, y))
all_losses = np.array(all_losses)
# Compute sample weights
weights = self.compute_sample_weights(all_losses)
self.sample_weights = weights
# Second pass: train with weighted loss
total_loss = 0
num_batches = 0
idx = 0
for x, y in all_data:
batch_size = len(y)
batch_weights = torch.tensor(
weights[idx:idx+batch_size],
dtype=torch.float32,
device=device
)
idx += batch_size
# Forward pass
output = self.model(x)
loss = F.cross_entropy(output, y, reduction='none')
# Weighted loss
weighted_loss = (loss * batch_weights).mean()
# Backward pass
optimizer.zero_grad()
weighted_loss.backward()
optimizer.step()
total_loss += weighted_loss.item()
num_batches += 1
# Update pacing parameter
self.lambda_param *= self.lambda_growth
avg_loss = total_loss / num_batches
inclusion_rate = (weights > 0.5).mean()
return avg_loss, inclusion_rate
# ============================================================================
# Transfer Teacher Curriculum
# ============================================================================
class TransferTeacherCurriculum:
"""
Use pre-trained teacher to guide curriculum
Theory:
- Teacher provides difficulty scores based on confidence
- D_i = 1 - max_y p_teacher(y|x_i)
- Train student on samples where teacher is confident (low D_i)
"""
def __init__(self, student_model, teacher_model, schedule='linear'):
"""
Args:
student_model: Model to train
teacher_model: Pre-trained model (frozen)
schedule: Difficulty threshold schedule
"""
self.student = student_model
self.teacher = teacher_model
self.teacher.eval() # Freeze teacher
self.schedule = schedule
self.difficulties = None
def compute_difficulties(self, dataset, device='cpu'):
"""
Compute teacher-based difficulty for all samples
Returns:
difficulties: (N,) array of difficulty scores
"""
loader = DataLoader(dataset, batch_size=256, shuffle=False)
difficulties = []
with torch.no_grad():
for x, y in loader:
x = x.to(device)
output = self.teacher(x)
probs = F.softmax(output, dim=1)
# Difficulty = 1 - max probability (teacher confidence)
max_probs = probs.max(dim=1)[0]
difficulty = 1 - max_probs
difficulties.extend(difficulty.cpu().numpy())
self.difficulties = np.array(difficulties)
return self.difficulties
def get_curriculum_subset(self, epoch, total_epochs):
"""
Get sample indices for current epoch based on teacher difficulty
Args:
epoch: Current epoch (0-indexed)
total_epochs: Total number of epochs
Returns:
indices: Array of selected sample indices
"""
progress = epoch / total_epochs
if self.schedule == 'linear':
threshold = progress
elif self.schedule == 'exponential':
threshold = 1 - np.exp(-3 * progress)
else:
threshold = 1.0
# Select samples below difficulty threshold
max_difficulty = np.percentile(self.difficulties, threshold * 100)
indices = np.where(self.difficulties <= max_difficulty)[0]
return indices
# ============================================================================
# Competence-Based Curriculum
# ============================================================================
class CompetenceBasedCurriculum:
"""
Competence-Based Curriculum (Graves et al., 2017)
Theory:
- Difficulty based on gradient signal-to-noise ratio
- D_i = ||∇L_i|| / Var(L_i)
- Prioritize samples with strong, consistent gradients
"""
def __init__(self, model, window_size=100):
"""
Args:
model: Neural network
window_size: Window for computing loss variance
"""
self.model = model
self.window_size = window_size
self.loss_history = defaultdict(list) # Per-sample loss history
def compute_competence_scores(self, dataset, device='cpu'):
"""
Compute competence score for each sample
Returns:
scores: (N,) array where higher = better for learning
"""
loader = DataLoader(dataset, batch_size=1, shuffle=False)
scores = []
for idx, (x, y) in enumerate(loader):
x, y = x.to(device), y.to(device)
x.requires_grad = True
# Forward pass
output = self.model(x)
loss = F.cross_entropy(output, y)
# Compute input-gradient magnitude (clear stale parameter grads first)
self.model.zero_grad()
loss.backward()
grad_norm = torch.norm(x.grad).item()
# Track loss history
self.loss_history[idx].append(loss.item())
if len(self.loss_history[idx]) > self.window_size:
self.loss_history[idx].pop(0)
# Compute variance
if len(self.loss_history[idx]) >= 2:
loss_var = np.var(self.loss_history[idx])
loss_var = max(loss_var, 1e-6) # Avoid division by zero
else:
loss_var = 1.0
# Competence = gradient magnitude / loss variance
competence = grad_norm / loss_var
scores.append(competence)
return np.array(scores)
# ============================================================================
# Multi-Task Curriculum
# ============================================================================
class MultiTaskCurriculum:
"""
Curriculum across multiple tasks
Theory:
- Train on easier tasks first, progressively add harder tasks
- L_total = Σ w_k(t)·L_k where w_k increases for harder tasks
"""
def __init__(self, model, task_difficulties, schedule='linear'):
"""
Args:
model: Multi-task model with task-specific heads
task_difficulties: List of task difficulty scores (higher = harder)
schedule: How to schedule task weights over time
"""
self.model = model
self.task_difficulties = np.array(task_difficulties)
self.num_tasks = len(task_difficulties)
self.schedule = schedule
# Sort tasks by difficulty
self.task_order = np.argsort(self.task_difficulties)
def get_task_weights(self, epoch, total_epochs):
"""
Compute task weights for current epoch
Returns:
weights: (num_tasks,) array of task weights
"""
progress = epoch / total_epochs
weights = np.zeros(self.num_tasks)
if self.schedule == 'sequential':
# Train one task at a time in order
task_idx = min(int(progress * self.num_tasks), self.num_tasks - 1)
weights[self.task_order[task_idx]] = 1.0
elif self.schedule == 'progressive':
# Gradually add tasks
num_active = int(progress * self.num_tasks) + 1
for i in range(num_active):
weights[self.task_order[i]] = 1.0
weights = weights / weights.sum()
elif self.schedule == 'smooth':
# Smooth transition with sigmoid
for i, task_idx in enumerate(self.task_order):
# Activate task i at progress i/num_tasks
activation_point = i / self.num_tasks
weights[task_idx] = 1 / (1 + np.exp(-10 * (progress - activation_point)))
weights = weights / weights.sum()
return weights
# ============================================================================
# Adaptive Curriculum Scheduler
# ============================================================================
class AdaptiveCurriculumScheduler:
"""
Adaptively adjust curriculum pace based on validation performance
Theory:
- If validation accuracy increases, accelerate curriculum
- If validation accuracy decreases, slow down or revert
"""
def __init__(self, initial_threshold=0.3, acceleration=0.1,
patience=3, min_threshold=0.0, max_threshold=1.0):
"""
Args:
initial_threshold: Starting difficulty threshold
acceleration: How much to increase threshold on success
patience: Epochs to wait before adjusting
min_threshold: Minimum threshold value
max_threshold: Maximum threshold value
"""
self.threshold = initial_threshold
self.acceleration = acceleration
self.patience = patience
self.min_threshold = min_threshold
self.max_threshold = max_threshold
self.best_val_acc = 0.0
self.wait = 0
self.threshold_history = [initial_threshold]
def step(self, val_acc):
"""
Update threshold based on validation accuracy
Args:
val_acc: Current validation accuracy
Returns:
new_threshold: Updated difficulty threshold
"""
if val_acc > self.best_val_acc:
# Improvement: accelerate curriculum
self.best_val_acc = val_acc
self.wait = 0
self.threshold = min(self.threshold + self.acceleration, self.max_threshold)
else:
# No improvement: wait or decelerate
self.wait += 1
if self.wait >= self.patience:
# Slow down curriculum
self.threshold = max(self.threshold - self.acceleration / 2, self.min_threshold)
self.wait = 0
self.threshold_history.append(self.threshold)
return self.threshold
# ============================================================================
# Mixture Curriculum
# ============================================================================
class MixtureCurriculum:
"""
Sample batches with mixture of difficulties
Theory:
- Batch_t = α(t)·Easy + (1-α(t))·Hard
- Prevents overfitting to easy examples
- Maintains gradient diversity
"""
def __init__(self, dataset, difficulties, batch_size=128,
easy_ratio_start=0.8, easy_ratio_end=0.2):
"""
Args:
dataset: Full dataset
difficulties: (N,) array of sample difficulties
batch_size: Batch size
easy_ratio_start: Initial fraction of easy samples in batch
easy_ratio_end: Final fraction of easy samples in batch
"""
self.dataset = dataset
self.difficulties = difficulties
self.batch_size = batch_size
self.easy_ratio_start = easy_ratio_start
self.easy_ratio_end = easy_ratio_end
# Split into easy/medium/hard
self.easy_idx = np.where(difficulties < np.percentile(difficulties, 33))[0]
self.medium_idx = np.where((difficulties >= np.percentile(difficulties, 33)) &
(difficulties < np.percentile(difficulties, 67)))[0]
self.hard_idx = np.where(difficulties >= np.percentile(difficulties, 67))[0]
def get_mixed_batch(self, epoch, total_epochs):
"""
Sample batch with mixture of difficulties
Returns:
batch_indices: Indices for current batch
"""
progress = epoch / total_epochs
# Linear interpolation of easy ratio
easy_ratio = self.easy_ratio_start + (self.easy_ratio_end - self.easy_ratio_start) * progress
hard_ratio = 1 - easy_ratio
# Sample counts
num_easy = int(self.batch_size * easy_ratio * 0.5)
num_medium = int(self.batch_size * easy_ratio * 0.5)
num_hard = self.batch_size - num_easy - num_medium
# Sample from each difficulty level
easy_samples = np.random.choice(self.easy_idx, size=num_easy, replace=False)
medium_samples = np.random.choice(self.medium_idx, size=num_medium, replace=False)
hard_samples = np.random.choice(self.hard_idx, size=num_hard, replace=False)
# Combine and shuffle
batch_indices = np.concatenate([easy_samples, medium_samples, hard_samples])
np.random.shuffle(batch_indices)
return batch_indices
# ============================================================================
# Curriculum Evaluation Tools
# ============================================================================
class CurriculumEvaluator:
"""Tools for evaluating and visualizing curriculum effectiveness"""
@staticmethod
def plot_difficulty_distribution(difficulties, num_bins=50):
"""Plot histogram of difficulty scores"""
plt.figure(figsize=(10, 5))
plt.hist(difficulties, bins=num_bins, alpha=0.7, edgecolor='black')
plt.xlabel('Difficulty Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Difficulty Distribution', fontsize=13)
plt.axvline(difficulties.mean(), color='red', linestyle='--',
label=f'Mean: {difficulties.mean():.3f}')
plt.axvline(np.median(difficulties), color='green', linestyle='--',
label=f'Median: {np.median(difficulties):.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
return plt.gcf()
@staticmethod
def plot_curriculum_progress(threshold_history, sample_counts):
"""Plot curriculum progression over epochs"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
epochs = np.arange(len(threshold_history))
# Threshold progression
ax1.plot(epochs, threshold_history, 'b-o', linewidth=2, markersize=6)
ax1.fill_between(epochs, threshold_history, alpha=0.3)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Difficulty Threshold', fontsize=12)
ax1.set_title('Curriculum Threshold Progression', fontsize=13)
ax1.grid(True, alpha=0.3)
# Sample count progression
ax2.plot(epochs, sample_counts, 'g-o', linewidth=2, markersize=6)
ax2.fill_between(epochs, sample_counts, alpha=0.3, color='green')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Number of Training Samples', fontsize=12)
ax2.set_title('Training Set Size Over Time', fontsize=13)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
return fig
@staticmethod
def compare_curricula(results_dict, metric='test_accuracy'):
"""
Compare multiple curriculum strategies
Args:
results_dict: {name: {'train_loss': [...], 'test_acc': [...]}}
metric: Which metric to plot ('train_loss' or 'test_accuracy')
"""
plt.figure(figsize=(12, 6))
for name, results in results_dict.items():
epochs = np.arange(len(results[metric]))
plt.plot(epochs, results[metric], '-o', label=name, linewidth=2, markersize=5)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel(metric.replace('_', ' ').title(), fontsize=12)
plt.title(f'Curriculum Strategy Comparison - {metric.replace("_", " ").title()}',
fontsize=13)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
return plt.gcf()
@staticmethod
def plot_sample_weight_evolution(weight_history, sample_indices):
"""
Plot how sample weights evolve over training (for SPL)
Args:
weight_history: (epochs, num_samples) array of weights
sample_indices: Indices of samples to track
"""
plt.figure(figsize=(12, 6))
epochs = np.arange(weight_history.shape[0])
for idx in sample_indices:
plt.plot(epochs, weight_history[:, idx], '-',
label=f'Sample {idx}', alpha=0.7, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Sample Weight', fontsize=12)
plt.title('Sample Weight Evolution (Self-Paced Learning)', fontsize=13)
plt.legend(fontsize=10, ncol=2)
plt.grid(True, alpha=0.3)
plt.ylim(-0.05, 1.05)
plt.tight_layout()
return plt.gcf()
# ============================================================================
# Demonstration
# ============================================================================
print("Advanced Curriculum Learning Methods Implemented:")
print("=" * 70)
print("1. SelfPacedLearning - Joint optimization of model and sample weights")
print("2. TransferTeacherCurriculum - Teacher-guided difficulty scoring")
print("3. CompetenceBasedCurriculum - Gradient signal-to-noise ratio")
print("4. MultiTaskCurriculum - Progressive task scheduling")
print("5. AdaptiveCurriculumScheduler - Performance-based pacing")
print("6. MixtureCurriculum - Mixed-difficulty batches")
print("7. CurriculumEvaluator - Visualization and comparison tools")
print("=" * 70)
# Example: Self-Paced Learning
print("\nExample: Self-Paced Learning")
print("-" * 70)
# Simple dataset and model for demonstration
from torchvision import datasets, transforms
transform = transforms.Compose([transforms.ToTensor()])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader_simple = DataLoader(mnist_train, batch_size=128, shuffle=True)
test_loader_simple = DataLoader(mnist_test, batch_size=1000, shuffle=False)
# Simple model
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(-1, 784)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_spl = SimpleNet().to(device)
optimizer_spl = torch.optim.Adam(model_spl.parameters(), lr=1e-3)
# Initialize SPL
spl = SelfPacedLearning(
model_spl,
lambda_init=0.5,
lambda_growth=1.15,
soft=True,
temperature=1.0
)
print(f"Initial λ: {spl.lambda_param:.3f}")
print("Training for 5 epochs with Self-Paced Learning...")
inclusion_rates = []
for epoch in range(5):
avg_loss, inclusion_rate = spl.train_epoch(train_loader_simple, optimizer_spl, device)
inclusion_rates.append(inclusion_rate)
# Quick eval
model_spl.eval()
correct = 0
total = 0
with torch.no_grad():
for x, y in test_loader_simple:
x, y = x.to(device), y.to(device)
output = model_spl(x)
pred = output.argmax(1)
correct += (pred == y).sum().item()
total += len(y)
acc = 100 * correct / total
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, λ={spl.lambda_param:.3f}, "
f"Inclusion={inclusion_rate:.2%}, Test Acc={acc:.2f}%")
# Visualize sample weight distribution
plt.figure(figsize=(10, 5))
plt.hist(spl.sample_weights, bins=50, alpha=0.7, edgecolor='black', color='purple')
plt.xlabel('Sample Weight', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Final Sample Weight Distribution (Self-Paced Learning)', fontsize=13)
plt.axvline(0.5, color='red', linestyle='--', label='Inclusion Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nFinal statistics:")
print(f" Samples with weight > 0.5: {(spl.sample_weights > 0.5).sum()}/{len(spl.sample_weights)}")
print(f" Mean weight: {spl.sample_weights.mean():.3f}")
print(f" Weight std: {spl.sample_weights.std():.3f}")
print("\n" + "=" * 70)
print("Key Takeaways:")
print("=" * 70)
print("1. Self-Paced Learning: Model automatically selects easy samples early")
print("2. λ parameter: Controls pacing (increases over time to include harder samples)")
print("3. Soft weights: Smoother than hard 0/1, better gradient flow")
print("4. Inclusion rate: Starts low, gradually increases as model improves")
print("5. Applications: Noisy labels, imbalanced data, domain adaptation")
print("=" * 70)
1. Curriculum Learning
Concept
Train on easier examples first, gradually increase difficulty:
where \(\lambda(t)\) increases with training step \(t\).
Benefits:
Faster convergence
Better generalization
Escape local minima
Reference Materials:
foundation_neural_network.pdf - Foundation Neural Network
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(-1, 784)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
# Load MNIST
transform = transforms.Compose([transforms.ToTensor()])
mnist = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_mnist = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_mnist, batch_size=1000)
print("Data loaded")
Difficulty Scoring
Curriculum learning requires a way to rank training examples by difficulty. Common approaches include loss-based scoring (examples with higher initial loss are harder), model confidence (examples predicted with lower probability are harder), and domain heuristics (e.g., sentence length for NLP, image complexity for vision). The difficulty scores determine the order in which examples are presented during training: starting with easy examples and gradually introducing harder ones. The scoring function should correlate with genuine learning difficulty, not just noise, so using a pre-trained model's predictions often works better than random initialization loss.
def compute_difficulty_scores(model, dataset):
"""Compute difficulty for each sample."""
model.eval()
difficulties = []
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=False)
with torch.no_grad():
for x, y in loader:
x, y = x.to(device), y.to(device)
output = model(x)
# Difficulty = 1 - confidence on true class
probs = F.softmax(output, dim=1)
confidence = probs[torch.arange(len(y)), y]
difficulty = 1 - confidence
difficulties.extend(difficulty.cpu().numpy())
return np.array(difficulties)
# Initialize model and compute initial difficulties
model = SimpleNet().to(device)
difficulties = compute_difficulty_scores(model, mnist)
plt.figure(figsize=(10, 5))
plt.hist(difficulties, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Difficulty Score', fontsize=11)
plt.ylabel('Frequency', fontsize=11)
plt.title('Initial Difficulty Distribution', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()
print(f"Mean difficulty: {difficulties.mean():.3f}")
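The confidence-based score above is one option; the paragraph on difficulty scoring also mentions loss-based scoring. A minimal NumPy sketch (the function name is illustrative, not from the code above) that converts true-class confidence into per-sample cross-entropy, a monotone alternative to `1 - confidence`:

```python
import numpy as np

def loss_based_difficulty(confidences, eps=1e-8):
    """Per-sample cross-entropy on the true class: -log p(y|x).
    Higher loss => harder sample; monotone in 1 - confidence."""
    return -np.log(np.clip(np.asarray(confidences, dtype=float), eps, 1.0))

# A confident prediction scores low, an uncertain one scores high
print(loss_based_difficulty([0.95, 0.10]))
```

Because the transform is monotone, it ranks samples identically to `1 - confidence`, but it spreads hard examples over a wider range, which can matter when difficulty scores are used as soft weights rather than for thresholding.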
Curriculum Scheduler¶
The curriculum scheduler controls how quickly the training set expands from easy to hard examples. Common schedules include linear (add a fixed fraction of harder examples each epoch), exponential (rapid growth early that saturates toward the full dataset), and self-paced (let the model's current performance determine when to include harder examples). The key hyperparameter is the pace: too fast and the curriculum collapses to standard random training; too slow and the model wastes time on easy examples it has already mastered. Self-paced learning adaptively adjusts the difficulty threshold based on the model's current loss distribution, providing an automatic curriculum.
class CurriculumScheduler:
"""Schedule curriculum difficulty."""
def __init__(self, difficulties, schedule='linear', epochs=10):
self.difficulties = difficulties
self.schedule = schedule
self.epochs = epochs
    def get_threshold(self, epoch):
        """Get difficulty threshold (fraction of data) for epoch."""
        # Use epoch + 1 so the first epoch already includes a usable
        # subset and the final epoch covers the full dataset
        # (epoch / epochs would start empty and never reach 100%).
        progress = min((epoch + 1) / self.epochs, 1.0)
        if self.schedule == 'linear':
            threshold = progress
        elif self.schedule == 'exponential':
            threshold = 1 - np.exp(-3 * progress)
        elif self.schedule == 'root':
            threshold = np.sqrt(progress)
        else:
            threshold = 1.0
        return threshold
def get_indices(self, epoch):
"""Get sample indices for current epoch."""
threshold = self.get_threshold(epoch)
max_difficulty = np.percentile(self.difficulties, threshold * 100)
indices = np.where(self.difficulties <= max_difficulty)[0]
return indices
# Visualize schedules
scheduler = CurriculumScheduler(difficulties, epochs=10)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
epochs = np.arange(10)
for schedule in ['linear', 'exponential', 'root']:
scheduler.schedule = schedule
thresholds = [scheduler.get_threshold(e) for e in epochs]
axes[0].plot(epochs, thresholds, marker='o', label=schedule.capitalize())
axes[0].set_xlabel('Epoch', fontsize=11)
axes[0].set_ylabel('Difficulty Threshold', fontsize=11)
axes[0].set_title('Curriculum Schedules', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Sample sizes
scheduler.schedule = 'linear'
sizes = [len(scheduler.get_indices(e)) for e in epochs]
axes[1].bar(epochs, sizes, alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Epoch', fontsize=11)
axes[1].set_ylabel('Training Set Size', fontsize=11)
axes[1].set_title('Progressive Training Set Growth', fontsize=12)
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
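The self-paced variant mentioned above can be sketched with the soft-weight form summarized earlier (λ controls pacing, weights are soft rather than hard 0/1). A minimal NumPy illustration, assuming the linear soft-weight rule \(w_i = \max(0, 1 - \ell_i/\lambda)\):

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Soft self-paced weights: w_i = max(0, 1 - loss_i / lam).
    Samples with loss >= lam get weight 0; raising lam over training
    gradually includes harder (higher-loss) samples."""
    losses = np.asarray(losses, dtype=float)
    return np.clip(1.0 - losses / lam, 0.0, 1.0)

losses = np.array([0.1, 0.5, 1.5, 3.0])
for lam in [0.5, 1.0, 2.0]:
    w = self_paced_weights(losses, lam)
    print(f"lam={lam}: weights={np.round(w, 2)}, inclusion rate={np.mean(w > 0):.2f}")
```

Note how the inclusion rate grows with λ, matching the "starts low, gradually increases" behavior described in the printed summary above.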
Train with Curriculum¶
Training with the curriculum involves starting from the easiest subset and progressively expanding to include harder examples according to the scheduler. At each stage, the model trains on the current subset until performance plateaus, then the difficulty threshold advances. This staged approach often reaches higher final accuracy and converges faster than standard training, particularly for datasets with significant difficulty variance or noisy labels. The intuition mirrors human learning: mastering fundamentals before tackling advanced material builds a more robust foundation.
def train_with_curriculum(model, dataset, test_loader, scheduler, n_epochs=10):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_losses = []
test_accs = []
for epoch in range(n_epochs):
# Get curriculum subset
indices = scheduler.get_indices(epoch)
subset = torch.utils.data.Subset(dataset, indices)
loader = torch.utils.data.DataLoader(subset, batch_size=128, shuffle=True)
# Train
model.train()
epoch_loss = 0
for x, y in loader:
x, y = x.to(device), y.to(device)
output = model(x)
loss = F.cross_entropy(output, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
train_losses.append(epoch_loss / len(loader))
# Evaluate
model.eval()
correct = 0
total = 0
with torch.no_grad():
for x, y in test_loader:
x, y = x.to(device), y.to(device)
output = model(x)
pred = output.argmax(dim=1)
correct += (pred == y).sum().item()
total += y.size(0)
acc = 100 * correct / total
test_accs.append(acc)
print(f"Epoch {epoch+1}: {len(indices)} samples, Acc: {acc:.2f}%")
return train_losses, test_accs
# Train with curriculum
model_curriculum = SimpleNet().to(device)
scheduler = CurriculumScheduler(difficulties, schedule='linear', epochs=10)
losses_curr, accs_curr = train_with_curriculum(model_curriculum, mnist, test_loader, scheduler, 10)
Baseline Training¶
For a fair comparison, we train the same model architecture with standard random-order training (no curriculum). The baseline uses identical hyperparameters (learning rate, batch size, optimizer, total epochs) so that any performance difference can be attributed to the training order rather than other factors. Running multiple random seeds for both curriculum and baseline training provides statistical significance for the comparison.
def train_baseline(model, train_loader, test_loader, n_epochs=10):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_losses = []
test_accs = []
for epoch in range(n_epochs):
model.train()
epoch_loss = 0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
output = model(x)
loss = F.cross_entropy(output, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
train_losses.append(epoch_loss / len(train_loader))
# Evaluate
model.eval()
correct = 0
total = 0
with torch.no_grad():
for x, y in test_loader:
x, y = x.to(device), y.to(device)
output = model(x)
pred = output.argmax(dim=1)
correct += (pred == y).sum().item()
total += y.size(0)
acc = 100 * correct / total
test_accs.append(acc)
print(f"Epoch {epoch+1}, Acc: {acc:.2f}%")
return train_losses, test_accs
# Train baseline
model_baseline = SimpleNet().to(device)
losses_base, accs_base = train_baseline(model_baseline, train_loader, test_loader, 10)
Compare Results¶
Comparing the learning curves and final accuracies of curriculum training versus baseline random training reveals the impact of example ordering. A successful curriculum typically shows faster initial improvement (the model quickly learns from easy examples) and higher final accuracy (gradual difficulty increase provides better regularization). The benefit is most pronounced on tasks with high difficulty variance, noisy labels, or limited training data. On clean, balanced datasets the advantage may be small, which is itself an informative finding.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Training loss
axes[0].plot(losses_base, 'b-o', label='Baseline', markersize=5)
axes[0].plot(losses_curr, 'r-o', label='Curriculum', markersize=5)
axes[0].set_xlabel('Epoch', fontsize=11)
axes[0].set_ylabel('Training Loss', fontsize=11)
axes[0].set_title('Training Loss Comparison', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Test accuracy
axes[1].plot(accs_base, 'b-o', label='Baseline', markersize=5)
axes[1].plot(accs_curr, 'r-o', label='Curriculum', markersize=5)
axes[1].set_xlabel('Epoch', fontsize=11)
axes[1].set_ylabel('Test Accuracy (%)', fontsize=11)
axes[1].set_title('Test Accuracy Comparison', fontsize=12)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nFinal Accuracy - Baseline: {accs_base[-1]:.2f}%, Curriculum: {accs_curr[-1]:.2f}%")
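A single run per method is noisy; as noted above, repeating both runs over several random seeds gives a more trustworthy comparison. A small helper for aggregating final accuracies across seeds (the accuracy lists below are hypothetical placeholders, not measured results):

```python
import numpy as np

def summarize_runs(final_accs):
    """Mean and sample standard deviation of final accuracy across seeds."""
    a = np.asarray(final_accs, dtype=float)
    return a.mean(), a.std(ddof=1)

# Hypothetical final accuracies from 5 seeds of each method
results = {
    'baseline':   [97.1, 97.4, 96.9, 97.2, 97.0],
    'curriculum': [97.6, 97.8, 97.5, 97.9, 97.4],
}
for name, runs in results.items():
    m, s = summarize_runs(runs)
    print(f"{name}: {m:.2f} +/- {s:.2f}%")
```

Reporting mean plus/minus standard deviation makes it clear whether the curriculum gap exceeds run-to-run variance.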
Summary¶
Curriculum Learning:¶
Easy-to-hard training progression
Difficulty scoring based on model confidence
Adaptive scheduling (linear, exponential, root)
Faster convergence in early stages
Difficulty Metrics:¶
Model confidence (1 - p(y|x))
Loss magnitude
Prediction variance
Domain-specific heuristics
Applications:¶
Image classification
Language modeling
Reinforcement learning
Neural architecture search
Variants:¶
Self-paced learning: Model selects samples
Transfer teacher: Use pre-trained model
Anti-curriculum: Hard-to-easy for robustness
Dynamic curriculum: Adapt based on progress
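The anti-curriculum variant from the list above can reuse the same percentile machinery as `CurriculumScheduler`, with the selection reversed. A minimal NumPy sketch (illustrative helper, not part of the code above):

```python
import numpy as np

def anti_curriculum_indices(difficulties, epoch, epochs):
    """Hard-to-easy: start from the hardest fraction of samples
    and expand downward until the full dataset is included."""
    frac = (epoch + 1) / epochs
    cutoff = np.percentile(difficulties, 100 * (1 - frac))
    return np.where(np.asarray(difficulties) >= cutoff)[0]

diff = np.array([0.1, 0.2, 0.5, 0.8, 0.9])
for e in [0, 2, 4]:
    print(f"epoch {e}: indices {anti_curriculum_indices(diff, e, 5)}")
```

Early epochs select only the hardest samples; by the final epoch every index is included, mirroring the easy-to-hard scheduler in reverse.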