import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

Advanced Curriculum Learning TheoryΒΆ

1. Foundations and MotivationΒΆ

Definition: Curriculum Learning is a training strategy where a model is trained on progressively more complex data, analogous to human learning from simple to difficult concepts.

Historical Context:

  • Bengio et al. (2009): Introduced curriculum learning inspired by human education

  • Key insight: Easier examples provide better gradient signal early in training

  • Biological motivation: Humans learn better with structured curricula

Fundamental Hypothesis:

\[\mathcal{L}_{\text{curriculum}}(f_{\theta}) < \mathcal{L}_{\text{random}}(f_{\theta})\]

Training with curriculum leads to better local minima than random sampling.

2. Theoretical FoundationsΒΆ

2.1 Convergence Analysis

Theorem (Bengio et al., 2009): Under certain conditions, curriculum learning provides:

  1. Faster convergence to local minima

  2. Escape from poor local minima

  3. Better generalization through regularization

Loss landscape perspective:

Early training with easy examples shapes the loss landscape:

\[\mathcal{L}(\theta; \mathcal{D}_{\text{easy}}) \approx \mathcal{L}(\theta; \mathcal{D}_{\text{all}}) + R(\theta)\]

Where \(R(\theta)\) acts as implicit regularization.

Convergence rate:

With curriculum: \(O\left(\frac{1}{\sqrt{T_{\text{easy}} + T_{\text{hard}}}}\right)\)

Random: \(O\left(\frac{1}{\sqrt{T}}\right)\)

The key is \(T_{\text{easy}}\) uses easier gradients β†’ faster initial progress.

2.2 Information-Theoretic View

Entropy-based difficulty:

\[D(x) = H(Y|X=x) = -\sum_{y} p(y|x) \log p(y|x)\]
  • Low entropy (high confidence) β†’ Easy sample

  • High entropy (uncertain) β†’ Hard sample

Curriculum as progressive entropy increase:

\[H_t = H_{t-1} + \epsilon_t, \quad \epsilon_t > 0\]

Gradually expose model to higher-entropy (more uncertain) examples.

2.3 PAC Learning Framework

Sample complexity with curriculum:

\[m_{\text{curriculum}} = O\left(\frac{d}{\epsilon^2} \log\frac{1}{\delta}\right)\]

Compared to standard:

\[m_{\text{standard}} = O\left(\frac{d \log d}{\epsilon^2} \log\frac{1}{\delta}\right)\]

Curriculum reduces dependence on dimension \(d\) through structured sampling.

3. Difficulty Metrics TaxonomyΒΆ

3.1 Model-Based Metrics

Prediction Confidence:

\[D_{\text{conf}}(x, y) = 1 - p_{\theta}(y|x)\]

Low confidence β†’ High difficulty

Loss-Based:

\[D_{\text{loss}}(x, y) = \mathcal{L}(f_{\theta}(x), y)\]

Directly use training loss as difficulty proxy.

Prediction Variance (Ensembles):

\[D_{\text{var}}(x) = \text{Var}_{i=1}^{k}[f_{\theta_i}(x)]\]

Use ensemble disagreement as uncertainty measure.

Teacher-Student:

\[D_{\text{transfer}}(x) = \text{KL}[p_{\text{teacher}}(y|x) \| p_{\text{student}}(y|x)]\]

Use divergence from pre-trained teacher as difficulty.

3.2 Data-Based Metrics

Prototype Distance:

\[D_{\text{proto}}(x) = \min_{c} \|x - \mu_c\|_2\]

Distance to nearest class centroid \(\mu_c = \frac{1}{|C_c|}\sum_{x_i \in C_c} x_i\)

Manifold Density:

\[D_{\text{density}}(x) = -\log p(x) \approx -\frac{1}{k}\sum_{i=1}^{k} \|x - x_i^{(NN)}\|\]

Low density β†’ Outlier β†’ Difficult

3.3 Domain-Specific Metrics

For Images:

  • Image complexity: Edge density, color variance

  • Occlusion level

  • Object size

  • Multi-object scenes

For NLP:

  • Sentence length

  • Syntactic complexity (parse tree depth)

  • Vocabulary rarity

  • Semantic ambiguity

For RL:

  • Episode length

  • Reward sparsity

  • Action space complexity

4. Curriculum Scheduling StrategiesΒΆ

4.1 Predefined Schedules

Linear Schedule:

\[\lambda(t) = \min\left(1, \frac{t}{T}\right)\]

Simplest: Linearly increase difficulty threshold.

Exponential Schedule:

\[\lambda(t) = 1 - e^{-\alpha t / T}\]

Rapid early growth, then saturation. Good for quick convergence.

Root Schedule:

\[\lambda(t) = \sqrt{\frac{t}{T}}\]

Conservative growth, gradual difficulty increase.

Step Schedule:

\[\begin{split}\lambda(t) = \begin{cases} 0.3 & t < 0.25T \\ 0.6 & 0.25T \leq t < 0.5T \\ 1.0 & t \geq 0.5T \end{cases}\end{split}\]

Discrete stages, clear phase transitions.

Cosine Schedule:

\[\lambda(t) = \frac{1}{2}\left(1 - \cos\left(\frac{\pi t}{T}\right)\right)\]

Smooth S-curve, used in cosine annealing variants.

4.2 Adaptive Schedules

Performance-Based:

\[\begin{split}\lambda(t+1) = \begin{cases} \lambda(t) + \delta & \text{if } \text{Acc}(t) > \tau \\ \lambda(t) & \text{otherwise} \end{cases}\end{split}\]

Increase difficulty only when performance threshold met.

Loss-Based:

\[\lambda(t+1) = \lambda(t) + \beta \cdot \frac{\mathcal{L}(t-1) - \mathcal{L}(t)}{\mathcal{L}(t-1)}\]

Accelerate when loss decreases rapidly (good progress).

Gradient-Based:

\[\lambda(t+1) = \lambda(t) + \gamma \cdot \|\nabla_{\theta} \mathcal{L}\|\]

Faster pacing when gradients are large (strong signal).

5. Self-Paced LearningΒΆ

5.1 Formulation

Joint Optimization (Kumar et al., 2010):

\[\min_{\theta, v} \sum_{i=1}^{n} v_i \mathcal{L}(f_{\theta}(x_i), y_i) - \frac{\lambda}{2} \|v\|^2\]

Subject to: \(v_i \in [0, 1]\)

Where:

  • \(v_i\): Weight for sample \(i\) (0 = exclude, 1 = include)

  • \(\lambda\): Pacing parameter (controls curriculum speed)

Alternating optimization:

  1. Fix \(v\), update \(\theta\) (train model)

  2. Fix \(\theta\), update \(v\) (select samples)

Sample selection rule:

\[\begin{split}v_i^* = \begin{cases} 1 & \text{if } \mathcal{L}(f_{\theta}(x_i), y_i) < \lambda \\ 0 & \text{otherwise} \end{cases}\end{split}\]

Model chooses samples with loss below threshold.

5.2 Self-Paced Regularization

Smooth variant:

\[v_i^* = \frac{\lambda}{\lambda + \mathcal{L}(f_{\theta}(x_i), y_i)}\]

Soft weights instead of hard 0/1.

Diversity regularization:

\[\min_{\theta, v} \sum_i v_i \mathcal{L}_i - \frac{\lambda}{2}\|v\|^2 + \mu \cdot R_{\text{diversity}}(v)\]

Encourage selecting diverse samples, not just easy ones.

6. Transfer Teacher CurriculumΒΆ

6.1 Knowledge Distillation-Based

Teacher provides curriculum:

\[D_i = \text{KL}[p_{\text{teacher}}(y|x_i) \| \text{Uniform}]\]

Samples where teacher is confident are easier.

Progressive distillation:

Epoch \(t\): Train on samples where teacher accuracy > \(\tau(t)\)

\[\mathcal{D}_t = \{(x_i, y_i) : \max_y p_{\text{teacher}}(y|x_i) > \tau(t)\}\]

6.2 Cross-Domain Transfer

Pre-train on easy domain, fine-tune on hard domain:

Example: ImageNet (easy) β†’ Medical imaging (hard)

Gradual domain shift:

\[\mathcal{D}_t = \alpha(t) \cdot \mathcal{D}_{\text{source}} + (1-\alpha(t)) \cdot \mathcal{D}_{\text{target}}\]

where \(\alpha(t)\) decreases from 1 to 0.

7. Anti-Curriculum LearningΒΆ

7.1 Hard-to-Easy Training

Motivation: Train on hard examples first to learn robust features

Schedule:

\[\lambda(t) = 1 - \frac{t}{T}\]

Decreasing difficulty threshold (opposite of curriculum).

When to use:

  • Adversarial robustness

  • Domain adaptation (hard target domain first)

  • Debiasing (prioritize minority/hard classes)

7.2 Hard Example Mining

Hard Negative Mining (Object Detection):

Train on misclassified/hard negatives:

\[\mathcal{D}_{\text{hard}} = \{x_i : \mathcal{L}(f_{\theta}(x_i), y_i) > \tau\}\]

Focal Loss (Lin et al., 2017):

\[\mathcal{L}_{\text{focal}} = -(1-p_t)^{\gamma} \log p_t\]

Automatically down-weights easy examples, focuses on hard ones.

8. Multi-Task CurriculumΒΆ

8.1 Task-Level Curriculum

Task difficulty ordering:

\[\text{Task 1 (Easy)} \to \text{Task 2 (Medium)} \to \text{Task 3 (Hard)}\]

Example: POS tagging β†’ Parsing β†’ Semantic role labeling

Joint formulation:

\[\min_{\theta} \sum_{k=1}^{K} w_k(t) \mathcal{L}_k(\theta)\]

where \(w_k(t)\) increases for harder tasks over time.

8.2 Auxiliary Task Curriculum

Progressive dropping:

Early training: Use multiple auxiliary tasks for regularization

Late training: Focus on main task only

Task weights:

\[w_{\text{aux}}(t) = w_0 \cdot e^{-\beta t}\]

Exponentially decay auxiliary task weight.

9. Curriculum for Different ArchitecturesΒΆ

9.1 Vision Transformers

Patch size curriculum:

  • Early: Large patches (low resolution, easier)

  • Late: Small patches (high resolution, harder)

\[\text{Patch size}(t) = P_{\max} \cdot (1 - \alpha(t)) + P_{\min} \cdot \alpha(t)\]

9.2 Language Models

Sequence length curriculum:

\[L(t) = L_{\min} + (L_{\max} - L_{\min}) \cdot \lambda(t)\]

Start with short sequences, gradually increase.

9.3 Reinforcement Learning

Environment complexity:

  • Simple mazes β†’ Complex mazes

  • Few obstacles β†’ Many obstacles

  • Deterministic β†’ Stochastic

Reward shaping curriculum:

Early: Dense rewards (easier) Late: Sparse rewards (harder, more realistic)

10. Theoretical Guarantees and AnalysisΒΆ

10.1 Generalization Bounds

Theorem (Hacohen & Weinshall, 2019):

With appropriate curriculum, test error satisfies:

\[\epsilon_{\text{test}} \leq \epsilon_{\text{train}} + O\left(\sqrt{\frac{d \log n}{n}} \cdot (1 - \beta)\right)\]

where \(\beta\) is curriculum benefit factor (0 < \(\beta\) < 1).

10.2 Sample Complexity

Curriculum reduces required samples:

\[n_{\text{curriculum}} = O\left(\frac{d}{\epsilon^2}\right)$$ vs. $$n_{\text{standard}} = O\left(\frac{d \log d}{\epsilon^2}\right)\]

Logarithmic improvement in dimension dependency.

10.3 Convergence Rate

SGD with curriculum:

\[\mathbb{E}[\|\nabla \mathcal{L}(\theta_T)\|^2] \leq \frac{C}{\sqrt{T}} \cdot (1 + \beta_{\text{curriculum}})\]

where \(\beta_{\text{curriculum}} < 0\) indicates acceleration.

11. Practical ConsiderationsΒΆ

11.1 Hyperparameter Tuning

Key hyperparameters:

  • Schedule type (linear, exponential, adaptive)

  • Pacing speed (how fast to increase difficulty)

  • Initial difficulty threshold

  • Batch composition (mixed vs pure difficulty levels)

Guidelines:

  • Conservative pacing: Slower curriculum for complex tasks

  • Aggressive pacing: Faster for well-structured data

  • Adaptive: Monitor validation loss, adjust dynamically

11.2 Computational Overhead

Difficulty scoring cost:

  • Model-based: Requires forward pass β†’ \(O(n)\) per epoch

  • Data-based: Pre-compute once β†’ \(O(1)\) per epoch

Amortization strategies:

  • Compute difficulty every K epochs, not every epoch

  • Use cheaper proxy metrics (e.g., image complexity for vision)

  • Cache difficulty scores and update periodically

11.3 Curriculum Design Workflow

  1. Define difficulty metric: Choose appropriate measure for domain

  2. Sort/score dataset: Assign difficulty to each sample

  3. Choose schedule: Select pacing strategy (linear, adaptive, etc.)

  4. Monitor performance: Track convergence on validation set

  5. Adjust if needed: Modify schedule if curriculum too fast/slow

12. Advanced TechniquesΒΆ

12.1 Mixture of Difficulties

Instead of pure batches, mix difficulties:

\[\text{Batch}_t = \alpha(t) \cdot \text{Easy} + (1-\alpha(t)) \cdot \text{Hard}\]

Benefits:

  • Prevents overfitting to easy examples

  • Maintains gradient diversity

  • Smoother transition

12.2 Dynamic Difficulty Adjustment

Per-sample pacing:

\[\lambda_i(t+1) = \lambda_i(t) + \eta \cdot \mathbb{1}[\text{correct prediction}]\]

Increase difficulty for samples model handles well.

12.3 Curriculum Dropout

Randomly drop curriculum constraint with probability \(p(t)\):

\[\text{Use curriculum?} \sim \text{Bernoulli}(1 - p(t))\]

where \(p(t)\) increases over time β†’ gradual transition to random.

13. Failure Modes and MitigationsΒΆ

13.1 Common Pitfalls

Premature convergence:

  • Curriculum too slow β†’ Model overfits easy examples

  • Mitigation: Monitor validation performance, accelerate if needed

Catastrophic forgetting:

  • Hard examples introduced too late β†’ Forget easy patterns

  • Mitigation: Mixed batches, periodic review of easy examples

Poor difficulty metric:

  • Metric doesn’t align with true difficulty

  • Mitigation: Validate metric with human annotation, try multiple metrics

13.2 Debugging Strategies

Difficulty distribution analysis: Plot difficulty scores, check for reasonable spread

Learning curves: Compare curriculum vs. random sampling convergence

Sample inspection: Manually verify easy/medium/hard samples make sense

14. State-of-the-Art MethodsΒΆ

14.1 Competence-Based Curriculum (Graves et al., 2017)

Signal-to-noise ratio:

\[D_i = \frac{\|\nabla_{\theta} \mathcal{L}_i\|}{var(\mathcal{L}_i)}\]

High gradient magnitude + low variance β†’ Good for learning

14.2 Curriculum by Smoothing (Spitkovsky et al., 2010)

Progressively reduce data augmentation/noise:

\[x_t = x + \sigma(t) \cdot \epsilon, \quad \sigma(t) \to 0\]

Start with smoothed (easier) data, converge to original.

14.3 Automatic Curriculum Learning (ACL)

Meta-learning for curriculum:

\[\min_{\phi} \mathbb{E}_{\tau \sim p(\tau; \phi)}[\mathcal{L}_{\text{val}}(\theta^*(\tau))]\]

Learn curriculum policy \(\phi\) that generates sequences \(\tau\) optimizing validation loss.

15. Empirical Results and BenchmarksΒΆ

15.1 Vision Tasks

CIFAR-10/100:

  • Curriculum: 2-5% accuracy improvement

  • Faster convergence: 20-30% fewer epochs

ImageNet:

  • Curriculum: 1-2% top-1 accuracy gain

  • Reduced training time: 15-20%

15.2 NLP Tasks

Machine Translation:

  • Curriculum (shortβ†’long sentences): 2-4 BLEU improvement

  • Faster convergence: 25% fewer steps

Language Modeling:

  • Perplexity reduction: 5-10%

  • Especially effective for low-resource languages

15.3 Reinforcement Learning

Atari Games:

  • Curriculum (easyβ†’hard levels): 20-40% higher scores

  • More stable training

Robotics:

  • Sim-to-real transfer: Curriculum reduces reality gap

  • Progressive environment complexity improves generalization

16. Key Papers and TimelineΒΆ

Foundational (2009-2012):

  • Bengio et al. 2009: Curriculum Learning - Original concept and motivation

  • Kumar et al. 2010: Self-Paced Learning - Model selects easy samples

  • Lee & Grauman 2011: Learning the Easy Things First - Attributes curriculum

Methods (2013-2017):

  • Jiang et al. 2015: Self-Paced Learning with Diversity - Diverse sample selection

  • Graves et al. 2017: Automated Curriculum Learning - Meta-learning curriculum

  • Hacohen & Weinshall 2019: On The Power of Curriculum Learning - Theoretical analysis

Applications (2018-2024):

  • Soviany et al. 2021: Curriculum Learning Survey - Comprehensive review

  • Xu et al. 2020: Curriculum Learning for NLP - Text-specific strategies

  • Narvekar et al. 2020: Curriculum Learning for RL - RL-specific methods

Computational Complexity AnalysisΒΆ

Difficulty computation:

  • Model-based: \(O(n \cdot C_{\text{forward}})\) where \(C_{\text{forward}}\) is forward pass cost

  • Data-based: \(O(n \cdot d)\) where d is feature dimension

Curriculum overhead:

  • Sorting: \(O(n \log n)\) (one-time or periodic)

  • Sample selection per epoch: \(O(n)\) (filtering by threshold)

Total complexity:

\[C_{\text{total}} = C_{\text{difficulty}} + T \cdot C_{\text{selection}} + T \cdot C_{\text{training}}\]

Typically \(C_{\text{difficulty}} + C_{\text{selection}} \ll C_{\text{training}}\), so overhead is negligible.

"""
Advanced Curriculum Learning Implementations

This cell provides production-ready implementations of:
1. Self-Paced Learning (SPL) with soft sample weighting
2. Transfer Teacher Curriculum (knowledge distillation-based)
3. Competence-Based Curriculum (gradient-based difficulty)
4. Multi-Task Curriculum Learning
5. Adaptive Curriculum Scheduler
6. Mixture Curriculum (mixed difficulty batches)
7. Curriculum evaluation and visualization tools
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader, Subset, WeightedRandomSampler
from sklearn.metrics import accuracy_score
import copy
from collections import defaultdict

# ============================================================================
# Self-Paced Learning (SPL)
# ============================================================================

class SelfPacedLearning:
    """
    Self-Paced Learning (Kumar et al., 2010)
    
    Theory:
    - Jointly optimize model parameters ΞΈ and sample weights v
    - min_{ΞΈ,v} Ξ£ v_iΒ·L_i - Ξ»/2Β·||v||Β²
    - v_i ∈ [0,1] controls sample inclusion
    - Ξ»: pacing parameter (increases over time)
    """
    
    def __init__(self, model, lambda_init=1.0, lambda_growth=1.1, 
                 soft=True, temperature=1.0):
        """
        Args:
            model: Neural network to train
            lambda_init: Initial pacing parameter
            lambda_growth: Growth rate per epoch (Ξ»_t = Ξ»_{t-1} * growth)
            soft: Use soft weights (True) or hard 0/1 (False)
            temperature: Temperature for soft weighting
        """
        self.model = model
        self.lambda_param = lambda_init
        self.lambda_growth = lambda_growth
        self.soft = soft
        self.temperature = temperature
        self.sample_weights = None
        
    def compute_sample_weights(self, losses):
        """
        Compute sample weights based on current losses
        
        Args:
            losses: (N,) array of per-sample losses
            
        Returns:
            weights: (N,) array of sample weights in [0,1]
        """
        if self.soft:
            # Soft weighting: v_i = Ξ» / (Ξ» + L_i / T)
            weights = self.lambda_param / (self.lambda_param + losses / self.temperature)
        else:
            # Hard weighting: v_i = 1 if L_i < Ξ» else 0
            weights = (losses < self.lambda_param).astype(np.float32)
        
        return weights
    
    def train_epoch(self, train_loader, optimizer, device='cpu'):
        """
        Train for one epoch with self-paced sample selection
        
        Returns:
            avg_loss: Average weighted loss
            inclusion_rate: Fraction of samples with weight > 0.5
        """
        self.model.train()
        
        # First pass: compute losses for all samples
        all_losses = []
        all_data = []
        
        with torch.no_grad():
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                output = self.model(x)
                loss = F.cross_entropy(output, y, reduction='none')
                all_losses.extend(loss.cpu().numpy())
                all_data.append((x, y))
        
        all_losses = np.array(all_losses)
        
        # Compute sample weights
        weights = self.compute_sample_weights(all_losses)
        self.sample_weights = weights
        
        # Second pass: train with weighted loss
        total_loss = 0
        num_batches = 0
        
        idx = 0
        for x, y in all_data:
            batch_size = len(y)
            batch_weights = torch.tensor(
                weights[idx:idx+batch_size], 
                dtype=torch.float32, 
                device=device
            )
            idx += batch_size
            
            # Forward pass
            output = self.model(x)
            loss = F.cross_entropy(output, y, reduction='none')
            
            # Weighted loss
            weighted_loss = (loss * batch_weights).mean()
            
            # Backward pass
            optimizer.zero_grad()
            weighted_loss.backward()
            optimizer.step()
            
            total_loss += weighted_loss.item()
            num_batches += 1
        
        # Update pacing parameter
        self.lambda_param *= self.lambda_growth
        
        avg_loss = total_loss / num_batches
        inclusion_rate = (weights > 0.5).mean()
        
        return avg_loss, inclusion_rate


# ============================================================================
# Transfer Teacher Curriculum
# ============================================================================

class TransferTeacherCurriculum:
    """
    Use pre-trained teacher to guide curriculum
    
    Theory:
    - Teacher provides difficulty scores based on confidence
    - D_i = 1 - max_y p_teacher(y|x_i)
    - Train student on samples where teacher is confident (low D_i)
    """
    
    def __init__(self, student_model, teacher_model, schedule='linear'):
        """
        Args:
            student_model: Model to train
            teacher_model: Pre-trained model (frozen)
            schedule: Difficulty threshold schedule
        """
        self.student = student_model
        self.teacher = teacher_model
        self.teacher.eval()  # Freeze teacher
        self.schedule = schedule
        self.difficulties = None
    
    def compute_difficulties(self, dataset, device='cpu'):
        """
        Compute teacher-based difficulty for all samples
        
        Returns:
            difficulties: (N,) array of difficulty scores
        """
        loader = DataLoader(dataset, batch_size=256, shuffle=False)
        difficulties = []
        
        with torch.no_grad():
            for x, y in loader:
                x = x.to(device)
                output = self.teacher(x)
                probs = F.softmax(output, dim=1)
                
                # Difficulty = 1 - max probability (teacher confidence)
                max_probs = probs.max(dim=1)[0]
                difficulty = 1 - max_probs
                
                difficulties.extend(difficulty.cpu().numpy())
        
        self.difficulties = np.array(difficulties)
        return self.difficulties
    
    def get_curriculum_subset(self, epoch, total_epochs):
        """
        Get sample indices for current epoch based on teacher difficulty
        
        Args:
            epoch: Current epoch (0-indexed)
            total_epochs: Total number of epochs
            
        Returns:
            indices: Array of selected sample indices
        """
        progress = epoch / total_epochs
        
        if self.schedule == 'linear':
            threshold = progress
        elif self.schedule == 'exponential':
            threshold = 1 - np.exp(-3 * progress)
        else:
            threshold = 1.0
        
        # Select samples below difficulty threshold
        max_difficulty = np.percentile(self.difficulties, threshold * 100)
        indices = np.where(self.difficulties <= max_difficulty)[0]
        
        return indices


# ============================================================================
# Competence-Based Curriculum
# ============================================================================

class CompetenceBasedCurriculum:
    """
    Competence-Based Curriculum (Graves et al., 2017)
    
    Theory:
    - Difficulty based on gradient signal-to-noise ratio
    - D_i = ||βˆ‡L_i|| / Var(L_i)
    - Prioritize samples with strong, consistent gradients
    """
    
    def __init__(self, model, window_size=100):
        """
        Args:
            model: Neural network
            window_size: Window for computing loss variance
        """
        self.model = model
        self.window_size = window_size
        self.loss_history = defaultdict(list)  # Per-sample loss history
    
    def compute_competence_scores(self, dataset, device='cpu'):
        """
        Compute competence score for each sample
        
        Returns:
            scores: (N,) array where higher = better for learning
        """
        loader = DataLoader(dataset, batch_size=1, shuffle=False)
        scores = []
        
        for idx, (x, y) in enumerate(loader):
            x, y = x.to(device), y.to(device)
            x.requires_grad = True
            
            # Forward pass
            output = self.model(x)
            loss = F.cross_entropy(output, y)
            
            # Compute gradient magnitude
            loss.backward()
            grad_norm = torch.norm(x.grad).item()
            
            # Track loss history
            self.loss_history[idx].append(loss.item())
            if len(self.loss_history[idx]) > self.window_size:
                self.loss_history[idx].pop(0)
            
            # Compute variance
            if len(self.loss_history[idx]) >= 2:
                loss_var = np.var(self.loss_history[idx])
                loss_var = max(loss_var, 1e-6)  # Avoid division by zero
            else:
                loss_var = 1.0
            
            # Competence = gradient magnitude / loss variance
            competence = grad_norm / loss_var
            scores.append(competence)
        
        return np.array(scores)


# ============================================================================
# Multi-Task Curriculum
# ============================================================================

class MultiTaskCurriculum:
    """
    Curriculum across multiple tasks
    
    Theory:
    - Train on easier tasks first, progressively add harder tasks
    - L_total = Ξ£ w_k(t)Β·L_k where w_k increases for harder tasks
    """
    
    def __init__(self, model, task_difficulties, schedule='linear'):
        """
        Args:
            model: Multi-task model with task-specific heads
            task_difficulties: List of task difficulty scores (higher = harder)
            schedule: How to schedule task weights over time
        """
        self.model = model
        self.task_difficulties = np.array(task_difficulties)
        self.num_tasks = len(task_difficulties)
        self.schedule = schedule
        
        # Sort tasks by difficulty
        self.task_order = np.argsort(self.task_difficulties)
    
    def get_task_weights(self, epoch, total_epochs):
        """
        Compute task weights for current epoch
        
        Returns:
            weights: (num_tasks,) array of task weights
        """
        progress = epoch / total_epochs
        weights = np.zeros(self.num_tasks)
        
        if self.schedule == 'sequential':
            # Train one task at a time in order
            task_idx = min(int(progress * self.num_tasks), self.num_tasks - 1)
            weights[self.task_order[task_idx]] = 1.0
            
        elif self.schedule == 'progressive':
            # Gradually add tasks
            num_active = int(progress * self.num_tasks) + 1
            for i in range(num_active):
                weights[self.task_order[i]] = 1.0
            weights = weights / weights.sum()
            
        elif self.schedule == 'smooth':
            # Smooth transition with sigmoid
            for i, task_idx in enumerate(self.task_order):
                # Activate task i at progress i/num_tasks
                activation_point = i / self.num_tasks
                weights[task_idx] = 1 / (1 + np.exp(-10 * (progress - activation_point)))
            weights = weights / weights.sum()
        
        return weights


# ============================================================================
# Adaptive Curriculum Scheduler
# ============================================================================

class AdaptiveCurriculumScheduler:
    """
    Adaptively adjust curriculum pace based on validation performance
    
    Theory:
    - If validation accuracy increases, accelerate curriculum
    - If validation accuracy decreases, slow down or revert
    """
    
    def __init__(self, initial_threshold=0.3, acceleration=0.1, 
                 patience=3, min_threshold=0.0, max_threshold=1.0):
        """
        Args:
            initial_threshold: Starting difficulty threshold
            acceleration: How much to increase threshold on success
            patience: Epochs to wait before adjusting
            min_threshold: Minimum threshold value
            max_threshold: Maximum threshold value
        """
        self.threshold = initial_threshold
        self.acceleration = acceleration
        self.patience = patience
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        
        self.best_val_acc = 0.0
        self.wait = 0
        self.threshold_history = [initial_threshold]
    
    def step(self, val_acc):
        """
        Update threshold based on validation accuracy
        
        Args:
            val_acc: Current validation accuracy
            
        Returns:
            new_threshold: Updated difficulty threshold
        """
        if val_acc > self.best_val_acc:
            # Improvement: accelerate curriculum
            self.best_val_acc = val_acc
            self.wait = 0
            self.threshold = min(self.threshold + self.acceleration, self.max_threshold)
        else:
            # No improvement: wait or decelerate
            self.wait += 1
            if self.wait >= self.patience:
                # Slow down curriculum
                self.threshold = max(self.threshold - self.acceleration / 2, self.min_threshold)
                self.wait = 0
        
        self.threshold_history.append(self.threshold)
        return self.threshold


# ============================================================================
# Mixture Curriculum
# ============================================================================

class MixtureCurriculum:
    """
    Sample batches with mixture of difficulties
    
    Theory:
    - Batch_t = Ξ±(t)Β·Easy + (1-Ξ±(t))Β·Hard
    - Prevents overfitting to easy examples
    - Maintains gradient diversity
    """
    
    def __init__(self, dataset, difficulties, batch_size=128, 
                 easy_ratio_start=0.8, easy_ratio_end=0.2):
        """
        Args:
            dataset: Full dataset
            difficulties: (N,) array of sample difficulties
            batch_size: Batch size
            easy_ratio_start: Initial fraction of easy samples in batch
            easy_ratio_end: Final fraction of easy samples in batch
        """
        self.dataset = dataset
        self.difficulties = difficulties
        self.batch_size = batch_size
        self.easy_ratio_start = easy_ratio_start
        self.easy_ratio_end = easy_ratio_end
        
        # Split into easy/medium/hard
        self.easy_idx = np.where(difficulties < np.percentile(difficulties, 33))[0]
        self.medium_idx = np.where((difficulties >= np.percentile(difficulties, 33)) & 
                                   (difficulties < np.percentile(difficulties, 67)))[0]
        self.hard_idx = np.where(difficulties >= np.percentile(difficulties, 67))[0]
    
    def get_mixed_batch(self, epoch, total_epochs):
        """
        Sample batch with mixture of difficulties
        
        Returns:
            batch_indices: Indices for current batch
        """
        progress = epoch / total_epochs
        
        # Linear interpolation of easy ratio
        easy_ratio = self.easy_ratio_start + (self.easy_ratio_end - self.easy_ratio_start) * progress
        hard_ratio = 1 - easy_ratio
        
        # Sample counts
        num_easy = int(self.batch_size * easy_ratio * 0.5)
        num_medium = int(self.batch_size * easy_ratio * 0.5)
        num_hard = self.batch_size - num_easy - num_medium
        
        # Sample from each difficulty level
        easy_samples = np.random.choice(self.easy_idx, size=num_easy, replace=False)
        medium_samples = np.random.choice(self.medium_idx, size=num_medium, replace=False)
        hard_samples = np.random.choice(self.hard_idx, size=num_hard, replace=False)
        
        # Combine and shuffle
        batch_indices = np.concatenate([easy_samples, medium_samples, hard_samples])
        np.random.shuffle(batch_indices)
        
        return batch_indices


# ============================================================================
# Curriculum Evaluation Tools
# ============================================================================

class CurriculumEvaluator:
    """Tools for evaluating and visualizing curriculum effectiveness"""
    
    @staticmethod
    def plot_difficulty_distribution(difficulties, num_bins=50):
        """Plot histogram of difficulty scores"""
        plt.figure(figsize=(10, 5))
        plt.hist(difficulties, bins=num_bins, alpha=0.7, edgecolor='black')
        plt.xlabel('Difficulty Score', fontsize=12)
        plt.ylabel('Frequency', fontsize=12)
        plt.title('Difficulty Distribution', fontsize=13)
        plt.axvline(difficulties.mean(), color='red', linestyle='--', 
                   label=f'Mean: {difficulties.mean():.3f}')
        plt.axvline(np.median(difficulties), color='green', linestyle='--',
                   label=f'Median: {np.median(difficulties):.3f}')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        return plt.gcf()
    
    @staticmethod
    def plot_curriculum_progress(threshold_history, sample_counts):
        """Plot curriculum progression over epochs"""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        epochs = np.arange(len(threshold_history))
        
        # Threshold progression
        ax1.plot(epochs, threshold_history, 'b-o', linewidth=2, markersize=6)
        ax1.fill_between(epochs, threshold_history, alpha=0.3)
        ax1.set_xlabel('Epoch', fontsize=12)
        ax1.set_ylabel('Difficulty Threshold', fontsize=12)
        ax1.set_title('Curriculum Threshold Progression', fontsize=13)
        ax1.grid(True, alpha=0.3)
        
        # Sample count progression
        ax2.plot(epochs, sample_counts, 'g-o', linewidth=2, markersize=6)
        ax2.fill_between(epochs, sample_counts, alpha=0.3, color='green')
        ax2.set_xlabel('Epoch', fontsize=12)
        ax2.set_ylabel('Number of Training Samples', fontsize=12)
        ax2.set_title('Training Set Size Over Time', fontsize=13)
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        return fig
    
    @staticmethod
    def compare_curricula(results_dict, metric='test_accuracy'):
        """
        Compare multiple curriculum strategies
        
        Args:
            results_dict: {name: {'train_loss': [...], 'test_acc': [...]}}
            metric: Which metric to plot ('train_loss' or 'test_accuracy')
        """
        plt.figure(figsize=(12, 6))
        
        for name, results in results_dict.items():
            epochs = np.arange(len(results[metric]))
            plt.plot(epochs, results[metric], '-o', label=name, linewidth=2, markersize=5)
        
        plt.xlabel('Epoch', fontsize=12)
        plt.ylabel(metric.replace('_', ' ').title(), fontsize=12)
        plt.title(f'Curriculum Strategy Comparison - {metric.replace("_", " ").title()}', 
                 fontsize=13)
        plt.legend(fontsize=11)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        return plt.gcf()
    
    @staticmethod
    def plot_sample_weight_evolution(weight_history, sample_indices):
        """
        Plot how sample weights evolve over training (for SPL)
        
        Args:
            weight_history: (epochs, num_samples) array of weights
            sample_indices: Indices of samples to track
        """
        plt.figure(figsize=(12, 6))
        
        epochs = np.arange(weight_history.shape[0])
        
        for idx in sample_indices:
            plt.plot(epochs, weight_history[:, idx], '-', 
                    label=f'Sample {idx}', alpha=0.7, linewidth=2)
        
        plt.xlabel('Epoch', fontsize=12)
        plt.ylabel('Sample Weight', fontsize=12)
        plt.title('Sample Weight Evolution (Self-Paced Learning)', fontsize=13)
        plt.legend(fontsize=10, ncol=2)
        plt.grid(True, alpha=0.3)
        plt.ylim(-0.05, 1.05)
        plt.tight_layout()
        return plt.gcf()


# ============================================================================
# Demonstration
# ============================================================================

print("Advanced Curriculum Learning Methods Implemented:")
print("=" * 70)
print("1. SelfPacedLearning - Joint optimization of model and sample weights")
print("2. TransferTeacherCurriculum - Teacher-guided difficulty scoring")
print("3. CompetenceBasedCurriculum - Gradient signal-to-noise ratio")
print("4. MultiTaskCurriculum - Progressive task scheduling")
print("5. AdaptiveCurriculumScheduler - Performance-based pacing")
print("6. MixtureCurriculum - Mixed-difficulty batches")
print("7. CurriculumEvaluator - Visualization and comparison tools")
print("=" * 70)

# Example: Self-Paced Learning
print("\nExample: Self-Paced Learning")
print("-" * 70)

# Simple dataset and model for demonstration
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)

train_loader_simple = DataLoader(mnist_train, batch_size=128, shuffle=True)
test_loader_simple = DataLoader(mnist_test, batch_size=1000, shuffle=False)

# Simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_spl = SimpleNet().to(device)
optimizer_spl = torch.optim.Adam(model_spl.parameters(), lr=1e-3)

# Initialize SPL
spl = SelfPacedLearning(
    model_spl, 
    lambda_init=0.5, 
    lambda_growth=1.15,
    soft=True,
    temperature=1.0
)

print(f"Initial Ξ»: {spl.lambda_param:.3f}")
print("Training for 5 epochs with Self-Paced Learning...")

inclusion_rates = []
for epoch in range(5):
    avg_loss, inclusion_rate = spl.train_epoch(train_loader_simple, optimizer_spl, device)
    inclusion_rates.append(inclusion_rate)
    
    # Quick eval
    model_spl.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for x, y in test_loader_simple:
            x, y = x.to(device), y.to(device)
            output = model_spl(x)
            pred = output.argmax(1)
            correct += (pred == y).sum().item()
            total += len(y)
    
    acc = 100 * correct / total
    print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Ξ»={spl.lambda_param:.3f}, "
          f"Inclusion={inclusion_rate:.2%}, Test Acc={acc:.2f}%")

# Visualize sample weight distribution
plt.figure(figsize=(10, 5))
plt.hist(spl.sample_weights, bins=50, alpha=0.7, edgecolor='black', color='purple')
plt.xlabel('Sample Weight', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Final Sample Weight Distribution (Self-Paced Learning)', fontsize=13)
plt.axvline(0.5, color='red', linestyle='--', label='Inclusion Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nFinal statistics:")
print(f"  Samples with weight > 0.5: {(spl.sample_weights > 0.5).sum()}/{len(spl.sample_weights)}")
print(f"  Mean weight: {spl.sample_weights.mean():.3f}")
print(f"  Weight std: {spl.sample_weights.std():.3f}")

print("\n" + "=" * 70)
print("Key Takeaways:")
print("=" * 70)
print("1. Self-Paced Learning: Model automatically selects easy samples early")
print("2. Ξ» parameter: Controls pacing (increases over time to include harder samples)")
print("3. Soft weights: Smoother than hard 0/1, better gradient flow")
print("4. Inclusion rate: Starts low, gradually increases as model improves")
print("5. Applications: Noisy labels, imbalanced data, domain adaptation")
print("=" * 70)

1. Curriculum LearningΒΆ

ConceptΒΆ

Train on easier examples first, gradually increase difficulty:

\[\mathcal{D}_t = \{(x_i, y_i) : \text{difficulty}(x_i) \leq \lambda(t)\}\]

where \(\lambda(t)\) increases with training step \(t\).

Benefits:ΒΆ

  • Faster convergence

  • Better generalization

  • Escape local minima

πŸ“š Reference Materials:

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

# Load MNIST
transform = transforms.Compose([transforms.ToTensor()])
mnist = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_mnist = datasets.MNIST('./data', train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(mnist, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_mnist, batch_size=1000)

print("Data loaded")

Difficulty ScoringΒΆ

Curriculum learning requires a way to rank training examples by difficulty. Common approaches include loss-based scoring (examples with higher initial loss are harder), model confidence (examples predicted with lower probability are harder), and domain heuristics (e.g., sentence length for NLP, image complexity for vision). The difficulty scores determine the order in which examples are presented during training – starting with easy examples and gradually introducing harder ones. The scoring function should correlate with genuine learning difficulty, not just noise, so using a pre-trained model’s predictions often works better than random initialization loss.

def compute_difficulty_scores(model, dataset):
    """Compute difficulty for each sample."""
    model.eval()
    difficulties = []
    
    loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=False)
    
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            output = model(x)
            
            # Difficulty = 1 - confidence on true class
            probs = F.softmax(output, dim=1)
            confidence = probs[torch.arange(len(y)), y]
            difficulty = 1 - confidence
            
            difficulties.extend(difficulty.cpu().numpy())
    
    return np.array(difficulties)

# Initialize model and compute initial difficulties
model = SimpleNet().to(device)
difficulties = compute_difficulty_scores(model, mnist)

plt.figure(figsize=(10, 5))
plt.hist(difficulties, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Difficulty Score', fontsize=11)
plt.ylabel('Frequency', fontsize=11)
plt.title('Initial Difficulty Distribution', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Mean difficulty: {difficulties.mean():.3f}")

Curriculum SchedulerΒΆ

The curriculum scheduler controls how quickly the training set expands from easy to hard examples. Common schedules include linear (add a fixed fraction of harder examples each epoch), exponential (double the effective dataset size at regular intervals), and self-paced (let the model’s current performance determine when to include harder examples). The key hyperparameter is the pace: too fast and the curriculum collapses to standard random training; too slow and the model wastes time on easy examples it has already mastered. Self-paced learning adaptively adjusts the difficulty threshold based on the model’s current loss distribution, providing an automatic curriculum.

class CurriculumScheduler:
    """Schedule curriculum difficulty."""
    
    def __init__(self, difficulties, schedule='linear', epochs=10):
        self.difficulties = difficulties
        self.schedule = schedule
        self.epochs = epochs
    
    def get_threshold(self, epoch):
        """Get difficulty threshold for epoch."""
        progress = epoch / self.epochs
        
        if self.schedule == 'linear':
            threshold = progress
        elif self.schedule == 'exponential':
            threshold = 1 - np.exp(-3 * progress)
        elif self.schedule == 'root':
            threshold = np.sqrt(progress)
        else:
            threshold = 1.0
        
        return threshold
    
    def get_indices(self, epoch):
        """Get sample indices for current epoch."""
        threshold = self.get_threshold(epoch)
        max_difficulty = np.percentile(self.difficulties, threshold * 100)
        indices = np.where(self.difficulties <= max_difficulty)[0]
        return indices

# Visualize schedules
scheduler = CurriculumScheduler(difficulties, epochs=10)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

epochs = np.arange(10)
for schedule in ['linear', 'exponential', 'root']:
    scheduler.schedule = schedule
    thresholds = [scheduler.get_threshold(e) for e in epochs]
    axes[0].plot(epochs, thresholds, marker='o', label=schedule.capitalize())

axes[0].set_xlabel('Epoch', fontsize=11)
axes[0].set_ylabel('Difficulty Threshold', fontsize=11)
axes[0].set_title('Curriculum Schedules', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Sample sizes
scheduler.schedule = 'linear'
sizes = [len(scheduler.get_indices(e)) for e in epochs]
axes[1].bar(epochs, sizes, alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Epoch', fontsize=11)
axes[1].set_ylabel('Training Set Size', fontsize=11)
axes[1].set_title('Progressive Training Set Growth', fontsize=12)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Train with CurriculumΒΆ

Training with the curriculum involves starting from the easiest subset and progressively expanding to include harder examples according to the scheduler. At each stage, the model trains on the current subset until performance plateaus, then the difficulty threshold advances. This staged approach often reaches higher final accuracy and converges faster than standard training, particularly for datasets with significant difficulty variance or noisy labels. The intuition mirrors human learning: mastering fundamentals before tackling advanced material builds a more robust foundation.

def train_with_curriculum(model, dataset, test_loader, scheduler, n_epochs=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    train_losses = []
    test_accs = []
    
    for epoch in range(n_epochs):
        # Get curriculum subset
        indices = scheduler.get_indices(epoch)
        subset = torch.utils.data.Subset(dataset, indices)
        loader = torch.utils.data.DataLoader(subset, batch_size=128, shuffle=True)
        
        # Train
        model.train()
        epoch_loss = 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            
            output = model(x)
            loss = F.cross_entropy(output, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        
        train_losses.append(epoch_loss / len(loader))
        
        # Evaluate
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                output = model(x)
                pred = output.argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.size(0)
        
        acc = 100 * correct / total
        test_accs.append(acc)
        
        print(f"Epoch {epoch+1}: {len(indices)} samples, Acc: {acc:.2f}%")
    
    return train_losses, test_accs

# Train with curriculum
model_curriculum = SimpleNet().to(device)
scheduler = CurriculumScheduler(difficulties, schedule='linear', epochs=10)
losses_curr, accs_curr = train_with_curriculum(model_curriculum, mnist, test_loader, scheduler, 10)

Baseline TrainingΒΆ

For a fair comparison, we train the same model architecture with standard random-order training (no curriculum). The baseline uses identical hyperparameters (learning rate, batch size, optimizer, total epochs) so that any performance difference can be attributed to the training order rather than other factors. Running multiple random seeds for both curriculum and baseline training provides statistical significance for the comparison.

def train_baseline(model, train_loader, test_loader, n_epochs=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    train_losses = []
    test_accs = []
    
    for epoch in range(n_epochs):
        model.train()
        epoch_loss = 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            
            output = model(x)
            loss = F.cross_entropy(output, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        
        train_losses.append(epoch_loss / len(train_loader))
        
        # Evaluate
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for x, y in test_loader:
                x, y = x.to(device), y.to(device)
                output = model(x)
                pred = output.argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.size(0)
        
        acc = 100 * correct / total
        test_accs.append(acc)
        print(f"Epoch {epoch+1}, Acc: {acc:.2f}%")
    
    return train_losses, test_accs

# Train baseline
model_baseline = SimpleNet().to(device)
losses_base, accs_base = train_baseline(model_baseline, train_loader, test_loader, 10)

Compare ResultsΒΆ

Comparing the learning curves and final accuracies of curriculum training versus baseline random training reveals the impact of example ordering. A successful curriculum typically shows faster initial improvement (the model quickly learns from easy examples) and higher final accuracy (gradual difficulty increase provides better regularization). The benefit is most pronounced on tasks with high difficulty variance, noisy labels, or limited training data. On clean, balanced datasets the advantage may be small, which is itself an informative finding.

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training loss
axes[0].plot(losses_base, 'b-o', label='Baseline', markersize=5)
axes[0].plot(losses_curr, 'r-o', label='Curriculum', markersize=5)
axes[0].set_xlabel('Epoch', fontsize=11)
axes[0].set_ylabel('Training Loss', fontsize=11)
axes[0].set_title('Training Loss Comparison', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test accuracy
axes[1].plot(accs_base, 'b-o', label='Baseline', markersize=5)
axes[1].plot(accs_curr, 'r-o', label='Curriculum', markersize=5)
axes[1].set_xlabel('Epoch', fontsize=11)
axes[1].set_ylabel('Test Accuracy (%)', fontsize=11)
axes[1].set_title('Test Accuracy Comparison', fontsize=12)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal Accuracy - Baseline: {accs_base[-1]:.2f}%, Curriculum: {accs_curr[-1]:.2f}%")

SummaryΒΆ

Curriculum Learning:ΒΆ

  1. Easy-to-hard training progression

  2. Difficulty scoring based on model confidence

  3. Adaptive scheduling (linear, exponential, root)

  4. Faster convergence in early stages

Difficulty Metrics:ΒΆ

  • Model confidence (1 - p(y|x))

  • Loss magnitude

  • Prediction variance

  • Domain-specific heuristics

Applications:ΒΆ

  • Image classification

  • Language modeling

  • Reinforcement learning

  • Neural architecture search

Variants:ΒΆ

  • Self-paced learning: Model selects samples

  • Transfer teacher: Use pre-trained model

  • Anti-curriculum: Hard-to-easy for robustness

  • Dynamic curriculum: Adapt based on progress