Phase 25: Reinforcement Learning - Start Here
Train agents to make decisions through trial and error. This is the same family of techniques behind AlphaGo, game-playing AIs, and RLHF for LLMs.
What Is Reinforcement Learning?
Agent observes State → takes Action → receives Reward → updates Policy
↑__________________________________↓
(environment loop)
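The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the notebooks: `TwoArmedBandit`, `run_episode`, and all parameter values are hypothetical names and numbers chosen for the example, and the "policy" is just a running average of reward per action with epsilon-greedy selection.

```python
import random


class TwoArmedBandit:
    """Hypothetical toy environment: one state, two actions, random payouts."""

    def __init__(self, payout_probs=(0.2, 0.8), horizon=500, seed=0):
        self.payout_probs = payout_probs  # chance of reward 1.0 for each action
        self.horizon = horizon            # number of steps per episode
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # the only state

    def step(self, action):
        reward = 1.0 if self.rng.random() < self.payout_probs[action] else 0.0
        self.steps += 1
        done = self.steps >= self.horizon
        return 0, reward, done


def run_episode(env, epsilon=0.1, seed=1):
    rng = random.Random(seed)
    value = [0.0, 0.0]  # estimated average reward per action
    counts = [0, 0]     # how often each action was taken
    total_reward = 0.0

    state = env.reset()  # agent observes State
    done = False
    while not done:
        # takes Action: explore with probability epsilon, otherwise act greedily
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = 0 if value[0] >= value[1] else 1

        state, reward, done = env.step(action)  # receives Reward

        # updates Policy: incremental mean of the rewards seen for this action
        counts[action] += 1
        value[action] += (reward - value[action]) / counts[action]
        total_reward += reward

    return value, counts, total_reward


value, counts, total_reward = run_episode(TwoArmedBandit())
print("estimated values:", value)
print("action counts:", counts)
print("total reward:", total_reward)
```

With these assumed payout probabilities the agent should usually end up preferring the second action, but the exact numbers depend on the seeds.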
RL is also the engine behind RLHF (Reinforcement Learning from Human Feedback), the technique used to train assistants such as ChatGPT and Claude to be helpful and safe.
Notebooks in This Phase
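| Notebook | Topic |
|---|---|
| 01_markov_decision_processes.ipynb | MDPs: states, actions, rewards, Bellman equations |
| 02_q_learning.ipynb | Tabular Q-learning and temporal difference learning |
| 03_deep_q_networks.ipynb | DQN with neural networks (Atari games) |
| 04_policy_based_methods.ipynb | REINFORCE, Actor-Critic, PPO |
| 05_advanced_topics_applications.ipynb | RLHF, multi-agent RL, real-world applications |
| 06_practical_exercises.ipynb | OpenAI Gym environments, hands-on projects |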
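For orientation before the first notebook, the Bellman optimality equation for action values is written below in its standard textbook form. The notation is an assumption and may differ from what the notebooks use: $P$ is the transition probability, $R$ the reward, and $\gamma \in [0, 1)$ the discount factor.

```latex
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\,\Bigl[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \Bigr]
```

In words: the value of taking action $a$ in state $s$ is the expected immediate reward plus the discounted value of acting optimally from the next state onward.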
Key Algorithms
| Algorithm | Type | Use Case |
|---|---|---|
| Q-Learning | Value-based | Simple discrete action spaces |
| DQN | Value-based | Atari, discrete actions |
| PPO | Policy-based | Most practical RL tasks |
| SAC | Actor-Critic | Continuous control (robotics) |
| RLHF | Human feedback | Fine-tuning LLMs |
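As a taste of the simplest entry in the table, here is a minimal sketch of tabular Q-learning. The environment is a hypothetical five-state chain invented for this example (move left or right, reward 1 for reaching the last state), and the function name and hyperparameters are illustrative assumptions rather than values taken from the notebooks.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.

    States are 0..n_states-1. Action 0 moves left (staying put at state 0),
    action 1 moves right. Reaching the last state gives reward 1 and ends
    the episode; every other transition gives reward 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1

    for _ in range(episodes):
        state = 0
        while state != goal:
            # epsilon-greedy action selection, breaking ties at random
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])

            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0

            # temporal-difference target; the terminal state has no future value
            if next_state == goal:
                target = reward
            else:
                target = reward + gamma * max(q[next_state])

            # Q-learning update: move the estimate toward the target
            q[state][action] += alpha * (target - q[state][action])
            state = next_state

    return q


q = q_learning_chain()
# Greedy action in each non-terminal state (1 means "move right")
policy = [0 if row[0] > row[1] else 1 for row in q[:-1]]
print("greedy policy:", policy)
print("Q(start, right):", q[0][1])
```

Because this chain is deterministic, the learned value for moving right from the start state approaches `gamma ** 3`, the reward of 1 discounted over the three steps before the final move.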
Prerequisites
Neural Networks (Phase 06)
Probability and statistics (Phase 03)
PyTorch basics
Learning Path
01_markov_decision_processes.ipynb (start here)
02_q_learning.ipynb
03_deep_q_networks.ipynb
04_policy_based_methods.ipynb
05_advanced_topics_applications.ipynb
06_practical_exercises.ipynb (build and train agents)