RLHF & Constitutional AI: Alignment Training
How ChatGPT and Claude learned to be helpful, harmless, and honest: the complete alignment training pipeline.
The Three-Stage Alignment Pipeline
Stage 1: Supervised Fine-Tuning (SFT)
Pretrained model + human-written ideal responses
→ Model learns to follow instructions
Stage 2: Reward Modeling (RM)
Human comparisons: "response A is better than B"
→ Reward model learns human preferences
Stage 3: RLHF / PPO / DPO
SFT model optimized to maximize reward model score
→ Model generates responses humans prefer
Modern Alternatives to RLHF

| Method | Year | Key Idea | Training |
|---|---|---|---|
| RLHF + PPO | 2022 | RL policy gradient | Complex, unstable |
| DPO | 2023 | Direct preference optimization | Simpler, no RL |
| Constitutional AI | 2022 | Self-critique with principles | Scalable supervision |
| GRPO | 2024 | Group relative policy opt. | DeepSeek's method |
| ORPO | 2024 | Odds ratio preference opt. | Single-stage SFT+align |
# Install dependencies
# !pip install transformers trl datasets peft accelerate bitsandbytes
# TRL (Transformer Reinforcement Learning) by HuggingFace covers all alignment methods
1. Stage 1: Supervised Fine-Tuning (SFT)
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
from peft import LoraConfig
# Example SFT dataset format (instruction-response pairs)
sft_data = [
{
'messages': [
{'role': 'user', 'content': 'What is the capital of France?'},
{'role': 'assistant', 'content': 'The capital of France is Paris.'}
]
},
{
'messages': [
{'role': 'user', 'content': 'Write a Python function to reverse a string.'},
{'role': 'assistant', 'content': 'def reverse_string(s: str) -> str:\n return s[::-1]'}
]
},
]
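Before training, SFTTrainer renders each messages list into a single training string using the tokenizer's chat template. The exact template is model-specific; as an illustration only, here is a minimal ChatML-style formatter (the `<|im_start|>`/`<|im_end|>` markers are one common convention, not what every model uses):

```python
def format_chatml(messages: list[dict]) -> str:
    """Render a messages list in a ChatML-style layout (illustrative sketch)."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts)

example = [
    {'role': 'user', 'content': 'What is the capital of France?'},
    {'role': 'assistant', 'content': 'The capital of France is Paris.'},
]
print(format_chatml(example))
```

In practice you would call `tokenizer.apply_chat_template(messages)` instead, which uses the template shipped with the model.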
def run_sft(
model_id: str,
dataset: Dataset,
output_dir: str = './sft_output',
num_epochs: int = 1
) -> None:
"""Run supervised fine-tuning with QLoRA."""
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map='auto'
)
peft_config = LoraConfig(
r=16, lora_alpha=32,
target_modules=['q_proj', 'v_proj'],
lora_dropout=0.05, bias='none',
task_type='CAUSAL_LM'
)
training_args = SFTConfig(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
peft_config=peft_config,
processing_class=tokenizer,
)
trainer.train()
trainer.save_model()
print('SFT pipeline defined (see Phase 12 notebooks 02 and 04 for full walkthrough).')
2. Stage 2: Reward Model Training
from trl import RewardTrainer, RewardConfig
# Reward model dataset format:
# For each prompt, human annotators rank multiple responses
# "chosen" = preferred response, "rejected" = dispreferred
reward_data_example = [
{
'prompt': 'How do I improve my writing?',
'chosen': 'Practice daily journaling, read widely across genres, and seek feedback from others. Focus on clarity: say exactly what you mean with as few words as possible.',
'rejected': 'Write more.'
},
{
'prompt': 'What is 2 + 2?',
'chosen': '4.',
'rejected': '5.'
},
]
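The reward model is trained with a pairwise (Bradley-Terry) objective: for each preference pair, push the chosen response's score above the rejected one's. The loss itself is simple enough to write out; a minimal sketch with made-up scalar scores:

```python
import math

def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when chosen outscores rejected."""
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair the model already ranks correctly incurs a small loss...
print(pairwise_reward_loss(2.0, -1.0))
# ...while a misranked pair incurs a large one.
print(pairwise_reward_loss(-1.0, 2.0))
```

TRL's RewardTrainer computes this same loss over batches, with the scores coming from the sequence-classification head.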
def train_reward_model(
base_model_id: str,
preference_dataset,
output_dir: str = './reward_model'
) -> None:
"""
Train a reward model from preference data.
The reward model outputs a scalar score for any (prompt, response) pair.
"""
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Use a scalar classification head on top of the LLM
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
base_model_id,
num_labels=1 # Scalar reward score
)
config = RewardConfig(
output_dir=output_dir,
num_train_epochs=1,
per_device_train_batch_size=8,
learning_rate=1e-5,
)
trainer = RewardTrainer(
model=model,
args=config,
train_dataset=preference_dataset,
processing_class=tokenizer,
)
trainer.train()
print('Reward model training pipeline defined.')
print()
print('Key insight: The reward model maps (prompt + response) → scalar.')
print('Higher score = more aligned with human preferences.')
3. Stage 3a: PPO (Proximal Policy Optimization)
from trl import PPOTrainer, PPOConfig
# PPO training loop:
# 1. Actor (SFT model) generates responses
# 2. Reward model scores them
# 3. PPO updates actor to maximize reward
# 4. KL divergence penalty prevents drifting too far from SFT model
def ppo_training_loop_pseudocode():
"""
Conceptual PPO training loop for RLHF.
In practice, use trl.PPOTrainer which handles all the details.
"""
# Setup:
# actor = sft_model (policy to be optimized)
# reference = frozen copy of sft_model (KL reference)
# reward_model = trained reward model
# critic = value network (estimates baseline reward)
for batch in training_prompts:
# Step 1: Generate responses with actor
responses = actor.generate(batch['prompts'], do_sample=True)
# Step 2: Score with reward model
rewards = reward_model(batch['prompts'], responses)
# Step 3: KL penalty (prevents reward hacking)
actor_logprobs = actor.log_probs(responses)
ref_logprobs = reference.log_probs(responses)
kl_penalty = actor_logprobs - ref_logprobs # penalize divergence
adjusted_rewards = rewards - 0.1 * kl_penalty # β=0.1 is typical
# Step 4: PPO update (clipped surrogate objective)
actor.ppo_step(adjusted_rewards)
print('PPO RLHF conceptual loop explained.')
print()
print('Why PPO is hard:')
print(' • Requires 4 models in memory: actor, reference, reward, critic')
print(' • Unstable training: the KL coefficient needs careful tuning')
print(' • Reward hacking: the model learns to fool the reward model')
print(' → DPO was invented to avoid all of this')
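The "clipped surrogate objective" mentioned in step 4 is small enough to write out directly. A per-action scalar sketch (the real update is batched over tokens, with `advantage` coming from the critic):

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    ratio = pi_new(action) / pi_old(action). Clipping caps how much a single
    update can exploit one (possibly noisy) advantage estimate.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large probability ratio gets clipped, limiting the update size:
print(ppo_clip_objective(3.0, advantage=1.0))  # clipped to 1.2, not 3.0
```

The `min` makes the objective pessimistic: the policy never gains credit for moving further than the clip range in the direction the advantage favors.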
4. Stage 3b: DPO (Direct Preference Optimization), the Simpler Way
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig, get_peft_model
import torch
# DPO eliminates the need for a separate reward model.
# It directly optimizes the policy on preference data.
# Loss: -log σ(β * (log(π(chosen|x)/π_ref(chosen|x)) - log(π(rejected|x)/π_ref(rejected|x))))
# → Increase probability of chosen, decrease probability of rejected
def run_dpo(
sft_model_path: str,
preference_dataset,
output_dir: str = './dpo_output',
beta: float = 0.1 # KL coefficient; higher = more conservative
) -> None:
"""
DPO training: no reward model needed!
beta: temperature for the DPO loss (0.1 is standard)
"""
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
model = AutoModelForCausalLM.from_pretrained(
sft_model_path,
torch_dtype=torch.bfloat16,
device_map='auto'
)
peft_config = LoraConfig(
r=64, lora_alpha=16,
target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
bias='none', task_type='CAUSAL_LM'
)
model = get_peft_model(model, peft_config)
dpo_config = DPOConfig(
output_dir=output_dir,
beta=beta,
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-7, # Very small LR for DPO
bf16=True,
)
trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=preference_dataset,
processing_class=tokenizer,
)
trainer.train()
print('DPO training pipeline defined.')
print()
print('DPO vs PPO:')
print(' DPO: simpler, stable, no reward model needed → recommended for most cases')
print(' PPO: more flexible but requires 4 models and careful hyperparameter tuning')
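Once you have the four log-probabilities (policy and reference, chosen and rejected), the DPO loss is just arithmetic. A sketch with made-up values:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already favors the chosen response relative to the
# reference, the margin is positive and the loss falls below log(2):
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))
```

DPOTrainer computes exactly these per-sequence log-probabilities under the hood, which is why it needs the reference model (or a frozen copy of the PEFT base) but no reward model.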
5. Constitutional AI: Self-Supervised Alignment
from openai import OpenAI
client = OpenAI()
# Constitutional AI (Anthropic, 2022):
# Instead of expensive human feedback, use principles + AI critique
# 1. Generate initial response
# 2. Critique it against a "constitution" of principles
# 3. Revise based on the critique
# 4. Use revised responses as SFT data or preference pairs
CONSTITUTION = [
'Please choose the response that is the most helpful, harmless, and honest.',
'Please choose the response that best avoids discrimination based on race, gender, or religion.',
'Please choose the response that is most truthful and does not spread misinformation.',
'Please choose the response that does not assist with illegal activities.',
'Please choose the response that respects individual privacy.',
]
def constitutional_critique_and_revise(prompt: str, initial_response: str, principle: str) -> str:
"""
Critique a response against a principle, then generate a better version.
This is the core loop of Constitutional AI.
"""
# Step 1: Critique
critique_prompt = f"""Human: {prompt}
Assistant: {initial_response}
Critique: Identify any ways in which the above response violates this principle:
"{principle}"
If there are no violations, say "No violations found.\""""
critique = client.chat.completions.create(
model='gpt-4o-mini',
messages=[{'role': 'user', 'content': critique_prompt}],
max_tokens=300
).choices[0].message.content
# Step 2: Revise
revision_prompt = f"""Human: {prompt}
Assistant: {initial_response}
Critique: {critique}
Revision: Please rewrite the assistant response to fix the issues identified in the critique."""
revised = client.chat.completions.create(
model='gpt-4o-mini',
messages=[{'role': 'user', 'content': revision_prompt}],
max_tokens=500
).choices[0].message.content
return revised
def constitutional_ai_pipeline(prompt: str, model: str = 'gpt-4o-mini') -> dict:
"""
Full Constitutional AI pipeline:
1. Generate initial response
2. Apply each constitutional principle
3. Return progressively refined response
"""
# Initial response
initial = client.chat.completions.create(
model=model,
messages=[{'role': 'user', 'content': prompt}],
max_tokens=500
).choices[0].message.content
current = initial
revisions = []
# Apply 2 random principles (applying all is expensive)
import random
selected_principles = random.sample(CONSTITUTION, 2)
for principle in selected_principles:
revised = constitutional_critique_and_revise(prompt, current, principle)
revisions.append({'principle': principle, 'revision': revised})
current = revised
return {'initial': initial, 'final': current, 'revisions': revisions}
# Demo
print('Constitutional AI pipeline defined.')
print()
print('This is how Anthropic trained Claude to be helpful, harmless, and honest:')
print(' 1. Generate many (prompt, response) pairs')
print(' 2. Critique and revise each with constitutional principles')
print(' 3. Use original vs. revised as DPO preference pairs')
print(' 4. Train the model on these AI-generated preferences')
print()
print('Key benefit: no expensive human labeling is needed, so it scales automatically.')
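Step 3 above, turning initial vs. revised responses into DPO preference pairs, is a small transformation. A sketch, assuming the output dict shape of `constitutional_ai_pipeline` defined earlier:

```python
def to_preference_pair(prompt: str, result: dict) -> dict:
    """Treat the constitutionally revised answer as 'chosen', the first draft as 'rejected'."""
    return {
        'prompt': prompt,
        'chosen': result['final'],      # post-critique revision
        'rejected': result['initial'],  # raw initial response
    }

# With a stand-in result (no API call needed for the illustration):
fake_result = {'initial': 'Draft answer.', 'final': 'Improved answer.', 'revisions': []}
pair = to_preference_pair('Some prompt', fake_result)
print(pair['chosen'])
```

A list of such dicts matches the preference format used in the DPO section above, so the AI-generated pairs can feed straight into `run_dpo`.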
6. Build Your Own Constitution
# You can define domain-specific constitutions for your use case
CUSTOMER_SERVICE_CONSTITUTION = [
'Choose the response that best resolves the customer\'s issue efficiently.',
'Choose the response that is most empathetic and professional.',
'Choose the response that follows company policy and does not make unauthorized promises.',
'Choose the response that escalates appropriately when the issue is beyond scope.',
]
MEDICAL_AI_CONSTITUTION = [
'Choose the response that always recommends consulting a licensed medical professional.',
'Choose the response that does not provide specific dosage or treatment recommendations.',
'Choose the response that presents information accurately with appropriate uncertainty.',
'Choose the response that is inclusive and does not make assumptions about patient demographics.',
]
CODING_ASSISTANT_CONSTITUTION = [
'Choose the response with code that follows security best practices.',
'Choose the response that is most readable and maintainable.',
'Choose the response that handles edge cases and errors appropriately.',
'Choose the response that uses the simplest solution that solves the problem.',
]
print('Domain-specific constitution examples:')
print()
for name, constitution in [
('Customer Service', CUSTOMER_SERVICE_CONSTITUTION),
('Medical AI', MEDICAL_AI_CONSTITUTION),
('Coding Assistant', CODING_ASSISTANT_CONSTITUTION)
]:
print(f'{name}:')
for principle in constitution[:2]:
print(f' - {principle[:70]}...')
print()
Summary: Choose Your Alignment Method
Goal → Method
─────────────────────────────────────────────────────────
Best quality, have GPU + data → PPO (original RLHF)
Simpler training, have preferences → DPO (recommended)
No human data available → Constitutional AI
Reasoning improvement → GRPO (see notebook 08)
Single-stage SFT + alignment → ORPO
Exercises
1. Create a 20-example preference dataset for a specific domain (cooking, coding, etc.).
2. Run the Constitutional AI pipeline on 5 prompts and compare initial vs. revised responses.
3. Implement a simple reward model that scores responses on helpfulness (a 1-10 scale).
4. Run DPO fine-tuning on a small model (GPT-2) with your preference dataset.
5. Design a constitution for an AI tutor that avoids giving direct answers to homework.