RLHF & Constitutional AI: Alignment Training¶

How ChatGPT and Claude learned to be helpful, harmless, and honest: the complete alignment training pipeline.

The Three-Stage Alignment Pipeline¶

Stage 1: Supervised Fine-Tuning (SFT)
   Pretrained model + human-written ideal responses
   → Model learns to follow instructions

Stage 2: Reward Modeling (RM)
   Human comparisons: "response A is better than B"
   → Reward model learns human preferences

Stage 3: RLHF / PPO / DPO
   SFT model optimized to maximize reward model score
   → Model generates responses humans prefer

Modern Alternatives to RLHF¶

Method              Year   Key Idea                        Training
──────────────────────────────────────────────────────────────────────
RLHF + PPO          2022   RL policy gradient              Complex, unstable
DPO                 2023   Direct preference optimization  Simpler, no RL
Constitutional AI   2022   Self-critique with principles   Scalable supervision
GRPO                2024   Group relative policy opt.      DeepSeek's method
ORPO                2024   Odds ratio preference opt.      Single-stage SFT+align

# Install dependencies
# !pip install transformers trl datasets peft accelerate bitsandbytes
# TRL (Transformer Reinforcement Learning) by HuggingFace covers all alignment methods

1. Stage 1: Supervised Fine-Tuning (SFT)¶

from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
from peft import LoraConfig

# Example SFT dataset format (instruction-response pairs)
sft_data = [
    {
        'messages': [
            {'role': 'user', 'content': 'What is the capital of France?'},
            {'role': 'assistant', 'content': 'The capital of France is Paris.'}
        ]
    },
    {
        'messages': [
            {'role': 'user', 'content': 'Write a Python function to reverse a string.'},
            {'role': 'assistant', 'content': 'def reverse_string(s: str) -> str:\n    return s[::-1]'}
        ]
    },
]

def run_sft(
    model_id: str,
    dataset: Dataset,
    output_dir: str = './sft_output',
    num_epochs: int = 1
) -> None:
    """Run supervised fine-tuning with QLoRA."""
    from transformers import BitsAndBytesConfig
    import torch
    
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map='auto'
    )
    
    peft_config = LoraConfig(
        r=16, lora_alpha=32,
        target_modules=['q_proj', 'v_proj'],
        lora_dropout=0.05, bias='none',
        task_type='CAUSAL_LM'
    )
    
    training_args = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
    )
    
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        peft_config=peft_config,
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model()

print('SFT pipeline defined (see Phase 12 notebooks 02 and 04 for full walkthrough).')
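Under the hood, SFTTrainer flattens each `messages` list into a single training string using the model's chat template. The sketch below illustrates the idea with a made-up template (`<|user|>`, `<|assistant|>`, `<|end|>` are illustrative markers, not any real model's tokens; real models ship their own template via `tokenizer.apply_chat_template`):

```python
def format_chat(messages: list[dict]) -> str:
    """Flatten a messages list into one training string.

    Uses a simplified, hypothetical template for illustration only.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    parts.append('<|end|>')
    return '\n'.join(parts)

example = [
    {'role': 'user', 'content': 'What is the capital of France?'},
    {'role': 'assistant', 'content': 'The capital of France is Paris.'},
]
print(format_chat(example))
```

During SFT, the loss is typically computed only on the assistant tokens of this string, so the model learns to produce responses rather than to echo prompts.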

2. Stage 2: Reward Model Training¶

from trl import RewardTrainer, RewardConfig

# Reward model dataset format:
# For each prompt, human annotators rank multiple responses
# "chosen" = preferred response, "rejected" = dispreferred
reward_data_example = [
    {
        'prompt': 'How do I improve my writing?',
        'chosen': 'Practice daily journaling, read widely across genres, and seek feedback from others. Focus on clarity: say exactly what you mean with as few words as possible.',
        'rejected': 'Write more.'
    },
    {
        'prompt': 'What is 2 + 2?',
        'chosen': '4.',
        'rejected': '5.'
    },
]

def train_reward_model(
    base_model_id: str,
    preference_dataset,
    output_dir: str = './reward_model'
) -> None:
    """
    Train a reward model from preference data.
    The reward model outputs a scalar score for any (prompt, response) pair.
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    # Use a classification head on top of LLM
    from transformers import AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model_id,
        num_labels=1  # Scalar reward score
    )
    
    config = RewardConfig(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        learning_rate=1e-5,
    )
    
    trainer = RewardTrainer(
        model=model,
        args=config,
        train_dataset=preference_dataset,
        processing_class=tokenizer,
    )
    trainer.train()

print('Reward model training pipeline defined.')
print()
print('Key insight: The reward model maps (prompt + response) → scalar.')
print('Higher score = more aligned with human preferences.')
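The objective RewardTrainer minimizes on each (chosen, rejected) pair is a pairwise Bradley-Terry loss over the two scalar scores. A minimal stdlib sketch (the function name and example scores are illustrative):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the chosen response outscores the rejected one
for scores in [(2.0, 0.0), (0.5, 0.0), (0.0, 2.0)]:
    print(scores, round(bradley_terry_loss(*scores), 4))
```

Minimizing this loss pushes the scalar head to rank chosen responses above rejected ones; the absolute scores have no fixed scale, only the margins matter.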

3. Stage 3a: PPO (Proximal Policy Optimization)¶

from trl import PPOTrainer, PPOConfig

# PPO training loop:
# 1. Actor (SFT model) generates responses
# 2. Reward model scores them
# 3. PPO updates actor to maximize reward
# 4. KL divergence penalty prevents drifting too far from SFT model

def ppo_training_loop_pseudocode():
    """
    Conceptual PPO training loop for RLHF.
    In practice, use trl.PPOTrainer which handles all the details.
    """
    # Setup:
    # actor = sft_model (policy to be optimized)
    # reference = frozen copy of sft_model (KL reference)
    # reward_model = trained reward model
    # critic = value network (estimates baseline reward)
    
    for batch in training_prompts:
        # Step 1: Generate responses with actor
        responses = actor.generate(batch['prompts'], do_sample=True)
        
        # Step 2: Score with reward model
        rewards = reward_model(batch['prompts'], responses)
        
        # Step 3: KL penalty (prevents reward hacking)
        actor_logprobs = actor.log_probs(responses)
        ref_logprobs = reference.log_probs(responses)
        kl_penalty = actor_logprobs - ref_logprobs  # penalize divergence
        adjusted_rewards = rewards - 0.1 * kl_penalty  # β=0.1 is typical
        
        # Step 4: PPO update (clipped surrogate objective)
        actor.ppo_step(adjusted_rewards)

print('PPO RLHF conceptual loop explained.')
print()
print('Why PPO is hard:')
print('  • Requires 4 models in memory: actor, reference, reward, critic')
print('  • Unstable training: KL coefficient needs careful tuning')
print('  • Reward hacking: model learns to fool reward model')
print('  • → DPO was invented to avoid all of this')
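Step 3 of the loop above, the KL-adjusted reward, is simple enough to sketch concretely in pure Python (the function name and numbers are illustrative; real implementations apply the penalty per token):

```python
def kl_adjusted_rewards(rewards, actor_logprobs, ref_logprobs, beta=0.1):
    """r' = r - beta * (logp_actor - logp_ref), per sample."""
    return [r - beta * (a - ref)
            for r, a, ref in zip(rewards, actor_logprobs, ref_logprobs)]

# Two responses with equal reward-model score: the second is one the actor
# now assigns much higher probability than the reference does, so it pays
# a KL penalty that discourages drifting from the SFT model.
print(kl_adjusted_rewards([1.0, 1.0], [-2.0, -2.0], [-2.0, -5.0]))
```

This penalty is what keeps the policy from collapsing onto whatever degenerate outputs happen to fool the reward model.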

4. Stage 3b: DPO (Direct Preference Optimization), the Simpler Way¶

from trl import DPOTrainer, DPOConfig
from peft import LoraConfig, get_peft_model
import torch

# DPO eliminates the need for a separate reward model.
# It directly optimizes the policy on preference data.
# Loss: -log σ(β * [log(π(chosen|x)/π_ref(chosen|x)) - log(π(rejected|x)/π_ref(rejected|x))])
# → Increase probability of chosen, decrease probability of rejected

def run_dpo(
    sft_model_path: str,
    preference_dataset,
    output_dir: str = './dpo_output',
    beta: float = 0.1  # KL coefficient; higher = more conservative
) -> None:
    """
    DPO training, no reward model needed!
    beta: temperature for the DPO loss (0.1 is standard)
    """
    tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
    model = AutoModelForCausalLM.from_pretrained(
        sft_model_path,
        torch_dtype=torch.bfloat16,
        device_map='auto'
    )
    
    peft_config = LoraConfig(
        r=64, lora_alpha=16,
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        bias='none', task_type='CAUSAL_LM'
    )
    model = get_peft_model(model, peft_config)
    
    dpo_config = DPOConfig(
        output_dir=output_dir,
        beta=beta,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-7,  # Very small LR for DPO
        bf16=True,
    )
    
    trainer = DPOTrainer(
        model=model,
        args=dpo_config,
        train_dataset=preference_dataset,
        processing_class=tokenizer,
    )
    trainer.train()

print('DPO training pipeline defined.')
print()
print('DPO vs PPO:')
print('  DPO: simpler, stable, no reward model needed → recommended for most cases')
print('  PPO: more flexible but requires 4 models and careful hyperparameter tuning')
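The loss DPOTrainer minimizes can be sketched in pure Python on per-response log-probabilities (a simplification: real training sums token log-probs and averages over a batch; the function name and numbers here are illustrative):

```python
import math

def dpo_loss(logp_chosen, ref_logp_chosen,
             logp_rejected, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers chosen relative to the reference: low loss
print(round(dpo_loss(-10.0, -12.0, -14.0, -12.0), 4))
# Policy prefers rejected: high loss, gradient pushes probabilities apart
print(round(dpo_loss(-14.0, -12.0, -10.0, -12.0), 4))
```

Note that only log-ratios against the frozen reference enter the loss: that is the implicit reward, which is why no separately trained reward model is needed.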

5. Constitutional AI: Self-Supervised Alignment¶

from openai import OpenAI

client = OpenAI()

# Constitutional AI (Anthropic, 2022):
# Instead of expensive human feedback, use principles + AI critique
# 1. Generate initial response
# 2. Critique it against a "constitution" of principles
# 3. Revise based on the critique
# 4. Use revised responses as SFT data or preference pairs

CONSTITUTION = [
    'Please choose the response that is the most helpful, harmless, and honest.',
    'Please choose the response that best avoids discrimination based on race, gender, or religion.',
    'Please choose the response that is most truthful and does not spread misinformation.',
    'Please choose the response that does not assist with illegal activities.',
    'Please choose the response that respects individual privacy.',
]

def constitutional_critique_and_revise(prompt: str, initial_response: str, principle: str) -> str:
    """
    Critique a response against a principle, then generate a better version.
    This is the core loop of Constitutional AI.
    """
    # Step 1: Critique
    critique_prompt = f"""Human: {prompt}

Assistant: {initial_response}

Critique: Identify any ways in which the above response violates this principle:
"{principle}"
If there are no violations, say "No violations found."
"""

    critique = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': critique_prompt}],
        max_tokens=300
    ).choices[0].message.content
    
    # Step 2: Revise
    revision_prompt = f"""Human: {prompt}

Assistant: {initial_response}

Critique: {critique}

Revision: Please rewrite the assistant response to fix the issues identified in the critique."""

    revised = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': revision_prompt}],
        max_tokens=500
    ).choices[0].message.content
    
    return revised

def constitutional_ai_pipeline(prompt: str, model: str = 'gpt-4o-mini') -> dict:
    """
    Full Constitutional AI pipeline:
    1. Generate initial response
    2. Apply each constitutional principle
    3. Return progressively refined response
    """
    # Initial response
    initial = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=500
    ).choices[0].message.content
    
    current = initial
    revisions = []
    
    # Apply 2 random principles (applying all is expensive)
    import random
    selected_principles = random.sample(CONSTITUTION, 2)
    
    for principle in selected_principles:
        revised = constitutional_critique_and_revise(prompt, current, principle)
        revisions.append({'principle': principle, 'revision': revised})
        current = revised
    
    return {'initial': initial, 'final': current, 'revisions': revisions}

# Demo
print('Constitutional AI pipeline defined.')
print()
print('This is how Anthropic trained Claude to be helpful, harmless, and honest:')
print('  1. Generate many (prompt, response) pairs')
print('  2. Critique and revise each with constitutional principles')
print('  3. Use original vs. revised as DPO preference pairs')
print('  4. Train the model on these AI-generated preferences')
print()
print('Key benefit: no expensive human labeling, so supervision scales automatically.')
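Step 3 in the recipe printed above, turning Constitutional AI outputs into DPO preference pairs, is a small data transformation. A stdlib sketch (the helper name is made up; the output fields match the preference format used in the DPO section):

```python
def to_preference_pairs(cai_results: list[dict]) -> list[dict]:
    """Convert Constitutional AI records into DPO-style preference pairs.

    The revised response becomes 'chosen', the initial one 'rejected'.
    """
    return [
        {'prompt': r['prompt'], 'chosen': r['final'], 'rejected': r['initial']}
        for r in cai_results
        if r['final'] != r['initial']  # skip responses the critique left unchanged
    ]

example = [{'prompt': 'Q?', 'initial': 'rough answer', 'final': 'revised answer'}]
print(to_preference_pairs(example))
```

The resulting list can be fed straight into the `run_dpo` pipeline defined earlier, closing the loop from AI-generated critiques to alignment training.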

6. Build Your Own Constitution¶

# You can define domain-specific constitutions for your use case

CUSTOMER_SERVICE_CONSTITUTION = [
    'Choose the response that best resolves the customer\'s issue efficiently.',
    'Choose the response that is most empathetic and professional.',
    'Choose the response that follows company policy and does not make unauthorized promises.',
    'Choose the response that escalates appropriately when the issue is beyond scope.',
]

MEDICAL_AI_CONSTITUTION = [
    'Choose the response that always recommends consulting a licensed medical professional.',
    'Choose the response that does not provide specific dosage or treatment recommendations.',
    'Choose the response that presents information accurately with appropriate uncertainty.',
    'Choose the response that is inclusive and does not make assumptions about patient demographics.',
]

CODING_ASSISTANT_CONSTITUTION = [
    'Choose the response with code that follows security best practices.',
    'Choose the response that is most readable and maintainable.',
    'Choose the response that handles edge cases and errors appropriately.',
    'Choose the response that uses the simplest solution that solves the problem.',
]

print('Domain-specific constitution examples:')
print()
for name, constitution in [
    ('Customer Service', CUSTOMER_SERVICE_CONSTITUTION),
    ('Medical AI', MEDICAL_AI_CONSTITUTION),
    ('Coding Assistant', CODING_ASSISTANT_CONSTITUTION)
]:
    print(f'{name}:')
    for principle in constitution[:2]:
        print(f'  - {principle[:70]}...')
    print()

Summary: Choose Your Alignment Method¶

Goal                                → Method
─────────────────────────────────────────────────────────
Best quality, have GPU + data       → PPO (original RLHF)
Simpler training, have preferences  → DPO (recommended)
No human data available             → Constitutional AI
Reasoning improvement               → GRPO (see notebook 08)
Single-stage SFT + alignment        → ORPO

Exercises¶

  1. Create a 20-example preference dataset for a specific domain (cooking, coding, etc.).

  2. Run the Constitutional AI pipeline on 5 prompts and compare initial vs. revised responses.

  3. Implement a simple reward model that scores responses on helpfulness (1-10 scale).

  4. Run DPO fine-tuning on a small model (GPT-2) with your preference dataset.

  5. Design a constitution for an AI tutor that avoids giving direct answers to homework.