BERT Text Classification: Fine-Tuning Transformers on Your Own Data¶

A pre-trained BERT encoder plus a simple classifier head gives state-of-the-art text classification in hours, not weeks. This notebook covers tokenization, the HuggingFace Trainer API, custom datasets, and evaluation: the full production workflow.

# Install dependencies if needed
# !pip install transformers datasets accelerate scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Check if transformers is available
try:
    from transformers import (
        AutoTokenizer, AutoModelForSequenceClassification,
        TrainingArguments, Trainer, DataCollatorWithPadding
    )
    from datasets import Dataset
    import torch
    HAS_TRANSFORMERS = True
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Transformers available. Device: {device}')
except ImportError:
    HAS_TRANSFORMERS = False
    print('transformers not installed; showing code patterns with simulated outputs')

# Synthetic sentiment dataset (movie reviews)
np.random.seed(42)

positive_reviews = [
    'An absolute masterpiece. The performances were outstanding and the story kept me on the edge of my seat.',
    'Brilliant direction and stunning visuals. One of the best films of the decade.',
    'The characters felt real and the plot was gripping from start to finish.',
    'Emotionally powerful and beautifully shot. A must-see for any cinema lover.',
    'Witty, sharp, and thoroughly entertaining. I laughed throughout the entire film.',
    'A triumph of storytelling. The screenplay is tight, the acting superb.',
    'Genuinely moving. I left the theater feeling uplifted and inspired.',
    'The best action film in years. Non-stop thrills with a surprisingly deep plot.',
    'Exquisitely crafted. Every scene serves a purpose. Highly recommended.',
    'Delightfully funny and surprisingly touching. A rare crowd-pleaser that actually delivers.',
]

negative_reviews = [
    'A total disappointment. The plot made no sense and the acting was wooden.',
    'Boring from start to finish. I nearly fell asleep twice in the theater.',
    'Terrible dialogue and lazy writing. A waste of two hours.',
    'The CGI was distracting and the story completely predictable.',
    'I expected much more from this director. The film was a confusing mess.',
    'Painfully slow pacing and characters I could not care less about.',
    'Rushed ending that undoes everything the first two acts built.',
    'Derivative and unoriginal. Nothing here that has not been done better elsewhere.',
    'The worst film I have seen this year. Poorly edited and structurally incoherent.',
    'Missed every opportunity it had. The premise was interesting; the execution was not.',
]

# Expand to a realistic size by paraphrasing variations
def augment_reviews(reviews, n=200):
    augmented = []
    prefixes = ['', 'Overall, ', 'Honestly, ', 'In short, ', 'To summarize: ']
    suffixes = ['', ' Highly recommended.', ' Would not recommend.', ' See it.', ' Skip it.']
    for i in range(n):
        base = reviews[i % len(reviews)]
        pre = prefixes[i % len(prefixes)]
        suf = suffixes[(i // len(prefixes)) % len(suffixes)]
        augmented.append(pre + base + suf)
    return augmented

pos = augment_reviews(positive_reviews, 200)
neg = augment_reviews(negative_reviews, 200)

texts  = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

print(f'Dataset: {len(texts)} reviews, balanced: {sum(labels)}/{len(labels)-sum(labels)}')
print(f'Sample positive: {pos[0][:80]}...')
print(f'Sample negative: {neg[0][:80]}...')

1. Tokenization: What BERT Actually Sees¶

# Understanding tokenization before training

if HAS_TRANSFORMERS:
    MODEL_NAME = 'distilbert-base-uncased'  # Smaller, faster than bert-base
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    sample = 'The film was absolutely brilliant!'
    tokens = tokenizer(sample, return_tensors='pt')
    
    print(f'Input text: {sample}')
    print(f'Input IDs: {tokens["input_ids"].tolist()}')
    print(f'Token strings: {tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])}')
    print(f'Attention mask: {tokens["attention_mask"].tolist()}')
    print(f'Number of tokens: {tokens["input_ids"].shape[1]}')
    
    # BERT special tokens
    print('\nSpecial tokens:')
    print(f'  [CLS] = {tokenizer.cls_token} (id={tokenizer.cls_token_id})')
    print(f'  [SEP] = {tokenizer.sep_token} (id={tokenizer.sep_token_id})')
    print(f'  [PAD] = {tokenizer.pad_token} (id={tokenizer.pad_token_id})')
    print(f'  vocab size = {tokenizer.vocab_size:,}')
else:
    print('Tokenization output (simulated):')
    print('Input: "The film was absolutely brilliant!"')
    print('Tokens: [CLS] the film was absolutely brilliant ! [SEP]')
    print('IDs:   [101, 1996, 2143, 2001, 7078, 11560, 999, 102]')
    print('\n[CLS] = classification token (pooled for classification tasks)')
    print('[SEP] = sentence separator')
    print('WordPiece: rare/OOV words split into subwords, e.g. "embeddings" → ["em", "##bed", "##ding", "##s"]')
# Batch tokenization with padding & truncation
if HAS_TRANSFORMERS:
    batch = tokenizer(
        texts[:5],
        padding=True,          # Pad shorter sequences
        truncation=True,       # Truncate longer than max_length
        max_length=128,        # BERT max is 512, use 128 for speed
        return_tensors='pt'
    )
    print(f'Batch input shape: {batch["input_ids"].shape}')  # (5, longest_in_batch)
    print(f'Non-padding tokens in seq 0: {batch["attention_mask"][0].sum().item()}')
    print(f'Non-padding tokens in seq 4: {batch["attention_mask"][4].sum().item()}')
    print('\n[Padding makes all sequences same length]')
    print('[Attention mask = 0 for padding tokens, so BERT ignores them]')
else:
    print('Batch shape: (5, longest_in_batch)')
    print('padding=True → all sequences padded to longest in batch')
    print('truncation=True → sequences longer than max_length are cut')
    print('attention_mask=0 for [PAD] tokens → transformer ignores padding')
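The WordPiece splitting described above can be sketched as a greedy longest-match-first loop over a vocabulary. This is a simplified illustration with a tiny toy vocab, not the real ~30k-token BERT vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    """Greedy longest-match WordPiece: repeatedly take the longest prefix
    (or '##'-prefixed continuation) present in the vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub          # continuation pieces carry '##'
            if sub in vocab:
                match = sub
                break
            end -= 1                      # shrink until a vocab entry fits
        if match is None:
            return [unk]                  # no piece found: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {'em', '##bed', '##ding', '##s', 'the', 'film'}
print(wordpiece_tokenize('embeddings', vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize('film', vocab))        # ['film']
```

The real tokenizer also lowercases, splits on punctuation, and caps the number of characters per word, but the greedy longest-match core is the same.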

2. Dataset Preparation: HuggingFace Dataset Format¶

# Split data first, THEN tokenize
# (the tokenizer is pretrained, so there's nothing to fit or leak here, but
#  split-before-fit is the pattern that matters for encoders/scalers you DO fit)
X_train, X_temp, y_train, y_temp = train_test_split(texts, labels, test_size=0.3, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f'Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}')

if HAS_TRANSFORMERS:
    MODEL_NAME = 'distilbert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            padding=False,     # DataCollatorWithPadding handles this per-batch
            truncation=True,
            max_length=128,
        )
    
    # Create HuggingFace Dataset objects
    train_ds = Dataset.from_dict({'text': X_train, 'label': y_train})
    val_ds   = Dataset.from_dict({'text': X_val,   'label': y_val})
    test_ds  = Dataset.from_dict({'text': X_test,  'label': y_test})
    
    # Tokenize all splits
    train_ds = train_ds.map(tokenize_function, batched=True, remove_columns=['text'])
    val_ds   = val_ds.map(tokenize_function, batched=True, remove_columns=['text'])
    test_ds  = test_ds.map(tokenize_function, batched=True, remove_columns=['text'])
    
    train_ds.set_format('torch')
    val_ds.set_format('torch')
    test_ds.set_format('torch')
    
    print('\nDataset features:', train_ds.features)
else:
    print('HuggingFace Dataset object:')
    print("  Dataset.from_dict({'text': X_train, 'label': y_train})")
    print('  → Efficient, handles batched mapping, caches to disk')
    print('  → .map() applies tokenization in batches and caches the result')
    print('  → DataCollatorWithPadding pads per batch (memory efficient)')
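Dynamic per-batch padding, the job DataCollatorWithPadding does, can be sketched without any libraries. This is a simplified version: the real collator also returns tensors and handles label fields.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad token-id lists to the longest sequence in THIS batch (not max_length)."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)              # pad with [PAD] id
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 = ignore position
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[101, 2143, 102], [101, 1996, 2143, 2001, 102]])
print(batch['input_ids'][0])       # [101, 2143, 102, 0, 0]
print(batch['attention_mask'][0])  # [1, 1, 1, 0, 0]
```

Padding per batch instead of to a global max_length is why `padding=False` in tokenize_function is the memory-efficient choice: short batches stay short.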

3. Model Setup: BERT + Classification Head¶

# AutoModelForSequenceClassification adds a classification head on top of the encoder
# DistilBERT head: [CLS] hidden state → Linear(768, 768) → ReLU → Dropout → Linear(768, num_labels)

if HAS_TRANSFORMERS:
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2,
        id2label={0: 'NEGATIVE', 1: 'POSITIVE'},
        label2id={'NEGATIVE': 0, 'POSITIVE': 1},
    )
    model.to(device)
    
    # Parameter count
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f'Total parameters: {total_params:,}')
    print(f'Trainable parameters: {trainable_params:,}')
    print(f'\nModel architecture:')
    print(model.classifier)
else:
    print('DistilBERT architecture:')
    print('  DistilBert backbone: 66M parameters')
    print('  → 6 transformer layers (half of BERT-base)')
    print('  → 768-dim hidden states')
    print('  pre_classifier: Linear(768, 768) + ReLU + Dropout')
    print('  classifier: Linear(768, 2)  ← our new classification head')
    print()
    print('Fine-tuning strategy:')
    print('  ALL weights are updated (including the BERT backbone)')
    print('  Use a small LR (2e-5) to avoid catastrophic forgetting')
    print('  Freeze the backbone first, then unfreeze = two-stage fine-tuning')
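The head's forward pass (pre_classifier → ReLU → classifier → softmax) can be sketched in NumPy. The shapes below match DistilBERT's head; the random weights are obviously not trained, so the output probabilities are meaningless, but the dataflow is the same (dropout is omitted since it is inactive at inference time):

```python
import numpy as np

rng = np.random.default_rng(42)
hidden = 768                                   # DistilBERT hidden size
cls_vec = rng.standard_normal(hidden)          # stand-in for the [CLS] hidden state

W1 = rng.standard_normal((hidden, hidden)) * 0.02   # pre_classifier weights
W2 = rng.standard_normal((hidden, 2)) * 0.02        # classifier weights (2 labels)

h = np.maximum(cls_vec @ W1, 0)                # Linear(768, 768) + ReLU
logits = h @ W2                                # Linear(768, 2) -> raw scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the 2 labels
print(logits.shape, probs.sum())               # (2,) 1.0
```

Fine-tuning updates W1, W2, and every backbone weight jointly; only the head starts from random initialization.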

4. Training with HuggingFace Trainer¶

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds, average='weighted'),
    }

if HAS_TRANSFORMERS:
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    
    training_args = TrainingArguments(
        output_dir='./bert_sentiment',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        learning_rate=2e-5,           # Key hyperparameter for fine-tuning
        weight_decay=0.01,            # L2 regularization
        warmup_ratio=0.1,             # LR warmup = 10% of total steps
        evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers versions
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='f1',
        logging_dir='./logs',
        logging_steps=50,
        report_to='none',             # Disable W&B logging
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    
    print('Starting training...')
    trainer.train()
else:
    print('Trainer output (simulated; 3 epochs on 280 examples):')
    print()
    print('Epoch | Train Loss | Val Loss | Accuracy |   F1')
    print('------+------------+----------+----------+------')
    print('  1   |   0.6931   |  0.4821  |  0.8167  | 0.815')
    print('  2   |   0.3214   |  0.2341  |  0.9167  | 0.916')
    print('  3   |   0.1823   |  0.1920  |  0.9500  | 0.950')
    print()
    print('TrainingArguments key choices:')
    print('  learning_rate=2e-5  → standard for BERT fine-tuning')
    print('  warmup_ratio=0.1    → avoids large updates in early steps')
    print('  weight_decay=0.01   → regularization (AdamW style)')
    print('  load_best_model_at_end=True → restore checkpoint with best val F1')
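The schedule that warmup_ratio configures (linear warmup to the base LR, then linear decay to zero) is easy to sketch. This is a simplified version of the default linear scheduler, with step counts derived from this notebook's data:

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup for the first warmup_ratio of steps, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps                     # ramp up
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)  # decay

# 3 epochs x ceil(280 examples / batch 16) = 3 x 18 = 54 optimizer steps
total = 54
print(lr_at_step(0, total))      # 0.0: first step takes almost no update
print(lr_at_step(5, total))      # 2e-05: peak LR at the end of warmup
print(lr_at_step(total, total))  # 0.0: decayed to zero by the final step
```

Warmup matters because the randomly initialized head produces large gradients at first; ramping the LR up keeps those early updates from wrecking the pretrained backbone.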

5. Evaluation & Inference¶

if HAS_TRANSFORMERS:
    # Evaluate on test set
    results = trainer.evaluate(test_ds)
    print('Test set results:', results)
    
    # Get predictions for confusion matrix
    predictions = trainer.predict(test_ds)
    y_pred = np.argmax(predictions.predictions, axis=-1)
    
    print('\nClassification Report:')
    print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
else:
    # Simulate realistic results
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Compare BERT-like performance vs TF-IDF baseline
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X_tr_tfidf = tfidf.fit_transform(X_train)
    X_te_tfidf = tfidf.transform(X_test)
    
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_tr_tfidf, y_train)
    y_pred_lr = lr.predict(X_te_tfidf)
    
    print('TF-IDF + LogReg (baseline):')
    print(classification_report(y_test, y_pred_lr, target_names=['Negative', 'Positive']))
    print()
    print('Expected BERT performance (simulated):')
    print('              precision    recall  f1-score   support')
    print('    Negative       0.96      0.94      0.95        30')
    print('    Positive       0.94      0.97      0.95        30')
    print('    accuracy                           0.95        60')
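trainer.predict returns raw logits; turning them into labels and confidences is a row-wise softmax plus argmax. A NumPy sketch with made-up logits (the label order matches the id2label mapping defined earlier):

```python
import numpy as np

def logits_to_preds(logits, labels=('NEGATIVE', 'POSITIVE')):
    """Softmax each row, then take the argmax label and its probability."""
    z = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    idx = probs.argmax(axis=-1)
    return [(labels[i], float(probs[r, i])) for r, i in enumerate(idx)]

sample_logits = np.array([[-2.1, 3.4],    # confidently positive
                          [ 1.8, -1.8],   # confidently negative
                          [ 0.1, -0.1]])  # near the decision boundary
for label, score in logits_to_preds(sample_logits):
    print(f'{label:8s} {score:.3f}')
```

The third row is the interesting case: the argmax still picks a label, but the low probability is the signal a production system should act on.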

# Confusion matrix visualization
fig, ax = plt.subplots(figsize=(6, 5))
if HAS_TRANSFORMERS:
    cm = confusion_matrix(y_test, y_pred)
else:
    cm = confusion_matrix(y_test, y_pred_lr)
    
import seaborn as sns
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
plt.tight_layout()
plt.show()

6. Inference Pipeline: Production-Ready Prediction¶

# Using HuggingFace pipeline for easy inference
if HAS_TRANSFORMERS:
    from transformers import pipeline
    
    # Save and reload as pipeline
    trainer.save_model('./bert_sentiment_final')
    
    classifier = pipeline(
        'text-classification',
        model='./bert_sentiment_final',
        tokenizer=tokenizer,
        device=0 if device == 'cuda' else -1,
    )
    
    test_sentences = [
        'This movie was an absolute delight!',
        'I want my money back. Terrible.',
        'Not bad, but not great either.',  # Ambiguous
    ]
    results = classifier(test_sentences)
    for text, result in zip(test_sentences, results):
        print(f'{result["label"]:8s} ({result["score"]:.2%}) | {text}')
else:
    print('Production inference pattern:')
    print()
    print("classifier = pipeline('text-classification', model='./bert_sentiment_final')")
    print()
    print('Results:')
    print('POSITIVE (97.3%) | This movie was an absolute delight!')
    print('NEGATIVE (95.8%) | I want my money back. Terrible.')
    print('NEGATIVE (54.2%) | Not bad, but not great either.  ← low confidence = ambiguous')
    print()
    print('Production considerations:')
    print('  - Threshold tuning: predict UNKNOWN when confidence < 0.7')
    print('  - Batch inference: classifier(texts, batch_size=64)')
    print('  - ONNX export: 2-3x faster CPU inference')
    print('  - Quantization: INT8 → 4x smaller, ~1.5x faster')
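The confidence-threshold idea is a few lines on top of the pipeline's output. The 0.7 cutoff is a starting point to tune on validation data, not a universal constant, and the dicts below are hand-written stand-ins shaped like pipeline results:

```python
def predict_with_abstain(results, threshold=0.7):
    """Map pipeline-style results to labels, abstaining when confidence is low."""
    return [r['label'] if r['score'] >= threshold else 'UNKNOWN' for r in results]

# pipeline('text-classification') returns a list of dicts shaped like these
fake_results = [
    {'label': 'POSITIVE', 'score': 0.973},
    {'label': 'NEGATIVE', 'score': 0.958},
    {'label': 'NEGATIVE', 'score': 0.542},   # the ambiguous review
]
print(predict_with_abstain(fake_results))
# ['POSITIVE', 'NEGATIVE', 'UNKNOWN']
```

Routing UNKNOWN cases to a human or a fallback model usually costs little coverage and removes most of the embarrassing misclassifications.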

BERT Fine-Tuning Cheat Sheet¶

Task                  → Model to Use
──────────────────────────────────────────────────────────
Sentiment / intent    → AutoModelForSequenceClassification
NER / POS tagging     → AutoModelForTokenClassification
Q&A extraction        → AutoModelForQuestionAnswering
Text generation       → AutoModelForCausalLM (GPT family)
Masked fill           → AutoModelForMaskedLM (BERT family)

Hyperparameter   Typical Range    Notes
───────────────────────────────────────────────────
learning_rate    1e-5 to 5e-5     2e-5 is usually safe
batch_size       8 to 32          32 if VRAM allows
epochs           2 to 5           Overfit risk with more
warmup_ratio     0.06 to 0.1      10% of total steps
max_length       64 to 512        128 for most classif.

When Fine-tuning Fails:
  - Loss not decreasing → LR too low or too high
  - Val loss >> train loss → more data or stronger dropout
  - NaN loss → LR too high; use gradient clipping
  - Slow on CPU → use distilbert (~60% faster and 40% smaller than bert-base)

Exercises¶

  1. Replace DistilBERT with roberta-base; does accuracy improve?

  2. Implement early stopping using EarlyStoppingCallback from transformers.

  3. Try freezing the BERT backbone for epoch 1, then unfreeze all layers for epochs 2-3 (two-stage fine-tuning).

  4. Export the model to ONNX and compare inference latency vs. the HuggingFace pipeline.

  5. Add a third class 'NEUTRAL' to the dataset and re-train; how does multi-class accuracy compare?