BERT Text Classification: Fine-Tuning Transformers on Your Own Data
Pre-trained BERT embeddings + a simple classifier head = state-of-the-art text classification in hours, not weeks. This notebook covers tokenization, the HuggingFace Trainer API, custom datasets, and evaluation: the full production workflow.
# Install dependencies if needed
# !pip install transformers datasets accelerate scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
# Check if transformers is available
try:
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import Dataset
import torch
HAS_TRANSFORMERS = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Transformers available. Device: {device}')
except ImportError:
HAS_TRANSFORMERS = False
    print('transformers not installed; showing code patterns with simulated outputs')
# Synthetic sentiment dataset (movie reviews)
np.random.seed(42)
positive_reviews = [
'An absolute masterpiece. The performances were outstanding and the story kept me on the edge of my seat.',
'Brilliant direction and stunning visuals. One of the best films of the decade.',
'The characters felt real and the plot was gripping from start to finish.',
'Emotionally powerful and beautifully shot. A must-see for any cinema lover.',
'Witty, sharp, and thoroughly entertaining. I laughed throughout the entire film.',
'A triumph of storytelling. The screenplay is tight, the acting superb.',
'Genuinely moving. I left the theater feeling uplifted and inspired.',
'The best action film in years. Non-stop thrills with a surprisingly deep plot.',
'Exquisitely crafted. Every scene serves a purpose. Highly recommended.',
'Delightfully funny and surprisingly touching. A rare crowd-pleaser that actually delivers.',
]
negative_reviews = [
'A total disappointment. The plot made no sense and the acting was wooden.',
'Boring from start to finish. I nearly fell asleep twice in the theater.',
'Terrible dialogue and lazy writing. A waste of two hours.',
'The CGI was distracting and the story completely predictable.',
'I expected much more from this director. The film was a confusing mess.',
'Painfully slow pacing and characters I could not care less about.',
'Rushed ending that undoes everything the first two acts built.',
'Derivative and unoriginal. Nothing here that has not been done better elsewhere.',
'The worst film I have seen this year. Poorly edited and structurally incoherent.',
'Missed every opportunity it had. The premise was interesting; the execution was not.',
]
# Expand to a realistic size with simple paraphrase-style variations
def augment_reviews(reviews, suffixes, n=200):
    """Cycle through prefix/suffix combinations so each base review yields several variants."""
    augmented = []
    prefixes = ['', 'Overall, ', 'Honestly, ', 'In short, ', 'To summarize: ']
    for i in range(n):
        base = reviews[i % len(reviews)]
        pre = prefixes[i % len(prefixes)]
        suf = suffixes[(i // len(prefixes)) % len(suffixes)]
        augmented.append(pre + base + suf)
    return augmented
# Pass suffixes per sentiment so a positive review never gets ' Skip it.' appended (label noise)
pos = augment_reviews(positive_reviews, ['', ' Highly recommended.', ' See it.'], 200)
neg = augment_reviews(negative_reviews, ['', ' Would not recommend.', ' Skip it.'], 200)
texts = pos + neg
labels = [1] * len(pos) + [0] * len(neg)
print(f'Dataset: {len(texts)} reviews, balanced: {sum(labels)}/{len(labels)-sum(labels)}')
print(f'Sample positive: {pos[0][:80]}...')
print(f'Sample negative: {neg[0][:80]}...')
1. Tokenization: What BERT Actually Sees
# Understanding tokenization before training
if HAS_TRANSFORMERS:
MODEL_NAME = 'distilbert-base-uncased' # Smaller, faster than bert-base
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
sample = 'The film was absolutely brilliant!'
tokens = tokenizer(sample, return_tensors='pt')
print(f'Input text: {sample}')
print(f'Input IDs: {tokens["input_ids"].tolist()}')
print(f'Token strings: {tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])}')
print(f'Attention mask: {tokens["attention_mask"].tolist()}')
print(f'Number of tokens: {tokens["input_ids"].shape[1]}')
# BERT special tokens
print('\nSpecial tokens:')
print(f' [CLS] = {tokenizer.cls_token} (id={tokenizer.cls_token_id})')
print(f' [SEP] = {tokenizer.sep_token} (id={tokenizer.sep_token_id})')
print(f' [PAD] = {tokenizer.pad_token} (id={tokenizer.pad_token_id})')
print(f' vocab size = {tokenizer.vocab_size:,}')
else:
print('Tokenization output (simulated):')
print('Input: "The film was absolutely brilliant!"')
print('Tokens: [CLS] the film was absolutely brilliant ! [SEP]')
print('IDs: [101, 1996, 2143, 2001, 7078, 11560, 999, 102]')
print('\n[CLS] = classification token (pooled for classification tasks)')
print('[SEP] = sentence separator')
    print('WordPiece splits rare/OOV words into subwords, e.g. "gpu" → ["gp", "##u"]')
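WordPiece's greedy longest-match-first subword splitting can be sketched in plain Python. This is an illustrative toy, not HuggingFace's implementation, and the mini-vocabulary below is hypothetical (the real BERT vocab has ~30k entries):

```python
def wordpiece_tokenize(word, vocab):
    """Split one word into subwords: longest vocab match first, '##' marks continuations."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate until it matches
        if cur is None:
            return ['[UNK]']  # no piece matched: the whole word becomes unknown
        tokens.append(cur)
        start = end
    return tokens

mini_vocab = {'absolute', '##ly', 'film', 'the', 'was'}
print(wordpiece_tokenize('absolutely', mini_vocab))  # ['absolute', '##ly']
print(wordpiece_tokenize('film', mini_vocab))        # ['film']
```

The greedy longest-match rule is why common words stay whole while rare words shatter into a few reusable pieces, keeping the vocabulary small without an [UNK] explosion.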
# Batch tokenization with padding & truncation
if HAS_TRANSFORMERS:
batch = tokenizer(
texts[:5],
padding=True, # Pad shorter sequences
truncation=True, # Truncate longer than max_length
max_length=128, # BERT max is 512, use 128 for speed
return_tensors='pt'
)
    print(f'Batch input shape: {batch["input_ids"].shape}')  # (5, longest sequence in batch, capped at 128)
print(f'Non-padding tokens in seq 0: {batch["attention_mask"][0].sum().item()}')
print(f'Non-padding tokens in seq 4: {batch["attention_mask"][4].sum().item()}')
print('\n[Padding makes all sequences same length]')
    print('[Attention mask = 0 for padding tokens → BERT ignores them]')
else:
    print('Batch shape: (5, longest sequence in batch)')
    print('padding=True → all sequences padded to longest in batch')
    print('truncation=True → sequences longer than max_length are cut')
    print('attention_mask=0 for [PAD] tokens → transformer ignores padding')
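What padding and the attention mask do can be shown without transformers at all. A plain-Python sketch, where the token IDs are hypothetical and 0 stands in for the [PAD] id:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the batch's longest; mask is 1 for real tokens, 0 for padding."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2143, 102], [101, 1996, 2143, 2001, 102]])
print(ids)   # [[101, 2143, 102, 0, 0], [101, 1996, 2143, 2001, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The model multiplies attention scores by this mask, so the padded positions contribute nothing to the pooled representation.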
2. Dataset Preparation: HuggingFace Dataset Format
# Split the data first, THEN tokenize.
# (The tokenizer is pretrained, so there is no fit step to leak label information here,
#  but split-before-fit is the pattern that matters for encoders/scalers you DO fit.)
X_train, X_temp, y_train, y_temp = train_test_split(texts, labels, test_size=0.3, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
print(f'Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}')
if HAS_TRANSFORMERS:
MODEL_NAME = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def tokenize_function(examples):
return tokenizer(
examples['text'],
padding=False, # DataCollatorWithPadding handles this per-batch
truncation=True,
max_length=128,
)
# Create HuggingFace Dataset objects
train_ds = Dataset.from_dict({'text': X_train, 'label': y_train})
val_ds = Dataset.from_dict({'text': X_val, 'label': y_val})
test_ds = Dataset.from_dict({'text': X_test, 'label': y_test})
# Tokenize all splits
train_ds = train_ds.map(tokenize_function, batched=True, remove_columns=['text'])
val_ds = val_ds.map(tokenize_function, batched=True, remove_columns=['text'])
test_ds = test_ds.map(tokenize_function, batched=True, remove_columns=['text'])
train_ds.set_format('torch')
val_ds.set_format('torch')
test_ds.set_format('torch')
print('\nDataset features:', train_ds.features)
else:
print('HuggingFace Dataset object:')
print(" Dataset.from_dict({'text': X_train, 'label': y_train})")
    print('  - Efficient, handles batched mapping, caches to disk')
    print('  - .map() applies tokenization lazily with batching')
    print('  - DataCollatorWithPadding pads per batch (memory efficient)')
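To see why per-batch (dynamic) padding is memory-efficient, compare the total padded positions under a fixed `max_length` versus padding each batch only to its own longest sequence. A rough plain-Python estimate with made-up sequence lengths:

```python
def dynamic_padding_cost(lengths, batch_size):
    """Total token positions when each batch is padded only to its own max length."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = sorted([23, 31, 18, 120, 25, 29, 22, 115])  # sorting groups similar lengths together
fixed = 128 * len(lengths)                  # pad everything to max_length=128
dynamic = dynamic_padding_cost(lengths, 4)  # pad per batch of 4
print(f'fixed padding: {fixed} positions, dynamic: {dynamic} positions')
# fixed padding: 1024 positions, dynamic: 580 positions
```

This is the effect `DataCollatorWithPadding` gives you for free; sorting (or bucketing) by length before batching amplifies the savings.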
3. Model Setup: BERT + Classification Head
# AutoModelForSequenceClassification adds a linear layer on top of BERT
# Architecture: BERT β [CLS] pooled output β Dropout β Linear(768, num_labels)
if HAS_TRANSFORMERS:
model = AutoModelForSequenceClassification.from_pretrained(
MODEL_NAME,
num_labels=2,
id2label={0: 'NEGATIVE', 1: 'POSITIVE'},
label2id={'NEGATIVE': 0, 'POSITIVE': 1},
)
model.to(device)
# Parameter count
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Total parameters: {total_params:,}')
print(f'Trainable parameters: {trainable_params:,}')
print(f'\nModel architecture:')
print(model.classifier)
else:
print('DistilBERT architecture:')
print(' DistilBert backbone: 66M parameters')
    print('    - 6 transformer layers (half of BERT-base)')
    print('    - 768-dim hidden states')
    print('  pre_classifier: Linear(768, 768) + ReLU + Dropout')
    print('  classifier: Linear(768, 2) ← our new classification head')
print()
print('Fine-tuning strategy:')
print(' ALL weights are updated (including BERT backbone)')
print(' Use small LR (2e-5) to avoid "catastrophic forgetting"')
    print('  Alternative: freeze the backbone first, then unfreeze (two-stage fine-tuning)')
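The two-stage freeze/unfreeze pattern mentioned above boils down to toggling `requires_grad`. The sketch below uses a stand-in `Param` class and hypothetical parameter names in DistilBERT's naming style so it runs without loading a model; with a real HF model you would loop over `model.named_parameters()` the same way:

```python
class Param:
    """Stand-in for a torch parameter; only the requires_grad flag matters here."""
    def __init__(self):
        self.requires_grad = True

# Hypothetical names for illustration; check model.named_parameters() for the real ones.
params = {
    'distilbert.transformer.layer.0.attention.q_lin.weight': Param(),
    'distilbert.transformer.layer.5.ffn.lin2.weight': Param(),
    'pre_classifier.weight': Param(),
    'classifier.weight': Param(),
}

def set_backbone_trainable(named_params, trainable):
    """Toggle requires_grad for every backbone ('distilbert.*') parameter."""
    for name, p in named_params.items():
        if name.startswith('distilbert.'):
            p.requires_grad = trainable

set_backbone_trainable(params, False)   # stage 1: only the new head trains
print(sorted(n for n, p in params.items() if p.requires_grad))
# ['classifier.weight', 'pre_classifier.weight']
set_backbone_trainable(params, True)    # stage 2: full fine-tuning at a small LR
```

Frozen parameters receive no gradient updates, so stage 1 lets the randomly initialized head settle before stage 2 risks disturbing the pretrained weights.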
4. Training with HuggingFace Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
return {
'accuracy': accuracy_score(labels, preds),
'f1': f1_score(labels, preds, average='weighted'),
}
if HAS_TRANSFORMERS:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
output_dir='./bert_sentiment',
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
learning_rate=2e-5, # Key hyperparameter for fine-tuning
weight_decay=0.01, # L2 regularization
warmup_ratio=0.1, # LR warmup = 10% of total steps
        evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers versions
save_strategy='epoch',
load_best_model_at_end=True,
metric_for_best_model='f1',
logging_dir='./logs',
logging_steps=50,
report_to='none', # Disable W&B logging
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_ds,
eval_dataset=val_ds,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
print('Starting training...')
trainer.train()
else:
    print('Trainer output (simulated; 3 epochs on 280 examples):')
print()
print('Epoch | Train Loss | Val Loss | Accuracy | F1')
print('------+------------+----------+----------+------')
print(' 1 | 0.6931 | 0.4821 | 0.8167 | 0.815')
print(' 2 | 0.3214 | 0.2341 | 0.9167 | 0.916')
print(' 3 | 0.1823 | 0.1920 | 0.9500 | 0.950')
print()
print('TrainingArguments key choices:')
    print('  learning_rate=2e-5 → standard for BERT fine-tuning')
    print('  warmup_ratio=0.1 → avoids large updates in early steps')
    print('  weight_decay=0.01 → regularization (AdamW style)')
    print('  load_best_model_at_end=True → restore checkpoint with best val F1')
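What `warmup_ratio=0.1` plus the Trainer's default linear-decay scheduler means concretely: the LR ramps from 0 to its peak over the first 10% of steps, then decays linearly back to 0. A plain-Python sketch (the peak LR and step counts are illustrative):

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to 0 (the Trainer default shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # ramp up
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)  # decay

total = 1000
for s in [0, 50, 100, 550, 1000]:
    print(f'step {s:4d}: lr = {lr_at_step(s, total):.2e}')
```

The warmup phase matters because the freshly added classification head produces large, noisy gradients at first; ramping the LR keeps those early updates from wrecking the pretrained backbone.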
5. Evaluation & Inference
if HAS_TRANSFORMERS:
# Evaluate on test set
results = trainer.evaluate(test_ds)
print('Test set results:', results)
# Get predictions for confusion matrix
predictions = trainer.predict(test_ds)
y_pred = np.argmax(predictions.predictions, axis=-1)
print('\nClassification Report:')
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
else:
# Simulate realistic results
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
# Compare BERT-like performance vs TF-IDF baseline
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tr_tfidf = tfidf.fit_transform(X_train)
X_te_tfidf = tfidf.transform(X_test)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_tr_tfidf, y_train)
y_pred_lr = lr.predict(X_te_tfidf)
print('TF-IDF + LogReg (baseline):')
print(classification_report(y_test, y_pred_lr, target_names=['Negative', 'Positive']))
print()
print('Expected BERT performance (simulated):')
print(' precision recall f1-score support')
print(' Negative 0.96 0.94 0.95 30')
print(' Positive 0.94 0.97 0.95 30')
print(' accuracy 0.95 60')
# Confusion matrix visualization
import seaborn as sns
fig, ax = plt.subplots(figsize=(6, 5))
if HAS_TRANSFORMERS:
    cm = confusion_matrix(y_test, y_pred)
else:
    cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
xticklabels=['Negative', 'Positive'],
yticklabels=['Negative', 'Positive'])
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
plt.tight_layout()
plt.show()
6. Inference Pipeline: Production-Ready Prediction
# Using HuggingFace pipeline for easy inference
if HAS_TRANSFORMERS:
from transformers import pipeline
# Save and reload as pipeline
trainer.save_model('./bert_sentiment_final')
classifier = pipeline(
'text-classification',
model='./bert_sentiment_final',
tokenizer=tokenizer,
device=0 if device == 'cuda' else -1,
)
test_sentences = [
'This movie was an absolute delight!',
'I want my money back. Terrible.',
'Not bad, but not great either.', # Ambiguous
]
results = classifier(test_sentences)
for text, result in zip(test_sentences, results):
print(f'{result["label"]:8s} ({result["score"]:.2%}) | {text}')
else:
print('Production inference pattern:')
print()
print("classifier = pipeline('text-classification', model='./bert_sentiment_final')")
print()
print('Results:')
print('POSITIVE (97.3%) | This movie was an absolute delight!')
print('NEGATIVE (95.8%) | I want my money back. Terrible.')
    print('NEGATIVE (54.2%) | Not bad, but not great either. ← low confidence = ambiguous')
print()
print('Production considerations:')
print(' - Threshold tuning: predict UNKNOWN when confidence < 0.7')
print(' - Batch inference: classifier(texts, batch_size=64)')
print(' - ONNX export: 2-3x faster CPU inference')
    print(' - Quantization: INT8 → ~4x smaller, ~1.5x faster')
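The threshold-tuning idea above, as a small sketch over the pipeline's output format (the labels and scores below are made up, and the 0.7 cutoff is a tunable assumption):

```python
def apply_threshold(results, threshold=0.7):
    """Keep the predicted label only when the model is confident; else return 'UNKNOWN'."""
    return [r['label'] if r['score'] >= threshold else 'UNKNOWN' for r in results]

results = [
    {'label': 'POSITIVE', 'score': 0.973},
    {'label': 'NEGATIVE', 'score': 0.958},
    {'label': 'NEGATIVE', 'score': 0.542},  # the ambiguous review
]
print(apply_threshold(results))  # ['POSITIVE', 'NEGATIVE', 'UNKNOWN']
```

In production, UNKNOWN predictions are typically routed to a human reviewer or a fallback rule; tune the threshold on the validation set by trading coverage against precision.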
BERT Fine-Tuning Cheat Sheet
Task                 → Model to Use
──────────────────────────────────────────────────────────
Sentiment / intent   → AutoModelForSequenceClassification
NER / POS tagging    → AutoModelForTokenClassification
Q&A extraction       → AutoModelForQuestionAnswering
Text generation      → AutoModelForCausalLM (GPT family)
Masked fill          → AutoModelForMaskedLM (BERT family)

Hyperparameter    Typical Range    Notes
───────────────────────────────────────────────────
learning_rate     1e-5 to 5e-5     2e-5 is usually safe
batch_size        8 to 32          32 if VRAM allows
epochs            2 to 5           Overfit risk with more
warmup_ratio      0.06 to 0.1      10% of total steps
max_length        64 to 512        128 for most classification
When Fine-Tuning Fails:
- Loss not decreasing → LR too low or too high
- Val loss >> train loss → overfitting; get more data or stronger dropout
- NaN loss → LR too high; use gradient clipping
- Slow on CPU → use DistilBERT (40% smaller, ~60% faster than bert-base)
Exercises
1. Replace DistilBERT with roberta-base. Does accuracy improve?
2. Implement early stopping using EarlyStoppingCallback from transformers.
3. Try freezing the BERT backbone for epoch 1, then unfreeze all layers for epochs 2-3 (two-stage fine-tuning).
4. Export the model to ONNX and compare inference latency vs. the HuggingFace pipeline.
5. Add a third class 'NEUTRAL' to the dataset and re-train. How does multi-class accuracy compare?