Text Preprocessing: From Raw Text to Features¶

NLP starts with cleaning and representing text. This notebook covers the full pipeline: normalization, tokenization, stop words, stemming vs lemmatization, TF-IDF, and spaCy — with benchmarks showing what actually helps model performance.

# !pip install spacy scikit-learn nltk
# !python -m spacy download en_core_web_sm

import re
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

# Sample text corpus (news headlines)
texts = [
    'The Federal Reserve raises interest rates by 25 basis points amid inflation concerns.',
    'Apple Inc. reports record quarterly earnings, beating analyst expectations.',
    'Scientists discover new exoplanet in habitable zone of nearby star system.',
    'Congress passes bipartisan infrastructure bill after months of negotiations.',
    'Tesla shares surge 8% following better-than-expected delivery numbers.',
    'WHO warns of new COVID-19 variant spreading rapidly across Europe.',
    'Amazon acquires MGM Studios for $8.45 billion in landmark deal.',
    'Climate summit ends with agreement to reduce carbon emissions by 45% by 2030.',
    'SpaceX successfully launches 60 Starlink satellites into low Earth orbit.',
    'Pfizer announces new drug trial showing 90% efficacy against drug-resistant bacteria.',
]

print(f'Corpus: {len(texts)} documents')
print(f'Avg length: {np.mean([len(t.split()) for t in texts]):.1f} words')

1. Text Normalization — The Foundation¶

def normalize_text(text: str, 
                   lowercase: bool = True,
                   remove_punctuation: bool = True,
                   remove_numbers: bool = False,
                   remove_extra_spaces: bool = True) -> str:
    """Basic text normalization pipeline."""
    if lowercase:
        text = text.lower()
    
    # Expand common contractions. Keys are lowercase, so this step assumes
    # lowercasing ran first; specific forms ("won't", "can't") must come
    # before the generic "n't" since dicts preserve insertion order.
    contractions = {
        "won't": 'will not', "can't": 'cannot', "n't": ' not',
        "'re": ' are', "'ve": ' have', "'ll": ' will', "'d": ' would',
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    
    if remove_numbers:
        text = re.sub(r'\d+\.?\d*', '', text)
    
    if remove_punctuation:
        text = text.translate(str.maketrans('', '', string.punctuation))
    
    if remove_extra_spaces:
        text = re.sub(r'\s+', ' ', text).strip()
    
    return text

raw = "The Fed Won't raise rates by 25 basis-points... it's a big deal!"
print(f'Raw:          {raw}')
print(f'Normalized:   {normalize_text(raw)}')
print(f'No numbers:   {normalize_text(raw, remove_numbers=True)}')
print()

# Show effect on corpus
normalized_texts = [normalize_text(t) for t in texts]
print('Before vs after normalization:')
for orig, norm in zip(texts[:3], normalized_texts[:3]):
    print(f'  IN : {orig[:70]}')
    print(f'  OUT: {norm[:70]}')

2. Tokenization — Splitting Text into Units¶

# Word tokenization: different strategies
text = 'It\'s a state-of-the-art, $1.5B deal — a game-changer for the industry!'

# 1. Whitespace split (naive)
print('1. Whitespace split:')
print(' ', text.split())

# 2. regex word tokenizer (better)
print('\n2. Regex word tokenizer:')
word_tokens = re.findall(r"\b[a-zA-Z']+\b", text)
print(' ', word_tokens)

# 3. Keeping numbers and hyphens
print('\n3. Alphanumeric + hyphens:')
full_tokens = re.findall(r"[a-zA-Z0-9$][a-zA-Z0-9.,'%-]*", text)
print(' ', full_tokens)

# N-gram tokenization
print('\n4. Bigrams from tokenized text:')
tokens = normalize_text(text).split()
bigrams = [f'{tokens[i]}_{tokens[i+1]}' for i in range(len(tokens)-1)]
print(' ', bigrams[:8])

# Subword tokenization (BPE preview)
print('\n5. Subword intuition (BPE/WordPiece):')
examples = ['unhappiness', 'rerunning', 'supercalifragilistic']
for word in examples:
    # Naive demonstration of breaking unknown words
    if len(word) > 8:
        parts = [word[:4], '##' + word[4:8], '##' + word[8:]]
    else:
        parts = [word[:4], '##' + word[4:]]
    print(f'  {word} → {parts}')
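The slicing above is only a visual stand-in. Real BPE learns its merges from data: count adjacent symbol pairs across a frequency-weighted vocabulary, merge the most frequent pair everywhere, and repeat. A minimal sketch on a toy vocabulary (the words and frequencies below are made up for illustration):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word so the pair becomes one symbol."""
    old, new = ' '.join(pair), ''.join(pair)
    # str.replace suffices for this toy vocab; real implementations use
    # boundary-aware regexes to avoid accidental partial matches
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words pre-split into characters, with toy corpus frequencies
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for step in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f'merge {step + 1}: {best} -> {"".join(best)}')
print(vocab)
```

On this vocabulary the first three learned merges are 'e'+'s', 'es'+'t', then 'l'+'o' — each merge grows the subword inventory by one unit, and units like 'est' generalize to unseen words.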

3. Stop Words & Stemming vs Lemmatization¶

# Stop words: high-frequency words that add little meaning
STOP_WORDS = {
    'a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'shall',
    'should', 'may', 'might', 'must', 'can', 'could', 'to', 'of', 'in',
    'for', 'on', 'with', 'at', 'by', 'from', 'as', 'into', 'through',
    'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
    'we', 'they', 'and', 'or', 'but', 'not', 'so', 'yet', 'both',
}

def remove_stopwords(tokens: list, stop_words: set = STOP_WORDS) -> list:
    return [t for t in tokens if t.lower() not in stop_words]

# Manual stemmer (simplified Porter algorithm intuition)
def naive_stem(word: str) -> str:
    """Very simplified stemming for demonstration."""
    suffixes = ['ing', 'tion', 'ness', 'ment', 'ed', 'er', 'ly', 'ies', 'es', 's']
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Compare stemming vs lemmatization
word_examples = [
    ('running', 'run'),      # stem=run, lemma=run
    ('better', 'good'),      # stem=bett, lemma=good ← lemmatizer wins
    ('caring', 'care'),      # stem=car, lemma=care
    ('studies', 'study'),    # stem=stud, lemma=study
    ('wolves', 'wolf'),      # stem=wolv, lemma=wolf ← lemmatizer wins
]

print(f'{"Word":<12} {"Naive Stem":<15} {"Correct Lemma":<15}')
print('-' * 42)
for word, lemma in word_examples:
    stem = naive_stem(word)
    print(f'{word:<12} {stem:<15} {lemma:<15}')

print()
print('Stemming:     Fast, rule-based, sometimes creates non-words ("stud", "wolv")')
print('Lemmatization: Slower, dictionary-based, always returns real words')
print('When to use stemming: high-volume IR tasks, speed > accuracy')
print('When to use lemma:   NLP tasks where word meaning matters')
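The naive stemmer above is only for intuition. NLTK (installed in the pip cell at the top) ships the real Porter stemmer — rule-based, pure Python, and needing no `nltk.download()` data — and it too emits non-words like "studi" by design:

```python
# Real Porter stemmer from NLTK; no corpus downloads required
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['running', 'better', 'caring', 'studies', 'wolves']:
    print(f'{word:<10} -> {stemmer.stem(word)}')
```

For real lemmas, spaCy's `token.lemma_` (section 5) remains the more accurate choice.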

# Demonstrate stop word removal effect
sample = 'the apple reports record earnings beating analyst expectations'
tokens = sample.split()
filtered = remove_stopwords(tokens)
print(f'\nBefore stop words: {tokens}')
print(f'After stop words:  {filtered}')

4. TF-IDF — The Workhorse of Text Features¶

# TF-IDF: Term Frequency × Inverse Document Frequency
# TF(t, d) = count(t in d) / total_terms(d)
# IDF(t)   = log(N / df(t))   where df(t) = docs containing t
# High TF-IDF = word appears often in THIS doc but rarely across corpus

def tfidf_from_scratch(docs: list) -> tuple:
    """Manual TF-IDF for understanding the math."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted(set(w for doc in tokenized for w in doc))
    word2idx = {w: i for i, w in enumerate(vocab)}
    N = len(docs)
    
    # Document frequency
    df = Counter(w for doc in tokenized for w in set(doc))
    
    tfidf_matrix = np.zeros((N, len(vocab)))
    for d_idx, doc in enumerate(tokenized):
        tf = Counter(doc)
        total = len(doc)
        for word, count in tf.items():
            if word in word2idx:
                tf_val = count / total
                idf_val = np.log(N / df[word])
                tfidf_matrix[d_idx, word2idx[word]] = tf_val * idf_val
    
    return tfidf_matrix, vocab

mini_corpus = [
    'apple releases new iphone model',
    'google announces new pixel phone model',
    'apple acquires music streaming company',
    'google enters electric vehicle market',
]

matrix, vocab = tfidf_from_scratch(mini_corpus)

# Show top TF-IDF words per document
for i, doc in enumerate(mini_corpus):
    top_words = sorted(zip(matrix[i], vocab), reverse=True)[:3]
    top_words = [(w, f'{s:.3f}') for s, w in top_words if s > 0]
    print(f'Doc {i+1}: {top_words}  | "{doc}"')

print()
print('"new" and "model" appear in 2+ docs → lower IDF → lower TF-IDF')
print('"iphone", "pixel", "streaming" appear once → higher IDF → higher TF-IDF')
# sklearn TF-IDF: production-ready with all the knobs
# Larger synthetic classification dataset
np.random.seed(42)

tech_news = [
    'AI startup raises $200M in Series B funding led by Sequoia',
    'Machine learning model predicts drug interactions with 94% accuracy',
    'OpenAI releases GPT-5 with improved reasoning capabilities',
    'Nvidia GPU shortages continue amid AI chip demand surge',
    'Python surpasses JavaScript as most popular programming language',
    'Quantum computing breakthrough achieves 1000-qubit processor',
    'Autonomous vehicle startup achieves full Level 4 certification',
    'Cloud computing revenue grows 35% year-over-year for major providers',
]
sport_news = [
    'Manchester United signs star striker for record transfer fee',
    'NBA playoffs: Warriors defeat Lakers in overtime thriller',
    'Serena Williams announces retirement from professional tennis',
    'Formula 1 champion breaks lap record at Monaco Grand Prix',
    'Olympics committee adds breakdancing to 2024 Paris Games lineup',
    'World Cup qualifying: Brazil defeats Argentina in South American clash',
    'Golf: Tiger Woods returns to competitive play after injury',
    'Cycling team banned following doping investigation',
]

corpus = tech_news * 5 + sport_news * 5  # 80 docs
labels = [1] * 40 + [0] * 40            # 1=tech, 0=sport

# Compare different TF-IDF configurations
configs = {
    'Unigrams':          TfidfVectorizer(ngram_range=(1,1), max_features=500),
    'Bigrams':           TfidfVectorizer(ngram_range=(1,2), max_features=500),
    'No stop words':     TfidfVectorizer(ngram_range=(1,1), stop_words='english', max_features=500),
    'Sublinear TF':      TfidfVectorizer(ngram_range=(1,1), sublinear_tf=True, max_features=500),
    'Char n-grams':      TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5), max_features=500),
}

print(f'{"Config":<20} {"Accuracy":<12} {"Dimensions"}')
print('-' * 45)
for name, vec in configs.items():
    X = vec.fit_transform(corpus)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f'{name:<20} {scores.mean():.3f} ± {scores.std():.3f}  {X.shape[1]}')

print('\nsublinear_tf=True: uses 1 + log(TF) instead of raw TF → handles high-frequency words better')
print('char n-grams: captures morphology → robust to typos and rare words')

5. spaCy NLP Pipeline — Industrial-Strength Processing¶

try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
    HAS_SPACY = True
except (ImportError, OSError):
    HAS_SPACY = False
    print('spaCy not available — showing expected output')

text = 'Apple Inc. CEO Tim Cook announced that the company will invest $430 billion in the US over the next five years.'

if HAS_SPACY:
    doc = nlp(text)
    
    print('--- Tokens + POS + Dependency ---')
    for token in doc:
        if not token.is_space:
            print(f'{token.text:<15} {token.pos_:<8} {token.dep_:<12} {token.lemma_}')
    
    print('\n--- Named Entities ---')
    for ent in doc.ents:
        print(f'{ent.text:<25} → {ent.label_} ({spacy.explain(ent.label_)})')
    
    print('\n--- Sentences ---')
    for sent in doc.sents:
        print(f'  {sent.text}')
else:
    print('Expected spaCy output:')
    print()
    print('Named Entities:')
    print('  Apple Inc.          → ORG  (Companies, agencies, institutions)')
    print('  Tim Cook            → PERSON (People, including fictional)')
    print('  $430 billion        → MONEY  (Monetary values)')
    print('  US                  → GPE  (Countries, cities, states)')
    print('  next five years     → DATE  (Absolute or relative dates)')
    print()
    print('Tokens (sample):')
    print('  Apple    PROPN  nsubj   Apple')
    print('  Inc.     PROPN  flat    inc.')
    print('  CEO      NOUN   appos   ceo')
    print('  invest   VERB   ccomp   invest')
    print()
    print('spaCy pipeline components: tokenizer → tagger → parser → ner → lemmatizer')
# Word frequency analysis with visualization
all_tokens = []
for text in texts:
    tokens = normalize_text(text).split()
    tokens = remove_stopwords(tokens)
    all_tokens.extend(tokens)

word_freq = Counter(all_tokens)
top_words = word_freq.most_common(20)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
words, counts = zip(*top_words)
ax1.barh(words[::-1], counts[::-1], color='steelblue', alpha=0.8)
ax1.set_title('Top 20 Words After Normalization')
ax1.set_xlabel('Count')

# Zipf's law: word frequency follows power law
all_counts = sorted(word_freq.values(), reverse=True)
ranks = np.arange(1, len(all_counts) + 1)

ax2.loglog(ranks, all_counts, 'b.', alpha=0.5, markersize=4)
ax2.loglog(ranks, all_counts[0] / ranks, 'r-', label="Zipf's law: f ∝ 1/rank", linewidth=2)
ax2.set_xlabel('Word Rank')
ax2.set_ylabel('Frequency')
ax2.set_title("Zipf's Law in Text Corpora")
ax2.legend()

plt.tight_layout()
plt.show()

print("Zipf's Law: the most frequent word appears ~2x as often as the 2nd most frequent,")
print('~3x as often as the 3rd, etc. → Universal property of natural language.')
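To put a number on the fit, the exponent s in f(r) ∝ r^(-s) can be estimated with a least-squares line in log-log space. On a 10-headline corpus the estimate would be far too noisy to mean much, so this sketch uses synthetic counts that follow Zipf's law exactly, where the recovered exponent is 1 by construction:

```python
import numpy as np

# Synthetic ideal-Zipf counts: f(r) = C / r with C = 1000
ranks = np.arange(1, 201)
counts = 1000.0 / ranks

# log f = log C - s * log r, so the slope of the log-log fit is -s
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f'estimated exponent s = {-slope:.2f}')  # 1.00 on ideal data
```

Real corpora typically give estimates near s = 1 once the corpus is large enough, with some drift at the very-high-frequency and very-rare ends.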

Text Preprocessing Cheat Sheet¶

Task                    → Method
──────────────────────────────────────────────────
Basic cleaning          → lowercase, remove punct, strip whitespace
Tokenization            → re.findall(r'\b\w+\b', text)
Stop word removal       → filter against a stop word set
Stemming                → nltk.stem.PorterStemmer (fast, approximate)
Lemmatization           → spaCy token.lemma_ (slower, accurate)
Named Entity Extraction → spaCy doc.ents
Text features           → TfidfVectorizer (sparse), embeddings (dense)
Char-level features     → TfidfVectorizer(analyzer='char_wb')
Modern NLP              → HuggingFace AutoTokenizer + transformer model

When to skip preprocessing:
  - Transformer models (BERT, GPT): Don't lowercase or remove punct
    → The tokenizer handles it, and casing/punct carry meaning
  - Traditional ML (TF-IDF + LR): Do normalize aggressively
    → Reduces vocab size and helps generalization

Exercises¶

  1. Load the 20 Newsgroups dataset (sklearn.datasets.fetch_20newsgroups) and build a TF-IDF + LogReg classifier.

  2. Compare model accuracy with and without stop word removal on the 20 Newsgroups data.

  3. Write a spaCy custom pipeline component that counts noun phrase frequency in a document.

  4. Implement a simple spell checker using character n-gram similarity.

  5. Use CountVectorizer with vocabulary parameter to create a domain-specific feature space from a predefined word list.