
import pandas as pd

Working with Text Data

Text manipulation is a critical skill for data cleaning and preprocessing. Pandas provides the .str accessor on Series objects, which exposes vectorized string methods that mirror Python’s built-in string methods but operate on entire columns at once. This notebook demonstrates .str.lower(), .str.title(), and .str.len(); .str.replace() for removing currency symbols; .str.strip() for whitespace cleanup; applying string transformations to index labels and column headers; and chaining string operations such as .str.split().str.get() to extract substrings.

Why this matters: Raw datasets frequently contain inconsistent formatting, such as mixed case, leading/trailing spaces, embedded currency symbols, and concatenated fields that need splitting. Cleaning text data with vectorized .str methods is more concise and readable, and typically faster, than applying Python string methods row by row in a loop. The .str methods also pass missing values through instead of raising errors.
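As a small illustration of the point above, the sketch below cleans a made-up Series (the values are hypothetical, not taken from chicago.csv) once with a Python loop and once with the .str accessor. It assumes only that pandas is installed.

```python
import pandas as pd

# Hypothetical sample data, not taken from chicago.csv
names = pd.Series(["  AARON, ELVIA J ", "abbott, betty l", None])

# Row-by-row approach: missing values must be handled by hand
looped = pd.Series(
    [n.strip().title() if isinstance(n, str) else n for n in names]
)

# Vectorized approach: .str methods pass missing values through
vectorized = names.str.strip().str.title()

print(vectorized.tolist())
```

Both approaches give the same cleaned strings. The .str version needs no explicit check for the missing entry.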

chicago = pd.read_csv("chicago.csv")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

chicago.info()


"HELLO WORLD".lower()

chicago["Name"].str.lower()

chicago["Name"].str.title()

chicago["Department"].str.len()

"Hello World".replace("l","t")

chicago.head()

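# Remove the literal "$" before converting to float; regex=False keeps "$" from being read as a regex anchor
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].str.replace("$", "", regex=False).astype(float)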
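One detail worth spelling out: "$" is also the regular-expression end-of-string anchor, and the default value of the regex parameter of .str.replace() has changed between pandas versions. The sketch below, on made-up salary strings (hypothetical values), passes regex=False so the pattern is treated as a literal character regardless of version.

```python
import pandas as pd

# Hypothetical salary strings in a "$<amount>" shape
salaries = pd.Series(["$90744.00", "$84450.00"])

# regex=False makes "$" a literal character rather than a regex anchor
cleaned = salaries.str.replace("$", "", regex=False).astype(float)

print(cleaned)
```

After the conversion the column is numeric, so aggregations such as .sum() or .mean() work on it directly.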

chicago

chicago["Name"].str.strip()

chicago = pd.read_csv("chicago.csv", index_col="Name").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

chicago.index.nunique()

chicago.index = chicago.index.str.strip().str.title()

chicago.columns = chicago.columns.str.title()

chicago.head()
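The .str accessor is available on Index objects as well as Series, which is why the same methods work on row labels and column headers. A minimal sketch with a hypothetical two-row frame (labels and values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with messy labels; not the real chicago.csv contents
df = pd.DataFrame(
    {"position title": ["WATER RATE TAKER", "POLICE OFFICER"]},
    index=["  AARON, ELVIA J", "AARON, JEFFERY M  "],
)

# Index objects expose the same .str accessor as Series
df.index = df.index.str.strip().str.title()
df.columns = df.columns.str.title()

print(df.index.tolist())
print(df.columns.tolist())
```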

chicago = pd.read_csv("chicago.csv").dropna(how="all")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head()

chicago["Name"].str.split(",").str.get(0).str.title().value_counts()
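To see what each step of such a chain produces, the sketch below breaks it apart on a few hypothetical "LAST, FIRST" names (made-up values). .str.split(",") yields a Series of lists, and .str.get(0) picks the first element of each list.

```python
import pandas as pd

# Hypothetical names in "LAST, FIRST" form
names = pd.Series(["AARON, ELVIA J", "AARON, JEFFERY M", "ABAD JR, VICENTE M"])

# Step 1: each element becomes a list of the pieces around the comma
parts = names.str.split(",")

# Step 2: take position 0 (the last name) from every list, then title-case it
last_names = parts.str.get(0).str.title()

# Step 3: count how often each last name occurs
counts = last_names.value_counts()

print(counts)
```

Note that the piece after the comma keeps its leading space, so extracting first names this way would also need a .str.strip().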