Manipulating text in DataFrames
Pandas calls this manipulating textual data. Text is one of the most common data types you will encounter in datasets, alongside integers, floats, and booleans. Being able to process and manipulate text is useful when you need to normalize the values in cells.
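As a minimal sketch of what normalizing cell values can look like, the snippet below chains two `.str` methods on a small Series. The sample values are hypothetical and are not taken from the wine-ratings dataset.

```python
import pandas as pd

# Hypothetical sample values with inconsistent whitespace and casing
raw = pd.Series(["  Red Wine", "red wine ", "WHITE WINE", None])

# Chain .str methods: trim surrounding whitespace, then apply title case.
# Missing values are passed through as missing rather than raising an error.
clean = raw.str.strip().str.title()
print(clean)
```

The `.str` accessor applies each string method element by element, so the same chain works on a DataFrame column such as `df["variety"]`.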
# Load your dataframe
import pandas as pd
csv_url = "https://raw.githubusercontent.com/paiml/wine-ratings/main/wine-ratings.csv"
df = pd.read_csv(csv_url, index_col=0)
df.head()
# Shorten the variety to R for red or W for white; other values are left unchanged
df["variety_short"] = df["variety"].replace({"Red Wine": "R", "White Wine": "W"})
df.head()
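`Series.replace` with a dictionary matches whole cell values, while `Series.str.replace` substitutes inside each string. The sketch below contrasts the two on a few hypothetical values (the "Sparkling Red Wine" entry is made up for illustration and may not exist in the dataset).

```python
import pandas as pd

# Hypothetical values; the third one only contains "Red Wine" as a substring
varieties = pd.Series(["Red Wine", "White Wine", "Sparkling Red Wine"])

# Whole-value replacement: only cells equal to a dictionary key change
whole = varieties.replace({"Red Wine": "R", "White Wine": "W"})

# Substring replacement: every occurrence inside each string changes
partial = varieties.str.replace("Red Wine", "R", regex=False)

print(whole.tolist())
print(partial.tolist())
```

Whole-value replacement is usually the safer choice for mapping categories to codes, since it cannot alter cells that merely contain the search text.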
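# Split the region on whitespace and keep only the last word
# (this assumes the part you want is a single word at the end of the string)
# Note: assigning to a new column keeps the original "region" intact;
# assigning back to df["region"] would overwrite it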
df["region_short"] = df["region"].str.split().str.get(-1)
df.query("region_short != 'California'").head()
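Splitting on whitespace keeps only the final word, which can truncate multi-word names. If the region strings are comma separated (an assumption here, as are the sample values below), splitting on the comma and stripping whitespace may be a closer fit. A minimal sketch:

```python
import pandas as pd

# Hypothetical region strings, chosen to show multi-word names
regions = pd.Series([
    "Napa Valley, California",
    "Marlborough, New Zealand",
    "South Africa",
])

# Whitespace split: keeps only the final word of each string
last_word = regions.str.split().str.get(-1)

# Comma split: keeps the final comma-separated part, then trims the space
last_part = regions.str.split(",").str.get(-1).str.strip()

print(last_word.tolist())
print(last_part.tolist())
```

Which variant is appropriate depends on how the column is actually formatted, so it is worth inspecting a sample with `df["region"].unique()` before choosing.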