Run this notebook: Open in Colab Open in Kaggle

Data Cleaning in Pandas¶

Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset. It is often the most time-consuming phase of a data science project, but it directly determines the quality of any downstream analysis or model.

This notebook demonstrates a practical cleaning workflow on a customer call list loaded from Excel: removing duplicate rows with drop_duplicates(), dropping irrelevant columns, stripping unwanted characters from text fields with .str.strip(), standardizing phone numbers with .str.replace(), splitting compound address fields into separate columns with .str.split(expand=True), normalizing categorical values, filling missing data with .fillna(), conditionally dropping rows based on business rules, and resetting the index for a clean final output.

import pandas as pd

df = pd.read_excel(r"C:\Users\alexf\OneDrive\Documents\Pandas Tutorial\Customer Call List.xlsx")
df

df = df.drop_duplicates()
df

df = df.drop(columns = "Not_Useful_Column")
df

#df["Last_Name"] = df["Last_Name"].str.lstrip("...")
#df["Last_Name"] = df["Last_Name"].str.lstrip("/")
#df["Last_Name"] = df["Last_Name"].str.rstrip("_")
df["Last_Name"] = df["Last_Name"].str.strip("123._/")
df

#df["Phone_Number"] = df["Phone_Number"].str.replace('[^a-zA-Z0-9]','')

#df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])

#df["Phone_Number"] = df["Phone_Number"].apply(lambda x: str(x))

#df["Phone_Number"] = df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])

df["Phone_Number"] = df["Phone_Number"].str.replace('nan--','')

df["Phone_Number"] = df["Phone_Number"].str.replace('Na--','')
df

df[["Street_Address", "State", "Zip_Code"]] = df["Address"].str.split(',',2, expand=True)
df

df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace('Yes','Y')

df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace('No','N')
df

#df = df.replace('N/a','')
#df = df.replace('NaN','')


df=df.fillna('')
df

for x in df.index:
    if df.loc[x, "Do_Not_Contact"] == 'Y':
        df.drop(x, inplace=True)

df

for x in df.index:
    if df.loc[x, "Phone_Number"] == '':
        df.drop(x, inplace=True)

df

#Another way to drop null values
#df = df.dropna(subset="Phone_Number"), inplace=True)

df = df.reset_index(drop=True)
df