Data Cleaning in PandasΒΆ
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset. It is often the most time-consuming phase of a data science project, but it directly determines the quality of any downstream analysis or model.
This notebook demonstrates a practical cleaning workflow on a customer call list loaded from Excel: removing duplicate rows with drop_duplicates(), dropping irrelevant columns, stripping unwanted characters from text fields with .str.strip(), standardizing phone numbers with .str.replace(), splitting compound address fields into separate columns with .str.split(expand=True), normalizing categorical values, filling missing data with .fillna(), conditionally dropping rows based on business rules, and resetting the index for a clean final output.
import pandas as pd
df = pd.read_excel(r"C:\Users\alexf\OneDrive\Documents\Pandas Tutorial\Customer Call List.xlsx")
df
df = df.drop_duplicates()
df
df = df.drop(columns = "Not_Useful_Column")
df
#df["Last_Name"] = df["Last_Name"].str.lstrip("...")
#df["Last_Name"] = df["Last_Name"].str.lstrip("/")
#df["Last_Name"] = df["Last_Name"].str.rstrip("_")
df["Last_Name"] = df["Last_Name"].str.strip("123._/")
df
#df["Phone_Number"] = df["Phone_Number"].str.replace('[^a-zA-Z0-9]','')
#df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
#df["Phone_Number"] = df["Phone_Number"].apply(lambda x: str(x))
#df["Phone_Number"] = df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
df["Phone_Number"] = df["Phone_Number"].str.replace('nan--','')
df["Phone_Number"] = df["Phone_Number"].str.replace('Na--','')
df
df[["Street_Address", "State", "Zip_Code"]] = df["Address"].str.split(',',2, expand=True)
df
df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace('Yes','Y')
df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace('No','N')
df
#df = df.replace('N/a','')
#df = df.replace('NaN','')
df=df.fillna('')
df
for x in df.index:
if df.loc[x, "Do_Not_Contact"] == 'Y':
df.drop(x, inplace=True)
df
for x in df.index:
if df.loc[x, "Phone_Number"] == '':
df.drop(x, inplace=True)
df
#Another way to drop null values
#df = df.dropna(subset="Phone_Number"), inplace=True)
df = df.reset_index(drop=True)
df