Data Cleaning in PandasΒΆ
Data cleaning (also called data wrangling or data munging) is the process of detecting and correcting corrupt, inaccurate, or unusable records in a dataset. In real-world projects, data scientists spend an estimated 60-80% of their time on cleaning tasks before any modeling can begin.
This notebook walks through a practical data cleaning workflow on a customer call list: removing duplicates, dropping useless columns, stripping unwanted characters from text fields, standardizing phone numbers, splitting address fields, normalizing categorical values, handling missing data, and filtering out records based on business rules. Each of these steps uses core Pandas methods like drop_duplicates(), str.strip(), str.split(), fillna(), and conditional row dropping.
import pandas as pd
df = pd.read_excel(r"C:\Users\alexf\OneDrive\Documents\Pandas Tutorial\Customer Call List.xlsx")
df
df = df.drop_duplicates()
df
df = df.drop(columns = "Not_Useful_Column")
df
#df["Last_Name"] = df["Last_Name"].str.lstrip("...")
#df["Last_Name"] = df["Last_Name"].str.lstrip("/")
#df["Last_Name"] = df["Last_Name"].str.rstrip("_")
df["Last_Name"] = df["Last_Name"].str.strip("123._/")
df
#df["Phone_Number"] = df["Phone_Number"].str.replace('[^a-zA-Z0-9]','')
#df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
#df["Phone_Number"] = df["Phone_Number"].apply(lambda x: str(x))
#df["Phone_Number"] = df["Phone_Number"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
df["Phone_Number"] = df["Phone_Number"].str.replace('nan--','')
df["Phone_Number"] = df["Phone_Number"].str.replace('Na--','')
df
df[["Street_Address", "State", "Zip_Code"]] = df["Address"].str.split(',',2, expand=True)
df
df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace('Yes','Y')
df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace('No','N')
df
#df = df.replace('N/a','')
#df = df.replace('NaN','')
df=df.fillna('')
df
for x in df.index:
if df.loc[x, "Do_Not_Contact"] == 'Y':
df.drop(x, inplace=True)
df
for x in df.index:
if df.loc[x, "Phone_Number"] == '':
df.drop(x, inplace=True)
df
#Another way to drop null values
#df = df.dropna(subset="Phone_Number"), inplace=True)
df = df.reset_index(drop=True)
df