EDA in Pandas
Exploratory Data Analysis (EDA) is the systematic process of examining a dataset's structure, distributions, and relationships before building any models. It combines summary statistics, null analysis, correlation computation, grouping, and visualization to uncover patterns and data quality issues.
This notebook walks through a complete EDA workflow on world population data: formatting float display with pd.set_option(), inspecting structure with .info() and .describe(), auditing missing values with .isnull().sum(), measuring cardinality with .nunique(), computing and visualizing correlation matrices with .corr() and Seaborn heatmaps, grouping by continent with .groupby(), filtering with .str.contains(), transposing grouped data for time-series plotting, and creating box plots to identify outliers.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv(r"C:\Users\alexf\OneDrive\Documents\Pandas Tutorial\world_population.csv")
df
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.info()
df.describe()
df.isnull().sum()
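Raw null counts are easier to judge when set against the number of rows. A minimal sketch of that idea follows; it runs on a small made-up frame (the `toy` table and its `Area` column are hypothetical, not the population dataset) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the population table; values are made up.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "2022 Population": [100.0, np.nan, 300.0, 400.0],
    "Area": [10.0, 20.0, np.nan, np.nan],
})

# Count of missing values per column, same call as df.isnull().sum()
missing_count = toy.isnull().sum()

# Percentage of missing values per column: mean of the boolean null mask times 100
missing_pct = toy.isnull().mean() * 100

# Combine both views and list the most incomplete columns first
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))
```

The same two expressions should apply unchanged to `df`, assuming it loaded as a regular DataFrame.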
df.nunique()
df.sort_values(by="World Population Percentage", ascending=False).head(10)
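# Restrict to numeric columns; pandas 2.0+ raises an error on string columns otherwise
df.corr(numeric_only=True)
# Set the figure size before drawing so it applies to the heatmap
plt.rcParams['figure.figsize'] = (20,7)
sns.heatmap(df.corr(numeric_only=True), annot = True)
plt.show()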
# numeric_only=True skips text columns, which cannot be averaged
df.groupby('Continent').mean(numeric_only=True).sort_values(by="2022 Population",ascending=False)
df[df['Continent'].str.contains('Oceania')]
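`.str.contains()` performs a substring (by default, regex) match, so it is most useful when several labels share a fragment; for a single exact label, plain equality is stricter. The sketch below contrasts the two on a hypothetical `toy` frame. The `na=False` argument is included on the assumption that the column may hold missing values, which would otherwise produce a mask that cannot be used for filtering.

```python
import pandas as pd

# Hypothetical rows; only the Continent labels matter for this illustration.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["Oceania", "North America", "South America", None],
})

# Substring match: keeps every continent whose name contains "America".
# na=False treats missing values as non-matching instead of propagating NaN.
americas = toy[toy["Continent"].str.contains("America", na=False)]

# Exact match: keeps only rows whose label is exactly "Oceania".
oceania = toy[toy["Continent"] == "Oceania"]

print(americas)
print(oceania)
```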
df2 = df.groupby('Continent')[['1970 Population',
    '1980 Population', '1990 Population', '2000 Population',
    '2010 Population', '2015 Population', '2020 Population',
    '2022 Population']].mean().sort_values(by="2022 Population",ascending=False)
df2
df.columns
df3 = df2.transpose()
df3
df3.plot()
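After the transpose, the year columns become the index and each continent becomes a line. The index labels are still strings such as `'1970 Population'`, so the plot spaces them evenly even though the gaps between years differ. One possible refinement, sketched here on hypothetical data (the `toy`, `by_continent`, and `by_year` names are illustrative), is to convert the labels to integer years before plotting.

```python
import pandas as pd

# Hypothetical two-continent table with made-up population figures.
toy = pd.DataFrame({
    "Continent": ["X", "X", "Y"],
    "1970 Population": [10.0, 30.0, 5.0],
    "2022 Population": [20.0, 60.0, 15.0],
})

# Mean population per continent for each year column
by_continent = toy.groupby("Continent")[["1970 Population", "2022 Population"]].mean()

# Transpose so that years form the index and continents form the columns
by_year = by_continent.transpose()

# Strip the " Population" suffix and cast to int so the axis is numeric
by_year.index = by_year.index.str.replace(" Population", "", regex=False).astype(int)

print(by_year)
```

Calling `by_year.plot()` would then place the points at their true positions on the year axis.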
df.boxplot(figsize=(20,10))
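The points drawn beyond the whiskers of a box plot follow, by default in Matplotlib, the 1.5 × IQR rule. To list those values rather than only see them, the rule can be applied directly. This is a minimal sketch on a hypothetical series, not a statement about which countries are outliers in the dataset.

```python
import pandas as pd

# Hypothetical values, not taken from the population data.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0], name="toy_population")

# Quartiles and interquartile range
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are the ones a box plot would draw as individual points
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

For a DataFrame column, the same steps should work with `s = df['2022 Population']`, assuming that column is numeric.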
df.select_dtypes(include='float')