EDA in Pandas
Exploratory Data Analysis (EDA) is the critical first phase of any data science project. Before building models, you need to understand your data's structure, distribution, relationships, and quality issues. EDA combines summary statistics, correlation analysis, grouping, and visualization to build intuition about the dataset.
This notebook demonstrates a complete EDA workflow on world population data using Pandas, Seaborn, and Matplotlib. Key techniques include: inspecting data types with .info(), generating descriptive statistics with .describe(), checking for missing values with .isnull().sum(), computing correlation matrices with .corr(), grouping and aggregating with .groupby(), filtering with string methods, transposing DataFrames for plotting, and creating heatmaps, line plots, and box plots to visualize distributions and trends.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv(r"C:\Users\alexf\OneDrive\Documents\Pandas Tutorial\world_population.csv")
df
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.info()
df.describe()
df.isnull().sum()
df.nunique()
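The missing-value and uniqueness checks above can be hard to read on a wide table, so here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from world_population.csv. It also shows `.isnull().mean()`, which gives the share of missing cells per column rather than the raw count.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (not from world_population.csv).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["X", "X", "Y", None],
    "Population": [10.0, np.nan, 30.0, 40.0],
})

missing_counts = toy.isnull().sum()    # number of missing cells per column
missing_share = toy.isnull().mean()    # fraction of missing cells per column
distinct_counts = toy.nunique()        # distinct non-null values per column

print(missing_counts)
print(missing_share)
print(distinct_counts)
```

Note that `.nunique()` ignores missing values by default, so a column with many gaps can report fewer distinct values than you might expect.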
df.sort_values(by="World Population Percentage", ascending=False).head(10)
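# Restrict the correlation to numeric columns; pandas 2.0+ raises an error when text columns such as Continent are included.
df.corr(numeric_only=True)
# Set the figure size before drawing so that it applies to the heatmap.
plt.rcParams['figure.figsize'] = (20,7)
sns.heatmap(df.corr(numeric_only=True), annot = True)
plt.show()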
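Correlation is only defined for numeric columns, so one option is to select them explicitly before calling `.corr()`. The sketch below assumes a toy table with made-up values and hypothetical column names; the resulting matrix is the kind of object that can be passed to `sns.heatmap`.

```python
import pandas as pd

# Toy data with made-up values; "Country" is text and must be excluded.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Area": [1.0, 2.0, 3.0, 4.0],
    "Population": [2.0, 4.0, 6.0, 8.0],
    "Density": [4.0, 3.0, 2.0, 1.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr)
```

In this toy table, Population rises in step with Area and Density falls in step with it, so the matrix contains both a perfect positive and a perfect negative correlation.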
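# numeric_only=True skips text columns, which pandas 2.0+ would otherwise refuse to average.
df.groupby('Continent').mean(numeric_only=True).sort_values(by="2022 Population",ascending=False)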
df[df['Continent'].str.contains('Oceania')]
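Filtering with `.str.contains()` returns a missing value, not `False`, for rows where the column itself is missing, and a mask containing missing values cannot be used for boolean indexing. A minimal sketch of two safer variants, using made-up rows: passing `na=False`, or using an exact equality test when the whole label is known.

```python
import pandas as pd

# Toy data with made-up country labels; one continent is missing.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Oceania", None, "Asia"],
})

# Substring match; na=False treats missing continents as "no match".
mask_contains = toy["Continent"].str.contains("Oceania", na=False)

# Exact match; comparisons with a missing value evaluate to False.
mask_exact = toy["Continent"] == "Oceania"

print(toy[mask_contains])
print(toy[mask_exact])
```

The substring form is useful when labels vary (for example, extra words around the name); the equality form is stricter and avoids accidental partial matches.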
df2 = df.groupby('Continent')[['1970 Population',
    '1980 Population', '1990 Population', '2000 Population',
    '2010 Population', '2015 Population', '2020 Population',
    '2022 Population']].mean().sort_values(by="2022 Population",ascending=False)
df2
df.columns
df3 = df2.transpose()
df3
df3.plot()
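The transpose matters because `DataFrame.plot()` uses the index for the x-axis and draws one line per column. After transposing, the years sit on the index and each continent becomes a line. The sketch below uses made-up averages and additionally converts labels such as "1970 Population" to integer years, an optional step that is not in the original notebook, so that the axis is spaced by actual time gaps.

```python
import pandas as pd

# Made-up continent averages; column labels mimic the "<year> Population" pattern.
by_continent = pd.DataFrame(
    {
        "1970 Population": [1.0, 2.0],
        "2000 Population": [3.0, 5.0],
        "2022 Population": [4.0, 9.0],
    },
    index=pd.Index(["X", "Y"], name="Continent"),
)

# Swap rows and columns: years become the index, continents become columns.
by_year = by_continent.transpose()

# Optional: turn "1970 Population" into the integer 1970.
by_year.index = by_year.index.str.replace(" Population", "", regex=False).astype(int)
by_year.index.name = "Year"

print(by_year)
```

Calling `by_year.plot()` on the result would draw one line per continent against the year.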
df.boxplot(figsize=(20,10))
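A box plot summarizes each numeric column by its quartiles, and with Matplotlib's default setting (`whis=1.5`) points further than 1.5 times the interquartile range from the box are drawn as individual outliers. The same rule can be computed directly, which helps when the plot is dominated by a few extreme values. This is a rough sketch on a made-up series, not on the population data.

```python
import pandas as pd

# Made-up values; the last one is deliberately extreme.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="toy_column")

q1 = values.quantile(0.25)   # lower edge of the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # interquartile range (height of the box)

# Points outside these fences are the ones a default box plot marks as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```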
df.select_dtypes(include='float')