This notebook was prepared by Donne Martin. Source and license info is on GitHub.

matplotlib-applied

Applying Matplotlib Visualizations to Kaggle: Titanic
Bar Plots, Histograms, subplot2grid
Normalized Plots
Scatter Plots, subplots
Kernel Density Estimation Plots

Applying Matplotlib Visualizations to Kaggle: Titanic

The Titanic dataset is one of the most popular beginner datasets in data science because it combines real historical data with a clear binary classification target (survived vs. died). Before building any predictive model, exploratory data analysis (EDA) through visualization is essential for understanding feature distributions, class imbalances, and potential predictive signals. The data cleaning function below handles common real-world issues: encoding categorical variables as numbers, filling missing values with group-level medians, and engineering new features like FamilySize from existing columns.

The first cell enables inline plotting, sets figure defaults, and loads the Titanic training data. The clean_data function then maps Sex to an integer code, fills missing Embarked and Fare values, one-hot encodes Embarked, fills missing Age values with the median for the passenger's class and sex (usually a closer estimate than a single global mean), and adds a FamilySize feature that sums the sibling/spouse and parent/child counts.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Set seaborn aesthetic parameters to defaults
seaborn.set()

df_train = pd.read_csv('../data/titanic/train.csv')

def clean_data(df):
    
    # Get the unique values of Sex
    sexes = np.sort(df['Sex'].unique())
    
    # Generate a mapping of Sex from a string to a number representation    
    genders_mapping = dict(zip(sexes, range(len(sexes))))

    # Transform Sex from a string to a number representation
    df['Sex_Val'] = df['Sex'].map(genders_mapping).astype(int)
    
    # Fill in missing values of Embarked before encoding.
    # Since the vast majority of passengers embarked in 'S',
    # we assign the missing values in Embarked to 'S':
    df['Embarked'] = df['Embarked'].fillna('S')

    # Transform Embarked from a string to dummy variables
    df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked_Val')], axis=1)
    
    # Fill in missing values of Fare with the average Fare
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    
    # To keep Age intact, make a copy of it called AgeFill
    # that we will use to fill in the missing ages.
    # Determine the Age typical for each passenger class by Sex_Val.
    # We'll use the median instead of the mean because the Age
    # histogram seems to be right skewed.
    group_median_age = df.groupby(['Sex_Val', 'Pclass'])['Age'].transform('median')
    df['AgeFill'] = df['Age'].fillna(group_median_age)
            
    # Define a new feature FamilySize that is the sum of 
    # Parch (number of parents or children on board) and 
    # SibSp (number of siblings or spouses):
    df['FamilySize'] = df['SibSp'] + df['Parch']
    
    return df

df_train = clean_data(df_train)
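
The group-level median imputation used for AgeFill can be checked in isolation. The sketch below runs it on a small made-up frame; the `toy` data and the `group_median` name are illustrative assumptions, not part of the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two (Sex_Val, Pclass) groups, one missing age in each.
toy = pd.DataFrame({
    'Sex_Val': [0, 0, 0, 1, 1, 1],
    'Pclass':  [1, 1, 1, 3, 3, 3],
    'Age':     [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age within each group, broadcast back to the original row order.
group_median = toy.groupby(['Sex_Val', 'Pclass'])['Age'].transform('median')

# Fill only the missing entries; known ages are left untouched.
toy['AgeFill'] = toy['Age'].fillna(group_median)

print(toy)
```

Using transform rather than apply keeps the result aligned with the original index, so it can be assigned straight back as a column.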

Bar Plots, Histograms, and subplot2grid

subplot2grid provides fine-grained control over subplot placement within a grid. Each call specifies the grid dimensions and the cell position (row, col) for that subplot. Bar plots created with .value_counts().plot(kind='bar') are ideal for visualizing the frequency of categorical variables – here we examine survival counts, passenger class distribution, gender split, and embarkation ports. The Age histogram reveals the age distribution of passengers. Together, these six panels give a rapid overview of the dataset’s structure and potential class imbalances.

# Size of matplotlib figures that contain subplots
figsize_with_subplots = (10, 10)

# Set up a grid of plots
fig = plt.figure(figsize=figsize_with_subplots) 
fig_dims = (3, 2)

# Plot death and survival counts
plt.subplot2grid(fig_dims, (0, 0))
df_train['Survived'].value_counts().plot(kind='bar', 
                                         title='Death and Survival Counts',
                                         color='r',
                                         align='center')

# Plot Pclass counts
plt.subplot2grid(fig_dims, (0, 1))
df_train['Pclass'].value_counts().plot(kind='bar', 
                                       title='Passenger Class Counts')

# Plot Sex counts
plt.subplot2grid(fig_dims, (1, 0))
df_train['Sex'].value_counts().plot(kind='bar', 
                                    title='Gender Counts')
plt.xticks(rotation=0)

# Plot Embarked counts
plt.subplot2grid(fig_dims, (1, 1))
df_train['Embarked'].value_counts().plot(kind='bar', 
                                         title='Ports of Embarkation Counts')

# Plot the Age histogram
plt.subplot2grid(fig_dims, (2, 0))
df_train['Age'].hist()
plt.title('Age Histogram')

# Plot survivors by FamilySize in the remaining grid cell
plt.subplot2grid(fig_dims, (2, 1))

# Get the unique values of FamilySize and its maximum
family_sizes = np.sort(df_train['FamilySize'].unique())
family_size_max = max(family_sizes)

df1 = df_train[df_train['Survived'] == 0]['FamilySize']
df2 = df_train[df_train['Survived'] == 1]['FamilySize']
plt.hist([df1, df2], 
         bins=family_size_max + 1, 
         range=(0, family_size_max), 
         stacked=True)
plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Family Size')
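
subplot2grid also accepts rowspan and colspan arguments, so a single panel can cover several grid cells. The following is a minimal sketch of that layout; the titles are placeholders, and the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
grid = (2, 2)  # 2 rows by 2 columns

# The top panel starts at cell (0, 0) and stretches across both columns.
ax_top = plt.subplot2grid(grid, (0, 0), colspan=2)
ax_top.set_title('Row 0, spanning both columns')

# The bottom row holds two ordinary single-cell panels.
ax_left = plt.subplot2grid(grid, (1, 0))
ax_left.set_title('Row 1, column 0')

ax_right = plt.subplot2grid(grid, (1, 1))
ax_right.set_title('Row 1, column 1')

# Reduce overlap between titles and tick labels.
fig.tight_layout()
```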

Normalized Stacked Bar Plots

Raw counts can be misleading when group sizes differ – for example, there are far more 3rd-class passengers than 1st-class. Normalized plots convert counts to proportions (0 to 1) so you can fairly compare survival rates across groups. pd.crosstab() creates a contingency table, and dividing by row sums normalizes each row. The stacked bar then shows the proportion who survived vs. died within each passenger class, and separately for males and females. These plots reveal that 1st-class females had the highest survival rate, a pattern consistent with the “women and children first” evacuation protocol.

pclass_xt = pd.crosstab(df_train['Pclass'], df_train['Survived'])

# Normalize each row of the cross tab to sum to 1:
pclass_xt_pct = pclass_xt.div(pclass_xt.sum(axis=1).astype(float), axis=0)

pclass_xt_pct.plot(kind='bar', 
                   stacked=True, 
                   title='Survival Rate by Passenger Classes')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')

# Plot female survival rate by Pclass
females_df = df_train[df_train['Sex'] == 'female']
females_xt = pd.crosstab(females_df['Pclass'], females_df['Survived'])
females_xt_pct = females_xt.div(females_xt.sum(1).astype(float), axis=0)
females_xt_pct.plot(kind='bar', 
                    stacked=True, 
                    title='Female Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')

# Plot male survival rate by Pclass
males_df = df_train[df_train['Sex'] == 'male']
males_xt = pd.crosstab(males_df['Pclass'], males_df['Survived'])
males_xt_pct = males_xt.div(males_xt.sum(1).astype(float), axis=0)
males_xt_pct.plot(kind='bar', 
                  stacked=True, 
                  title='Male Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
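
The row normalization behind these plots can be verified on a small made-up table. The sketch below compares the manual division with the normalize='index' option of pd.crosstab; the `toy` data is a hypothetical example, not drawn from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data: 4 first-class and 8 third-class passengers.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Raw counts: rows are classes, columns are survival outcomes.
counts = pd.crosstab(toy['Pclass'], toy['Survived'])

# Manual normalization: divide each row by its own total.
manual = counts.div(counts.sum(axis=1), axis=0)

# Same result using crosstab's own row normalization.
builtin = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')

print(counts)
print(manual)
```

In the raw counts both classes have a similar number of survivors (3 and 2), while the normalized table shows the survival proportions are quite different.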

Scatter Plots and Subplots

plt.subplots(2, 1) creates a figure with two vertically stacked axes. The scatter plot in the top axes places survival status (0 or 1) on the x-axis and age on the y-axis, which helps identify whether certain age groups had higher survival. The stacked histogram below it bins passengers by age group and colors each bin by survival outcome, making it easy to spot that very young children had relatively higher survival rates. Combining the two views gives both individual-level and distribution-level perspectives on the same relationship.

# Set up a grid of plots
fig, axes = plt.subplots(2, 1, figsize=figsize_with_subplots)

# Histogram of known ages (missing values dropped) segmented by Survived
df1 = df_train[df_train['Survived'] == 0]['Age'].dropna()
df2 = df_train[df_train['Survived'] == 1]['Age'].dropna()
max_age = max(df_train['AgeFill'])

axes[1].hist([df1, df2], 
             bins=int(max_age // 10),  # bins must be an integer
             range=(0, max_age),       # start at 0 so infants under 1 are counted
             stacked=True)
axes[1].legend(('Died', 'Survived'), loc='best')
axes[1].set_title('Survivors by Age Groups Histogram')
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Count')

# Scatter plot Survived and AgeFill
axes[0].scatter(df_train['Survived'], df_train['AgeFill'])
axes[0].set_title('Survivors by Age Plot')
axes[0].set_xlabel('Survived')
axes[0].set_ylabel('Age')
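
The same two-panel layout can be tried without the Titanic file. This is a rough sketch on synthetic ages; the random data, the variable names, and the Agg backend line are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages standing in for the two outcome groups (not real data).
rng = np.random.default_rng(0)
ages_died = rng.uniform(0, 80, size=60)
ages_survived = rng.uniform(0, 80, size=40)

# Two vertically stacked axes, returned as a 1-D array.
fig, axes = plt.subplots(2, 1, figsize=(8, 8))

# Individual-level view: one point per passenger, outcome on the x-axis.
axes[0].scatter(np.zeros_like(ages_died), ages_died, alpha=0.4)
axes[0].scatter(np.ones_like(ages_survived), ages_survived, alpha=0.4)
axes[0].set_xticks([0, 1])
axes[0].set_xlabel('Survived')
axes[0].set_ylabel('Age')

# Distribution-level view: decade-wide bins, stacked by outcome.
counts, edges, _ = axes[1].hist([ages_died, ages_survived],
                                bins=8, range=(0, 80), stacked=True)
axes[1].legend(('Died', 'Survived'), loc='best')
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Count')
```

Because every point in the scatter shares one of two x positions, a low alpha value helps show where points pile up.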

Kernel Density Estimation (KDE) Plots

A KDE plot is a smoothed version of a histogram that estimates the probability density function of a variable. By plotting separate KDE curves for each passenger class, you can compare their age distributions without the binning artifacts of histograms. The .plot(kind='kde') method in pandas uses Gaussian kernels to smooth the data. In the Titanic dataset, you will typically see that 1st-class passengers skew older, while 3rd-class passengers skew younger – a demographic pattern that influenced survival outcomes.

# Get the unique values of Pclass:
passenger_classes = np.sort(df_train['Pclass'].unique())

for pclass in passenger_classes:
    df_train.AgeFill[df_train.Pclass == pclass].plot(kind='kde')
plt.title('Age Density Plot by Passenger Class')
plt.xlabel('Age')
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')
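
To make the smoothing step concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not the pandas implementation: the `gaussian_kde_1d` helper, the sample ages, and the choice of Scott's rule of thumb for the bandwidth are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Assumed bandwidth: Scott's rule of thumb for one-dimensional data.
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, averaged and rescaled to a density.
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

# Hypothetical ages, not taken from the Titanic data.
ages = np.array([2.0, 18.0, 22.0, 25.0, 30.0, 35.0, 40.0, 58.0])
grid = np.linspace(-40, 120, 801)
density = gaussian_kde_1d(ages, grid)

# A density should integrate to roughly 1 over a wide enough grid.
area = density.sum() * (grid[1] - grid[0])
print(area)
```

Each observation contributes one bell curve, and the estimate is their average, which is why the curve extends a little beyond the smallest and largest observed ages.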