Run this notebook: Open in Colab Open in Kaggle

import pandas as pd

GroupBy – Split-Apply-Combine for Aggregation¶

The GroupBy operation is one of the most powerful tools in Pandas for summarizing data. It follows a split-apply-combine pattern: split the DataFrame into groups based on one or more columns, apply an aggregation function to each group, and combine the results. This notebook demonstrates grouping the Fortune 1000 dataset by sector and industry, inspecting group objects with .first() and .groups, extracting specific groups with .get_group(), computing aggregations like .max(), .min(), .sum(), using .agg() with a dictionary for per-column aggregation functions, and iterating over groups to extract the highest-revenue company per sector. These techniques are the Pandas equivalent of SQL GROUP BY and are essential for business analytics and reporting.

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")
fortune.head()

type(sectors)

len(sectors)

fortune["Sector"].nunique()

sectors.first()

sectors.groups

fortune.loc[24,]

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")
fortune.head()

sectors.get_group("Energy")
sectors.get_group("Technology")

sectors.max()
sectors.min()
sectors.sum()

sectors.get_group("Apparel")["Profits"].sum()

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby(["Sector","Industry"])
fortune.head()

sectors.sum()

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")
fortune.head()

sectors.agg({"Revenue":"sum",
            "Profits":"sum",
            "Employees":"mean"})

df = pd.DataFrame(columns= fortune.columns)

for sector, data in sectors:
    highest_revenue = data.nlargest(1, "Revenue")
    df = df.append(highest_revenue)

df

cities = fortune.groupby("Location")

for sector, data in cities:
    highest_revenue = data.nlargest(1, "Revenue")
    df = df.append(highest_revenue)

df