import pandas as pd
nba = pd.read_csv("nba.csv")
nba.head(7)
nba.tail(5)
nba.index
nba.values
nba.shape
nba.dtypes
nba.columns
nba.axes
nba.info()
nba.dtypes.value_counts() # .get_dtype_counts() was removed in pandas 1.0

Differences between Shared Methods

Pandas Series and DataFrames share many methods like .sum(), .mean(), and .max(), but they behave differently depending on the data structure. When called on a Series, these methods operate on a single column and return a scalar value. When called on a DataFrame, they operate along an axis – by default summing each column (axis=0), but you can switch to row-wise aggregation with axis='columns'.

Why this matters: Understanding axis behavior is fundamental to avoiding subtle bugs in data pipelines. In real-world datasets with multiple numeric columns (like revenue streams by product line), choosing the correct axis determines whether you get totals per metric or totals per observation.

rev = pd.read_csv("revenue.csv", index_col = "Date")
rev.head(3)
s = pd.Series([1,2,3])
s.sum()
rev.sum()
rev.sum(axis='columns')
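Since revenue.csv is not included here, a minimal self-contained sketch (city names and values invented) shows the same axis behavior:

```python
import pandas as pd

# Stand-in for revenue.csv: two revenue streams over three dates
rev = pd.DataFrame(
    {"New York": [100, 200, 300], "Boston": [10, 20, 30]},
    index=["Jan", "Feb", "Mar"],
)

col_totals = rev.sum()                 # default axis=0: one total per column
row_totals = rev.sum(axis="columns")   # one total per row (per date)

print(col_totals["New York"])  # 600
print(row_totals["Jan"])       # 110
```

Both directions sum the same numbers, so the grand totals agree; the axis only decides how the numbers are grouped.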

Selecting One Column from a DataFrame

Accessing a single column from a DataFrame is one of the most frequent operations in data analysis. Pandas provides two syntaxes: dot notation (df.Name) and bracket notation (df["Name"]). Both return a Pandas Series, which is the one-dimensional building block underlying every DataFrame column.

Bracket notation is preferred because dot notation fails when column names contain spaces, conflict with built-in DataFrame attributes, or match Python reserved words. In ML workflows, column selection is the first step toward feature engineering – isolating the variables you want to transform, scale, or feed into a model.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba.Name
nba.Number
nba.Salary
nba["Name"].head(3)
nba["Number"]
nba["Salary"]

type(nba["Name"])
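A tiny constructed frame (column names invented) makes the dot-notation limitation concrete:

```python
import pandas as pd

# "Team Name" contains a space on purpose
df = pd.DataFrame({"Name": ["LeBron", "Curry"], "Team Name": ["LAL", "GSW"]})

print(type(df.Name))            # dot notation works for a plain identifier
print(df["Team Name"].iloc[0])  # bracket notation handles the space
# df.Team Name                  # would be a SyntaxError
```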

Select Two or More Columns in DataFrame

To select multiple columns, pass a list of column names inside the selection brackets: df[["col1", "col2"]] (the outer brackets perform the selection, the inner pair is a Python list). The result is a new DataFrame (not a Series), preserving the two-dimensional structure. You can also store the column list in a variable for reuse, which keeps your code DRY when applying the same selection in multiple places.

Why this matters: Multi-column selection is essential for creating feature matrices in machine learning. When preparing data for a model, you typically select a subset of columns as your feature set (X) and a single column as your target (y). Mastering this operation makes that workflow second nature.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba[["Team", "Name"]].head(3)
nba[["Number", "College"]].head(3)
nba[["Salary", "Team", "Name"]].tail(5)
select = ["Salary", "Team", "Name"]
nba[select]  # same result as the line above
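The X/y split described above can be sketched with a toy numeric frame (all values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 22],
    "Weight": [180, 200, 170],
    "Salary": [1.0, 2.5, 0.8],
})

features = ["Age", "Weight"]   # reusable feature list
X = df[features]               # DataFrame: 2-D feature matrix
y = df["Salary"]               # Series: 1-D target

print(X.shape)  # (3, 2)
print(y.ndim)   # 1
```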

Add New Column to DataFrame

Pandas offers two main ways to add columns. The simplest is direct assignment (df["new_col"] = value), which appends the column at the end. For precise positioning, df.insert(position, column, value) lets you place the new column at a specific index.

Why this matters: Adding derived columns is the heart of feature engineering. In practice, you might create a "Sport" label for merging datasets, compute a ratio of two existing columns, or flag rows meeting a business rule. The insert() method is particularly useful when column order matters for reporting or when downstream code expects columns in a specific position.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba["Sport"] = "Basketball"
nba.head(3)
nba["League"] = "National Basketball Association"
nba.head(3)
nba = pd.read_csv("nba.csv")
nba.head(3)
nba.insert(3, column = "Sport", value = "Basketball")
nba.head(3)
nba.insert(7, column = "League", value = "National Basketball Association")
nba.head(3)
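A self-contained sketch (column names invented) contrasts the two approaches and their effect on column order:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Salary": [1, 2]})

df["Sport"] = "Basketball"                   # appended as the last column
df.insert(1, column="League", value="NBA")   # placed at position 1

print(list(df.columns))  # ['Name', 'League', 'Salary', 'Sport']
```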

Broadcasting Operations

Broadcasting lets you apply a mathematical operation to every element in a column at once, without writing a loop. Pandas supports both method syntax (.add(), .sub(), .mul(), .div()) and standard arithmetic operators (+, -, *, /). The result is a new Series with the operation applied element-wise.

Why this matters: Broadcasting is how you perform feature scaling, unit conversions, and normalization – all critical preprocessing steps before training ML models. For example, converting weight from pounds to kilograms or normalizing salary into millions are common transformations that make data more interpretable and model-ready. The vectorized approach is also orders of magnitude faster than iterating row by row.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba["Age"].add(5)
nba["Age"] + 5 #this works the same

nba["Salary"].sub(5000000)
nba["Salary"] - 5000000

nba["Weight"].mul(0.453592)
nba["Weight"] * 0.453592 # same result; 1 lb ≈ 0.453592 kg

nba["Weight in Kilograms"] = nba["Weight"] * 0.453592
nba.head(3)
nba["Salary"].div(1000000)
nba["Salary in Millions"] = nba["Salary"] / 1000000
nba.head(3)
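The same transformations work on any numeric columns; a minimal runnable sketch (values invented):

```python
import pandas as pd

df = pd.DataFrame({"Weight": [180.0, 220.0], "Salary": [5_000_000, 12_000_000]})

# Element-wise, no explicit loop over rows
df["Weight in Kilograms"] = df["Weight"] * 0.453592   # 1 lb ≈ 0.453592 kg
df["Salary in Millions"] = df["Salary"] / 1_000_000

print(df["Salary in Millions"].tolist())  # [5.0, 12.0]
```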

A Review of the .value_counts() Method

The .value_counts() method returns a Series containing the frequency of each unique value in a column, sorted in descending order by default. It is one of the quickest ways to understand the distribution of categorical or discrete data.

Why this matters: Frequency analysis reveals class imbalance (critical for classification tasks), data quality issues (unexpected categories), and dominant patterns in your dataset. In real-world projects, running value_counts() on key columns is a standard first step in exploratory data analysis before any modeling begins.

nba["Team"].value_counts()
nba["Position"].value_counts().head(1)
nba["Weight"].value_counts().tail()
nba["Salary"].value_counts()
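A small constructed Series (positions invented) shows the counts, and the normalize=True option that turns them into relative frequencies, which is handy for spotting class imbalance:

```python
import pandas as pd

positions = pd.Series(["PG", "SG", "PG", "C", "PG", "SG"])

counts = positions.value_counts()                # descending frequency
shares = positions.value_counts(normalize=True)  # relative frequencies

print(counts.index[0], counts.iloc[0])  # PG 3
print(shares["PG"])                     # 0.5
```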

Drop Rows with Null Values

The .dropna() method removes rows (or columns) that contain missing values. The how parameter controls the behavior: how="all" drops rows only when every value is NaN, while the default how="any" drops rows with at least one missing value. The subset parameter lets you target specific columns for the null check.

Why this matters: Most machine learning algorithms cannot handle missing values natively. Deciding how to handle nulls – dropping, filling, or imputing – is one of the most consequential data cleaning decisions you will make. Dropping rows with how="all" is a safe first pass to remove completely empty records, while subset gives you fine-grained control over which columns must be present.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba.tail(3)
nba.dropna(how = "all")
nba.dropna(subset = ["Salary", "College"])
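The how and subset behaviors can be verified on a tiny frame with a deliberately all-null row (data invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["A", None, "C"],
    "Salary": [100.0, np.nan, np.nan],
    "College": ["Duke", np.nan, "UCLA"],
})

print(len(df.dropna(how="all")))          # 2: only the fully null row is removed
print(len(df.dropna(subset=["Salary"])))  # 1: any row missing Salary is removed
```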

Fill in Null Values with the .fillna() Method

Instead of dropping rows with missing data, .fillna() lets you replace NaN values with a specified default. You can fill an entire DataFrame at once or target individual columns with appropriate fill values. Prefer assigning the result back (df["col"] = df["col"].fillna(...)); calling fillna with inplace=True on a column selection is discouraged in modern pandas and raises warnings under copy-on-write.

Why this matters: Filling nulls preserves your dataset size, which is critical when data is scarce. The choice of fill value depends on context: zero works for numeric fields where absence means "none," while a descriptive string like "No College" is better for categorical fields. In production ML pipelines, consistent null-handling logic ensures your model receives clean inputs at inference time.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba.fillna(0) # fills every null in the DataFrame; only sensible when 0 is a valid stand-in for all columns
nba["Salary"] = nba["Salary"].fillna(0)
nba.head()
nba["College"] = nba["College"].fillna("No College")
nba.head(5)
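A self-contained sketch (values invented) of the per-column pattern, using assignment rather than inplace=True:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Salary": [100.0, np.nan], "College": ["Duke", np.nan]})

# Assigning the result back avoids mutating a column selection in place
df["Salary"] = df["Salary"].fillna(0)
df["College"] = df["College"].fillna("No College")

print(int(df["Salary"].isna().sum()))  # 0
print(df["College"].iloc[1])           # No College
```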

The .astype() Method

The .astype() method converts a column from one data type to another – for example, from float to int, or from object (string) to category. Before converting, you usually need to handle null values first, since NaN cannot be represented in integer types.

Why this matters: Correct data types reduce memory usage and prevent unexpected behavior in calculations. Converting low-cardinality string columns to the category dtype can cut memory usage dramatically on large datasets. Using .nunique() beforehand helps you identify which columns are good candidates for category conversion. In ML pipelines, proper typing ensures that encoding steps (like one-hot encoding) work as intended.

nba = pd.read_csv("nba.csv").dropna(how = "all")
nba["Salary"] = nba["Salary"].fillna(0)
nba["College"] = nba["College"].fillna("None")
nba.head(6)
nba.dtypes
nba.info()
nba["Salary"] = nba["Salary"].astype("int")
nba.head(3)
nba["Number"] = nba["Number"].astype("int")
nba["Age"] = nba["Age"].astype("int")
nba.head(3)
nba["Position"].nunique()
nba["Position"] = nba["Position"].astype("category")
nba["Team"] = nba["Team"].astype("category")
nba.head(3)
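The memory claim is easy to check on a synthetic Series with many rows but few distinct values (the team names here are just placeholders):

```python
import pandas as pd

# 4,000 rows but only two distinct strings: a good category candidate
teams = pd.Series(["Celtics", "Lakers"] * 2000)

as_object = teams.memory_usage(deep=True)
as_category = teams.astype("category").memory_usage(deep=True)

print(teams.nunique())           # 2
print(as_category < as_object)   # True
```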

Sort a DataFrame with the .sort_values() Method, Part 1

The .sort_values() method reorders rows based on one or more columns. The ascending parameter controls sort direction, and na_position determines whether NaN values appear first or last. Using inplace=True modifies the DataFrame directly.

Why this matters: Sorting is essential for identifying top-N records (highest salaries, best-performing products) and for visual inspection during exploratory analysis. The na_position parameter is particularly important when you need to audit missing data – placing nulls first makes them immediately visible at the top of your output.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba.sort_values("Name", ascending = False)

nba.sort_values("Age", ascending = False)

nba.sort_values("Salary", ascending = False, inplace = True)
nba.head(3)
nba.sort_values("Salary", ascending = False, na_position = "first").tail()

Sort a DataFrame with the .sort_values() Method, Part 2

When sorting by multiple columns, pass a list of column names and a corresponding list of ascending/descending flags. Pandas applies the sort hierarchically – the first column is the primary sort key, and ties are broken by subsequent columns.

Why this matters: Multi-column sorting mirrors how databases handle ORDER BY clauses with multiple fields. In business reporting, you might sort employees first by department (alphabetically) and then by salary (descending) within each department to quickly identify top earners per team.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba.sort_values(["Name", "Team"], ascending = [False, True], inplace = True)
nba.head(3)
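The hierarchical tie-breaking is easiest to see on a four-row frame (values invented):

```python
import pandas as pd

df = pd.DataFrame({"Team": ["B", "A", "A", "B"], "Salary": [1, 3, 2, 4]})

# Primary key: Team ascending; ties broken by Salary descending
out = df.sort_values(["Team", "Salary"], ascending=[True, False])

print(out["Team"].tolist())    # ['A', 'A', 'B', 'B']
print(out["Salary"].tolist())  # [3, 2, 4, 1]
```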

Sort a DataFrame with the .sort_index() Method

After sorting by values, the rows are no longer in index order. The .sort_index() method reorders the DataFrame by its index; with the default RangeIndex, an ascending sort restores the original row order (ascending=False gives the reverse). This is useful after value-based sorts when you want to return to a predictable ordering.

Why this matters: Maintaining a predictable index order is important for reproducibility and for operations that rely on positional alignment, such as concatenation or time-series analysis where the index represents chronological order.

nba = pd.read_csv("nba.csv")
nba.head(3)
nba.sort_values(["Number", "Salary", "Name"], inplace = True)
nba.tail(3)
nba.sort_index(ascending = False, inplace = True)
nba.head(3)

Rank Values with the .rank() Method

The .rank() method assigns a numerical rank to each value in a column, with ascending=False giving rank 1 to the highest value. By default, tied values share the average of the ranks they span, which can produce fractional ranks; combined with .astype("int"), you can create clean integer rankings that are easy to read and filter on, though truncating fractional tie ranks discards that tie information.

Why this matters: Ranking is a common requirement in leaderboards, percentile calculations, and competition-style analyses. Unlike sorting (which reorders rows), ranking adds a new informational column while preserving the original row order, making it easy to compare an entity’s rank against its other attributes.

nba = pd.read_csv("nba.csv").dropna(how = "all")
nba["Salary"] = nba["Salary"].fillna(0).astype("int")
nba.head(3)
nba["Salary Rank"] = nba["Salary"].rank(ascending = False).astype("int")
nba.head(3)
nba.sort_values(by = "Salary", ascending = False)
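The tie behavior mentioned above can be seen on a small Series (values invented); method="first" is one way to force clean integer ranks:

```python
import pandas as pd

salaries = pd.Series([500, 300, 500, 100])

avg_ranks = salaries.rank(ascending=False)  # ties share the average rank
print(avg_ranks.tolist())  # [1.5, 3.0, 1.5, 4.0]

# method="first" breaks ties by position, yielding distinct integer ranks
int_ranks = salaries.rank(ascending=False, method="first").astype("int")
print(int_ranks.tolist())  # [1, 3, 2, 4]
```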