set_index() and reset_index() MethodsΒΆ
The index in a Pandas DataFrame serves as the row identifier and is used for alignment, lookup, and joining operations. The set_index() method promotes an existing column to become the index, while reset_index() moves the current index back into a regular column and restores the default integer index. The drop parameter in reset_index() controls whether the old index is discarded or preserved as a column.
Why this matters: Choosing the right index is critical for efficient data access. Setting a meaningful column (like Film title or Date) as the index enables fast label-based lookups with .loc[] and makes join operations more intuitive. In time-series analysis, a datetime index unlocks powerful resampling and windowing functions.
import pandas as pd
bond = pd.read_csv('jamesbond.csv')
bond.head(3)
bond.set_index("Film", inplace = True)
bond.head(3)
bond.reset_index(drop = False, inplace = True)
bond.head(3)
bond.reset_index(inplace = True)
bond.set_index("Year", inplace = True)
bond.head(3)
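Since jamesbond.csv may not be at hand, here is a minimal, self-contained sketch of the drop parameter using a tiny hypothetical frame (the rows are illustrative, not real data):

```python
import pandas as pd

# Tiny stand-in for jamesbond.csv
df = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger", "Moonraker"],
    "Year": [1962, 1964, 1979],
})

indexed = df.set_index("Film")               # "Film" becomes the index
print(indexed.index.name)                    # Film

restored = indexed.reset_index(drop=False)   # old index returns as a column
print("Film" in restored.columns)            # True

discarded = indexed.reset_index(drop=True)   # old index is thrown away
print("Film" in discarded.columns)           # False
```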
Retrieve Rows by Index Label with .loc[]ΒΆ
The .loc[] accessor selects rows by their index label (not position). You can retrieve a single row, a slice of rows using label-based ranges (inclusive on both ends), or a list of specific labels. When the index is sorted, label-based slicing is both efficient and readable.
Why this matters: Label-based selection is more robust than position-based selection because it remains correct even after sorting or filtering operations that change row positions. In production code, using .loc[] with meaningful labels (like film titles or country names) makes the intent of your code clear and self-documenting.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["Goldfinger"]
bond.loc["GoldenEye"]
bond.loc["Casino Royale"]
bond.loc[: "On Her Majesty's Secret Service"]
bond.loc["Diamonds Are Forever" : "Moonraker"]
bond.loc["GoldenEye" :]
bond.loc[["Moonraker", "Octopussy"]]
bond.loc[["For Your Eyes Only", "Live and Let Die"]]
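As a quick self-contained illustration (using a small hypothetical frame rather than the CSV), a single label yields a Series while a list of labels yields a DataFrame in the order requested:

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [1964, 1995, 1979]},
    index=["Goldfinger", "GoldenEye", "Moonraker"],
)

# A single label returns one row as a Series
row = df.loc["Goldfinger"]
print(row["Year"])            # 1964

# A list of labels returns a DataFrame, in the order requested
subset = df.loc[["Moonraker", "GoldenEye"]]
print(list(subset.index))     # ['Moonraker', 'GoldenEye']
```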
Retrieve Row(s) by Index Position with ilocΒΆ
The .iloc[] accessor selects rows by their integer position (0-based), regardless of what the index labels contain. It supports single integers, lists of positions, and Python-style slicing (exclusive of the end position). Unlike .loc[], slicing with .iloc[] follows standard Python conventions.
Why this matters: Position-based access is essential when you need to grab the first N rows, sample specific positions, or iterate over a DataFrame in a predictable order. Combined with .loc[], these two accessors give you complete control over data selection: by label when you know what you are looking for, and by position when you know where it is.
bond = pd.read_csv("jamesbond.csv")
bond.head(3)
bond.loc[15]
bond.iloc[15]
bond.iloc[[15,20]]
bond.iloc[:4]
bond.iloc[4:8]
bond.iloc[20:]
bond = pd.read_csv("jamesbond.csv", index_col="Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.iloc[0]
bond.iloc[[5, 10, 15, 20]]
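The key contrast between the two accessors is their slicing convention: .loc[] includes both endpoint labels, while .iloc[] excludes the end position. A minimal sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [1962, 1964, 1971, 1979]},
    index=["Dr. No", "Goldfinger", "Diamonds Are Forever", "Moonraker"],
).sort_index()
# sorted index: Diamonds Are Forever, Dr. No, Goldfinger, Moonraker

# .loc slicing is inclusive of both endpoint labels
by_label = df.loc["Diamonds Are Forever":"Goldfinger"]
print(len(by_label))      # 3 rows

# .iloc slicing excludes the end position, like ordinary Python slices
by_position = df.iloc[0:3]
print(len(by_position))   # 3 rows (positions 0, 1, 2)
```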
The Catch-All .ix[] MethodΒΆ
The .ix[] accessor was a legacy method that attempted to handle both label-based and position-based indexing automatically. It was deprecated and ultimately removed from Pandas (in version 1.0) because its ambiguous behavior led to subtle bugs, especially when the index contained integers, making it unclear whether a number referred to a label or a position.
Recommendation: Always use .loc[] for label-based access and .iloc[] for position-based access. The explicit separation eliminates ambiguity and makes your code more maintainable.
bond = pd.read_csv("jamesbond.csv", index_col="Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["GoldenEye"]
# The .ix[] accessor is deprecated (removed in Pandas 1.0); use .loc[] or .iloc[] instead
Second Argument to .loc[], .iloc[], and .ix[] MethodsΒΆ
Both .loc[] and .iloc[] accept a second argument for column selection, creating a row-and-column intersection. With .loc[], you specify column names or label-based slices; with .iloc[], you use integer positions. This two-argument syntax lets you extract a precise rectangular subset of your DataFrame in a single operation.
Why this matters: Selecting specific rows and columns simultaneously is the foundation of creating targeted views of your data. Rather than chaining a row filter with a column selection, the two-argument form is more concise, more efficient, and clearly communicates the exact slice of data you need.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["Moonraker", "Director" : "Budget"]
bond.iloc[14, [5,3,2]]
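A self-contained sketch of the two-argument form, using a small hypothetical frame (the directors and figures are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [1962, 1979],
     "Director": ["Terence Young", "Lewis Gilbert"],
     "Budget": [7.0, 91.0]},
    index=["Dr. No", "Moonraker"],
)

# Row label + column label -> a single scalar value
print(df.loc["Moonraker", "Director"])        # Lewis Gilbert

# Row label + label slice of columns (inclusive on both ends)
print(df.loc["Moonraker", "Director":"Budget"])

# Integer position for the row, integer positions for the columns
print(df.iloc[1, [2, 0]])                     # Budget and Year of row 1
```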
Set New Values for Specific Cell or RowΒΆ
Using .loc[] on the left side of an assignment lets you modify specific cells or entire rows in place. You can update a single value with df.loc[row, column] = value or update multiple columns at once by passing a list of column names and a list of values.
Why this matters: In-place data correction is a routine part of data cleaning. You might fix a misspelled name, update an outdated value, or backfill missing financial figures. The .loc[] assignment syntax ensures you modify the original DataFrame (not a copy), avoiding the subtle SettingWithCopyWarning that plagues less precise approaches.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc['Dr. No']
bond.loc['Dr. No', 'Actor'] = 'Sir Sean Connery'
bond.loc['Dr. No', ['Box Office', 'Budget', 'Bond Actor Salary']] = [448800000, 7000000, 600000]
bond.loc['Dr. No']
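The same two assignment patterns, sketched on a small hypothetical one-row frame so the effect can be checked directly (the numbers are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Actor": ["Sean Connery"],
    "Budget": [7.0],
    "Box Office": [59.5],
}, index=["Dr. No"])

# Single cell: df.loc[row_label, column_label] = value
df.loc["Dr. No", "Actor"] = "Sir Sean Connery"

# Several columns of one row at once: list of columns, list of values
df.loc["Dr. No", ["Box Office", "Budget"]] = [448.8, 7.0]

print(df.loc["Dr. No", "Actor"])        # Sir Sean Connery
print(df.loc["Dr. No", "Box Office"])   # 448.8
```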
Set Multiple Values in DataFrameΒΆ
Combining boolean masks with .loc[] assignment lets you update multiple rows at once based on a condition. First, create a mask that identifies the target rows, then use .loc[mask, column] = new_value to apply the change. This is far more efficient than looping through rows.
Why this matters: Bulk conditional updates are common in data normalization, for example standardizing actor names, recoding categorical values, or applying business rules across an entire dataset. The vectorized mask-and-assign pattern is not only faster than iteration but also clearly expresses the intent: "for all rows matching this condition, set this column to that value."
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
mask = bond['Actor'] == 'Sean Connery'
bond[mask]['Actor']  # chained indexing: fine for reading, but never assign through it
bond.loc[mask, 'Actor'] = 'Sir Sean Connery'
bond
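A self-contained version of the mask-and-assign pattern on a small hypothetical frame, so the result can be verified at a glance:

```python
import pandas as pd

df = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger", "Moonraker"],
    "Actor": ["Sean Connery", "Sean Connery", "Roger Moore"],
}).set_index("Film")

# Boolean mask identifying the target rows
mask = df["Actor"] == "Sean Connery"

# Vectorized conditional update: no loop needed
df.loc[mask, "Actor"] = "Sir Sean Connery"

print(df["Actor"].tolist())
# ['Sir Sean Connery', 'Sir Sean Connery', 'Roger Moore']
```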
Rename Index Labels or Columns in a DataFrameΒΆ
The .rename() method accepts a dictionary mapping old names to new names, for either columns or index labels. For a complete column rename, you can also assign a new tuple or list directly to df.columns. Using inplace=True modifies the DataFrame without creating a copy.
Why this matters: Clear, consistent column names improve code readability and reduce errors when collaborating on shared datasets. Renaming is also necessary when merging datasets with different naming conventions, preparing data for APIs that expect specific field names, or making column names compatible with Python syntax (no spaces, no special characters) for use with .query() or dot notation access.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
bond.rename(columns = {'Year' : 'Release Date', 'Box Office' : 'Revenue'}, inplace = True)
bond.head(1)
bond.rename(index = {'Dr. No' : 'Doctor No',
'GoldenEye' : 'Golden Eye',
'The World is Not Enough' : 'Best Bond Movie Ever'}, inplace = True)
bond.head(1)
bond.columns = ('Year of Release', 'Actor', 'Director', 'Gross', 'Cost', 'Salary')
bond.head(3)
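The three renaming techniques above can be sketched on a small hypothetical frame (figures illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Year": [1962], "Box Office": [59.5]},
                  index=pd.Index(["Dr. No"], name="Film"))

# Dictionary mapping for columns
renamed = df.rename(columns={"Year": "Release Date", "Box Office": "Revenue"})
print(list(renamed.columns))   # ['Release Date', 'Revenue']

# Dictionary mapping for index labels
renamed = renamed.rename(index={"Dr. No": "Doctor No"})
print(list(renamed.index))     # ['Doctor No']

# Wholesale replacement: assign a list directly to .columns
renamed.columns = ["Year", "Gross"]
print(list(renamed.columns))   # ['Year', 'Gross']
```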
Delete Rows or Columns from a DataFrameΒΆ
The .drop() method removes rows or columns by label. Set axis="columns" (or axis=1) to drop columns, and use axis="index" (or axis=0, the default) to drop rows. Pass a single label or a list of labels to remove multiple items at once.
Why this matters: Removing irrelevant or redundant columns reduces the dimensionality of your dataset, which can improve model performance and reduce training time. Dropping rows is useful for removing known bad records. In feature selection workflows, you often iteratively drop columns to find the minimal set of features that still produces accurate predictions.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
# bond.drop(['Casino Royale'], inplace = True)
bond.drop(labels = ['Box Office', 'Actor', 'Bond Actor Salary'], axis = "columns", inplace = True)
bond.head(3)
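Both axes of .drop() in one self-contained sketch (small hypothetical frame; budget figures illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1962, 1964, 1979],
    "Actor": ["Sean Connery", "Sean Connery", "Roger Moore"],
    "Budget": [7.0, 3.0, 91.0],
}, index=["Dr. No", "Goldfinger", "Moonraker"])

# Drop a row by index label (axis="index" is the default)
fewer_rows = df.drop("Goldfinger")
print(len(fewer_rows))                   # 2

# Drop two columns by name
fewer_cols = df.drop(["Actor", "Budget"], axis="columns")
print(list(fewer_cols.columns))          # ['Year']
```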
Create Random SampleΒΆ
The .sample() method returns a random subset of rows or columns from a DataFrame. The n parameter specifies how many items to sample, and axis controls whether you sample rows or columns. Each call produces a different random selection unless you set a random_state for reproducibility.
Why this matters: Random sampling is fundamental to machine learning, from creating train/test splits to bootstrapping for statistical inference. Sampling columns can be useful for feature bagging in ensemble methods. In exploratory analysis, sampling a few rows from a large dataset gives you a quick sense of the data without loading everything into view.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
bond.sample(n=3, axis = 'columns')
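Row sampling and the random_state parameter mentioned above, sketched on a small synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({"Year": range(1962, 1972), "Rank": range(10)})

# random_state makes the draw reproducible across runs
rows = df.sample(n=3, random_state=42)
print(len(rows))          # 3

# axis="columns" samples columns instead of rows
cols = df.sample(n=1, axis="columns", random_state=42)
print(cols.shape)         # (10, 1)
```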
The .nsmallest() and .nlargest() MethodsΒΆ
The .nlargest(n, columns) and .nsmallest(n, columns) methods return the top or bottom N rows based on values in a specified column. They are more efficient than sorting the entire DataFrame when you only need a small number of extreme values.
Why this matters: Finding top-N or bottom-N records is one of the most common analytical queries: highest revenue films, lowest-paid employees, most expensive transactions. These methods are optimized to perform a partial sort, making them significantly faster than a full .sort_values() followed by .head() on large datasets.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
bond.sort_values('Box Office', ascending = False)
bond.head(3)
bond.nlargest(3, columns='Box Office')
bond.nsmallest(3, columns='Box Office')
bond.nlargest(3, columns='Budget')
bond.nsmallest(3, columns='Bond Actor Salary')
bond['Box Office'].nlargest(8)
bond['Year'].nsmallest(2)
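A self-contained sketch with a small hypothetical frame (the box-office figures are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Film": ["Dr. No", "Goldfinger", "Moonraker", "Skyfall"],
    "Box Office": [59.5, 124.9, 210.3, 1108.6],
}).set_index("Film")

# Top 2 by Box Office, no full sort required
top2 = df.nlargest(2, columns="Box Office")
print(list(top2.index))      # ['Skyfall', 'Moonraker']

# Bottom 1 by Box Office
bottom1 = df.nsmallest(1, columns="Box Office")
print(list(bottom1.index))   # ['Dr. No']
```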
Filtering with the where MethodΒΆ
The .where() method applies a boolean condition to the DataFrame but, unlike bracket-based filtering, it preserves the original shape by filling non-matching rows with NaN instead of removing them. This is useful when you need to maintain index alignment across filtered and unfiltered views.
Why this matters: Shape-preserving filters are essential in time-series and panel data analysis where positional alignment matters. The .where() method is also the Pandas equivalent of SQLβs CASE WHEN for conditional value replacement, making it a versatile tool for creating conditional columns or masked views of your data.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
mask = bond['Actor'] == 'Sean Connery'
bond[mask]
bond.where(mask)
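The shape-preserving behavior, plus the `other` argument for CASE WHEN-style replacement, on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Actor": ["Sean Connery", "Roger Moore"],
    "Year": [1962, 1979],
}, index=["Dr. No", "Moonraker"])

mask = df["Actor"] == "Sean Connery"

# Shape is preserved: non-matching rows become NaN...
masked = df.where(mask)
print(len(masked))                        # 2 (same as the original)
print(masked.loc["Moonraker", "Actor"])   # nan

# ...or are replaced via the `other` argument (like SQL's CASE WHEN)
filled = df.where(mask, other="N/A")
print(filled.loc["Moonraker", "Actor"])   # N/A
```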
The .query() MethodΒΆ
The .query() method lets you filter a DataFrame using a string expression that resembles SQL syntax. Column names containing spaces must either be renamed (underscores are a common choice) or wrapped in backticks inside the expression (supported in newer versions of Pandas). You can use standard comparison operators, logical operators (and, or, not), and even the in keyword for membership tests.
Why this matters: For complex filtering conditions, .query() produces more readable code than chaining boolean masks with & and |. It is especially useful when filter criteria are dynamic: you can construct query strings programmatically from user input or configuration files. The SQL-like syntax also makes it more accessible to team members with a database background.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
bond.columns
bond.columns = [column_name.replace(" ", "_") for column_name in bond.columns]
bond.head(1)
bond.query('Actor == "Sean Connery"')
bond.query('Director == "Terence Young"')
bond.query("Actor != 'Roger Moore'")
bond.query('Box_Office > 600')
bond.query("Actor == 'Roger Moore' or Director == 'John Glen'")
bond.query("Actor in ('Timothy Dalton', 'George Lazenby')")
bond.query("Actor not in ('Sean Connery', 'Roger Moore')")
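Both ways of handling spaced column names, sketched on a small hypothetical frame (figures illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Actor": ["Sean Connery", "Roger Moore", "Timothy Dalton"],
    "Box Office": [59.5, 210.3, 191.3],
})

# Option 1: rename with underscores, then use the name directly
df2 = df.rename(columns={"Box Office": "Box_Office"})
print(len(df2.query("Box_Office > 100")))      # 2

# Option 2 (newer pandas): wrap spaced names in backticks
print(len(df.query("`Box Office` > 100 and Actor != 'Roger Moore'")))  # 1
```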
A Review of the .apply() Method on Single ColumnsΒΆ
The .apply() method runs a function on every element in a Series (or every row/column in a DataFrame). You pass a regular function or a lambda, and Pandas applies it element-wise, returning a new Series with the transformed values. This is the go-to approach when vectorized operations are not available for your transformation.
Why this matters: Custom transformations are the bridge between raw data and features your model can use. While vectorized operations are faster, .apply() handles cases that require complex logic, such as string formatting, conditional mapping, or calling external APIs per row. Understanding when to use .apply() versus vectorized methods is key to writing performant Pandas code.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
def convert_to_string_and_add_millions(number):
    return str(number) + " MILLIONS!"
bond["Box Office"] = bond["Box Office"].apply(convert_to_string_and_add_millions)
bond["Budget"] = bond["Budget"].apply(convert_to_string_and_add_millions)
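Element-wise apply on a Series and row-wise apply on a whole DataFrame, in one self-contained sketch (small hypothetical frame; the label function is an illustrative example, not part of the course):

```python
import pandas as pd

df = pd.DataFrame({
    "Actor": ["Sean Connery", "Roger Moore"],
    "Budget": [7.0, 91.0],
}, index=["Dr. No", "Moonraker"])

# Element-wise on a single column
df["Budget"] = df["Budget"].apply(lambda n: str(n) + " MILLIONS!")
print(df.loc["Dr. No", "Budget"])     # 7.0 MILLIONS!

# Row-wise on the whole DataFrame with axis="columns":
# the function receives one row (a Series) at a time
def label(row):
    return "classic" if row["Actor"] == "Sean Connery" else "later era"

df["Era"] = df.apply(label, axis="columns")
print(df["Era"].tolist())             # ['classic', 'later era']
```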
The .copy() MethodΒΆ
When you assign a column to a new variable (e.g., directors = bond["Director"]), Pandas may return a view rather than an independent copy. Changes to the view can propagate back to the original DataFrame, leading to unexpected side effects. The .copy() method creates a fully independent duplicate, breaking the link between the original and the new object.
Why this matters: Unintended mutation of source data is one of the most common bugs in data analysis code. Using .copy() when you intend to modify a subset ensures your original DataFrame remains untouched, which is critical in pipelines where the same source data feeds multiple downstream transformations or when you need to compare before-and-after states.
bond = pd.read_csv('jamesbond.csv', index_col = 'Film')
bond.sort_index(inplace = True)
bond.head(3)
directors = bond["Director"]
directors.head(3)
directors["A View to a Kill"]
directors.head(3)
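To show .copy() actually breaking the link, here is a self-contained sketch on a small hypothetical frame: we mutate the copy and confirm the original is untouched.

```python
import pandas as pd

df = pd.DataFrame({"Director": ["Terence Young", "John Glen"]},
                  index=["Dr. No", "A View to a Kill"])

# .copy() produces an independent object; mutating it leaves df alone
directors = df["Director"].copy()
directors["A View to a Kill"] = "Mister John Glen"

print(directors["A View to a Kill"])              # Mister John Glen
print(df.loc["A View to a Kill", "Director"])     # John Glen  (unchanged)
```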