Run this notebook: Open in Colab Open in Kaggle

Common Dataframe operations¶

Selecting, or subset selections of data like columns and rows are very common when working with datasets that might have data that is not interesting or needs to be dropped.

# Load your dataframe
import pandas as pd
csv_url = "https://raw.githubusercontent.com/paiml/wine-ratings/main/wine-ratings.csv"
df = pd.read_csv(csv_url, index_col=0)
df.head()

DataFrame columns¶

Selecting columns is the most fundamental DataFrame operation. Single-bracket notation (df["col"]) returns a Series, while double-bracket notation (df[["col1", "col2"]]) returns a new DataFrame. Column selection is the first step in feature engineering – isolating the variables relevant to your analysis or model.

wine_notes = df["notes"]
wine_notes.head()

named_ratings = df[["name", "rating"]]
named_ratings.head()

Specific DataFrame rows¶

Row filtering uses boolean conditions to select subsets of data. A comparison expression like df["rating"] > 96 produces a boolean Series, which when passed into bracket notation returns only the matching rows. This is the Pandas equivalent of a SQL WHERE clause and is used extensively in data exploration, outlier detection, and feature selection.

# filter rows based on a condition
top_named_ratings = named_ratings[df["rating"] > 96]
top_named_ratings.head()

# the condition can be used on its own
df['rating'] > 96

Using queries¶

The .query() method provides a string-based filtering syntax that is often more readable than boolean indexing, especially for compound conditions. You can combine conditions with & (AND), chain queries for complex multi-step filters, and use the engine='python' parameter to enable string method operations like .str.contains() within query expressions.

df.query("rating > 93").head(10)

# add more combined queries
df.query("rating > 94 & region == 'Ribera del Duero, Spain'").head(10)

# chain queries to use the Python engine to operate on strings
top_wines = df.query("rating > 95")
# na=False is needed because we have some NaN region descriptions!
paso_robles = top_wines.query("region.str.contains('Robles', na=False)", engine='python')