import pandas as pd
df = pd.read_csv('employees.csv', parse_dates = ["Start Date", "Last Login Time"])
df.head(3)
df.head(10)
df.info()
df['Last Login Time'].head(3)
df['Last Login Time'] = pd.to_datetime(df['Last Login Time'])
df['Senior Management'] = df["Senior Management"].astype("bool")
df['Gender'] = df['Gender'].astype("category")
Filter a DataFrame Based on a Condition
Boolean filtering is the primary mechanism for subsetting data in Pandas. A comparison like df["Gender"] == "Male" produces a boolean Series (True/False for each row), which can then be passed into bracket notation to select only the matching rows. You can filter on equality, inequality, greater/less than, and date comparisons.
Why this matters: Filtering is how you segment data for analysis: selecting customers in a specific region, transactions above a threshold, or employees hired before a certain date. In ML workflows, conditional filtering is used to create training/test splits based on time, remove outliers, and isolate specific subpopulations for targeted modeling.
df["Gender"] == "Male"
df[df["Gender"] == "Male"]
df["Team"]
mask = df["Team"] == "Finance"
df[mask]
df[df["Senior Management"] == True]
df[df["Team"] != "Marketing"]
mask = df["Team"] != "Marketing"
df[mask]
df[df["Salary"] > 110000]
df[df["Bonus %"] < 1.5]
mask = df["Start Date"] <= "1985-01-01"
df[mask]
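A boolean mask is itself a Series, so it can be inspected before you filter with it. A minimal sketch using toy data standing in for employees.csv (the values are hypothetical):

```python
import pandas as pd

# Toy stand-in for employees.csv
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Salary": [120000, 95000, 130000],
})

mask = df["Salary"] > 110000
print(mask.sum())   # counts the True values: how many rows pass the filter
print(df[mask])     # the matching rows
```

Counting the mask first (`mask.sum()`) is a quick sanity check that a filter matches roughly as many rows as you expect before you use it.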
Filter with More than One Condition (AND)
To combine multiple conditions with AND logic, use the & operator between boolean masks. Each individual condition must be wrapped in parentheses. The result includes only rows where all conditions are simultaneously true.
Why this matters: Real-world data queries almost always involve multiple criteria. For example, finding male employees in the Marketing department requires combining two filters. The mask variable pattern (mask1 = ...; mask2 = ...; df[mask1 & mask2]) keeps complex filters readable and debuggable, which is essential when building data pipelines that other team members need to maintain.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)
mask1 = df["Gender"] == "Male"
mask2 = df["Team"] == "Marketing"
df[mask1 & mask2]
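The same filter can be written inline, which makes the parentheses requirement visible: `&` binds more tightly than `==`, so each comparison must be wrapped. A sketch with hypothetical data:

```python
import pandas as pd

# Toy stand-in for employees.csv
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "Team": ["Marketing", "Finance", "Marketing"],
})

# Without the parentheses, pandas would try to evaluate
# 'Male' & df["Team"] first and raise an error.
result = df[(df["Gender"] == "Male") & (df["Team"] == "Marketing")]
print(result)
```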
Filter with More than One Condition (OR)
The | operator combines conditions with OR logic, selecting rows where at least one condition is true. You can mix AND and OR using parentheses for grouping, and for checking membership in a list of values, the .isin() method provides a cleaner alternative to chaining multiple OR conditions.
Why this matters: OR filtering is common when you need to select multiple categories at once, such as pulling records for several departments or product lines. The .isin() method is especially valuable when the list of values comes from another data source or user input, making your filtering logic dynamic and scalable.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)
mask1 = df["Senior Management"]
mask2 = df["Start Date"] < "1990-01-01"
df[mask1 | mask2]
mask1 = df["First Name"] == "Robert"
mask2 = df["Team"] == "Client Services"
mask3 = df["Start Date"] > "2016-06-01"
df[(mask1 & mask2) | mask3]
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)
mask1 = df["Team"] == "Legal"
mask2 = df["Team"] == "Sales"
mask3 = df["Team"] == "Products"
df[mask1 | mask2 | mask3]
mask = df["Team"].isin(["Legal", "Product", "Sales", "Marketing"])
df[mask]
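Because `.isin()` takes any list-like, the membership list can be built at runtime, which is what makes it more scalable than chained `|` conditions. A sketch with a hypothetical team list:

```python
import pandas as pd

# Toy stand-in for employees.csv
df = pd.DataFrame({"Team": ["Legal", "Sales", "Finance", "Products"]})

# The list could come from another query, a config file, or user input;
# the filtering code itself never changes.
wanted = ["Legal", "Sales"]
subset = df[df["Team"].isin(wanted)]
print(subset)
```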
The .isnull() and .notnull() Methods
The .isnull() method returns True for every cell that contains a NaN (missing) value, while .notnull() returns True for cells that have valid data. When used as masks, these methods let you isolate rows with missing or complete data in specific columns.
Why this matters: Understanding where your data has gaps is a critical step in data quality assessment. Before training a model, you need to decide how to handle missing values, and the first step is finding them. Filtering with .isnull() helps you inspect the problematic rows, while .notnull() lets you proceed with only complete records for analyses that cannot tolerate missing values.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)
mask = df["Team"].isnull()
df[mask]
condition = df["Gender"].notnull()
df[condition]
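The same masks also support a quick gap audit: summing `.isnull()` over the whole frame counts missing values per column. A sketch with toy data containing deliberate gaps:

```python
import pandas as pd
import numpy as np

# Toy stand-in for employees.csv, with NaN gaps inserted
df = pd.DataFrame({
    "Team": ["Legal", np.nan, "Sales", np.nan],
    "Gender": ["Male", "Female", np.nan, "Female"],
})

# True/False sums as 1/0, so this counts the missing cells per column
print(df.isnull().sum())

# Keep only rows where Team has valid data
print(df[df["Team"].notnull()])
```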
The .between() Method
The .between(low, high) method provides a concise way to filter values within a range, inclusive of both endpoints by default. It works with numeric, datetime, and string columns, replacing the need for compound conditions like (df["col"] >= low) & (df["col"] <= high).
Why this matters: Range-based filtering is ubiquitous in data analysis β selecting transactions within a date window, employees in a salary band, or sensor readings within normal operating parameters. The .between() method makes these queries more readable and less error-prone than manually combining two comparison operators.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)
df[df["Salary"].between(60000,70000)]
df[df["Bonus %"].between(2.0, 5.0)]
df[df["Start Date"].between("1991-01-01", "1992-01-01")]
df[df["Last Login Time"].between("08:30 AM", "12:00 PM")]
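The inclusive-endpoints behavior is worth seeing directly. In recent pandas versions (1.3 and later) the `inclusive` parameter accepts `"both"` (the default), `"neither"`, `"left"`, or `"right"`. A minimal sketch on a toy salary Series:

```python
import pandas as pd

s = pd.Series([60000, 65000, 70000, 75000])

# Both endpoints included by default: 60000 and 70000 pass
print(s.between(60000, 70000))

# Exclude both endpoints: only 65000 passes
print(s.between(60000, 70000, inclusive="neither"))
```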
The .duplicated() Method
The .duplicated() method returns a boolean Series indicating whether each value (or row) is a duplicate. The keep parameter controls which occurrence is marked: keep="first" marks all but the first, keep="last" marks all but the last, and keep=False marks all duplicates including the original. Using the ~ operator inverts the mask to find unique values instead.
Why this matters: Duplicate detection is a core data cleaning task. Duplicate records can skew statistical analyses, inflate counts, and cause models to overfit on repeated observations. Identifying duplicates before analysis ensures that your insights reflect the true underlying patterns in the data.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.sort_values("First Name", inplace=True)
df.head(3)
mask = ~df["First Name"].duplicated(keep=False)
df[mask]
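The three `keep` options are easiest to compare side by side on a small Series. A sketch with hypothetical names:

```python
import pandas as pd

s = pd.Series(["Aaron", "Aaron", "Betty", "Aaron"])

print(s.duplicated(keep="first"))  # marks 2nd and later occurrences
print(s.duplicated(keep="last"))   # marks all but the last occurrence
print(s.duplicated(keep=False))    # marks every occurrence of a repeated value

# ~ inverts the mask: True only for values that appear exactly once
print(~s.duplicated(keep=False))
```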
The .drop_duplicates() Method
While .duplicated() identifies duplicates, .drop_duplicates() removes them in a single step. The subset parameter specifies which columns to consider when identifying duplicates, and keep determines which occurrence to retain. Using keep=False removes all duplicated rows entirely.
Why this matters: In ETL pipelines and data preparation workflows, deduplication is often an automated step applied before data enters a data warehouse or model training pipeline. The subset parameter is crucial: two employees might share a first name but differ in team, so you need to choose the right combination of columns to define what constitutes a true duplicate.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.sort_values("First Name", inplace = True)
df.head(3)
len(df)
df.drop_duplicates(subset = ["First Name"], keep = False)
df.drop_duplicates(["First Name", "Team"])
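The effect of `subset` is clearest on a toy frame where a name repeats but the name/team combination does not (the rows below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Douglas", "Douglas", "Maria"],
    "Team":       ["Marketing", "Legal", "Finance"],
})

# subset=["First Name"]: the two Douglas rows collide;
# keep="first" (the default) retains the first of them
print(df.drop_duplicates(subset=["First Name"]))

# Considering both columns, no row repeats, so nothing is dropped
print(df.drop_duplicates(subset=["First Name", "Team"]))
```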
The .unique() and .nunique() Methods
The .unique() method returns an array of all distinct values in a column, while .nunique() returns just the count of distinct values. The dropna parameter in .nunique() controls whether NaN is counted as a unique value.
Why this matters: Knowing the cardinality (number of unique values) of a column informs critical modeling decisions. Low-cardinality columns like "Gender" are good candidates for one-hot encoding, while high-cardinality columns like "Name" may need different treatment. The .unique() output also helps you spot data quality issues such as inconsistent spellings or unexpected categories.
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)
df["Gender"].unique()
df["Team"].unique()
len(df["Team"].unique())
df["Team"].nunique(dropna = False)
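The `dropna` behavior is easy to verify on a small Series with a missing value (toy data, not employees.csv):

```python
import pandas as pd
import numpy as np

s = pd.Series(["Legal", "Sales", np.nan, "Legal"])

print(s.unique())               # distinct values, NaN included in the array
print(s.nunique())              # 2, NaN excluded by default
print(s.nunique(dropna=False))  # 3, NaN counted as a distinct value
```

This explains why `len(df["Team"].unique())` and `df["Team"].nunique()` can disagree by one when the column contains missing values.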