This notebook was prepared by Donne Martin. Source and license info is on GitHub.

Pandas

Credits: The following are notes taken while working through Python for Data Analysis by Wes McKinney

  • Series

  • DataFrame

  • Reindexing

  • Dropping Entries

  • Indexing, Selecting, Filtering

  • Arithmetic and Data Alignment

  • Function Application and Mapping

  • Sorting and Ranking

  • Axis Indices with Duplicate Values

  • Summarizing and Computing Descriptive Statistics

  • Cleaning Data

  • Input and Output

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

Series

A Series is a one-dimensional labeled array built on top of a NumPy ndarray. Each element has a corresponding label (the index), which makes it behave like an ordered dictionary that also supports array-style operations. Series are the building blocks of DataFrames – every column in a DataFrame is a Series.

Understanding Series matters because almost every data manipulation in pandas – filtering, grouping, merging – ultimately operates on Series objects. You can create a Series from a list, a NumPy array, or a Python dictionary. The index can be integers (default), strings, dates, or any hashable type, which enables powerful label-based lookups that go far beyond positional indexing.

Creating a Series from a list

Passing a Python list to Series() creates an array with a default integer index starting at 0. The resulting Series supports vectorized operations, boolean filtering, and aggregation – all without explicit loops.

ser_1 = Series([1, 1, 2, -3, -5, 8, 13])
ser_1

Access the underlying NumPy array with .values. This is useful when you need to pass pandas data into libraries that expect raw arrays, such as scikit-learn estimators or NumPy math functions.

ser_1.values

Index objects are immutable and hold axis labels plus metadata like name. Immutability ensures that index labels cannot be accidentally changed, which is critical for maintaining data integrity when multiple Series or DataFrames share the same index. Below, .index returns a RangeIndex because no custom labels were specified.

ser_1.index

Custom string indices make data self-documenting. Instead of remembering that position 3 holds a specific metric, you can use a meaningful label like 'revenue' or 'temperature'. This is especially valuable when aligning data from different sources, since pandas will match on index labels automatically.

ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2

Series support both positional (integer) and label-based indexing. Both ser_2[4] and ser_2['e'] return the same element. This dual access pattern is convenient, but be careful with integer indices on a Series that also has integer labels – use .iloc[] for position and .loc[] for labels to avoid ambiguity.

ser_2[4] == ser_2['e']
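A self-contained sketch of the unambiguous accessors mentioned above (variable names are illustrative):

```python
from pandas import Series

ser = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])

# .iloc is strictly positional, .loc is strictly label-based
by_position = ser.iloc[4]
by_label = ser.loc['e']
assert by_position == by_label == -5
```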

Passing a list of index labels returns a sub-Series. This is the pandas equivalent of SQL’s SELECT – you pick exactly which labeled elements you need in the order you want them.

ser_2[['c', 'a', 'b']]

Boolean filtering is one of the most powerful patterns in pandas. The expression ser_2 > 0 produces a boolean Series, and using it as an index selects only the elements where the condition is True. This is analogous to a SQL WHERE clause and is the primary way to filter data in pandas workflows.

ser_2[ser_2 > 0]

Arithmetic on a Series is element-wise and vectorized. Multiplying by a scalar applies the operation to every element in a single optimized call, with no Python-level loop overhead.

ser_2 * 2

NumPy universal functions (ufuncs) like np.exp(), np.log(), and np.sqrt() work directly on Series objects, preserving the index. This interoperability means you can seamlessly move between the pandas and NumPy ecosystems – for example, applying mathematical transformations to feature columns before feeding them to a model.

import numpy as np
np.exp(ser_2)

Creating a Series from a dictionary

A Series behaves like a fixed-length, ordered dictionary – dictionary keys become index labels and values become the data. This makes it easy to convert lookup tables, configuration parameters, or API responses into Series for further analysis.

dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = Series(dict_1)
ser_3

Passing an explicit index parameter re-orders the Series and introduces NaN for any labels not found in the original dictionary. This is pandas’ way of handling missing data gracefully – rather than raising an error, it marks the gap with a sentinel value that propagates through computations.

index = ['foo', 'bar', 'baz', 'qux']
ser_4 = Series(dict_1, index=index)
ser_4

Detecting missing values is a fundamental step in any data cleaning pipeline. pd.isnull() is a top-level function that returns a boolean mask showing where NaN values exist. You can use this mask to count, filter, or fill missing entries.

pd.isnull(ser_4)

The .isnull() instance method on a Series is equivalent to pd.isnull(ser). Both return the same boolean mask. The instance method is often preferred for chaining, e.g., ser.isnull().sum() to count missing values.

ser_4.isnull()
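A quick self-contained sketch of the chained count (names are illustrative):

```python
from pandas import Series

ser_missing = Series({'foo': 100, 'bar': 200, 'baz': 300},
                     index=['foo', 'bar', 'baz', 'qux'])

# 'qux' has no source value, so it becomes NaN; chaining counts the gaps
n_missing = ser_missing.isnull().sum()
assert n_missing == 1
```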

Automatic index alignment in arithmetic

When you add two Series with different indices, pandas performs a union join on the labels. Only matching labels produce a computed result; unmatched labels yield NaN. This automatic alignment prevents silent mismatches that could corrupt your analysis, which is one of pandas’ key advantages over raw NumPy arrays.

ser_3 + ser_4

The .name attribute gives a Series a label that carries through operations and appears as the column header when the Series is placed inside a DataFrame.

ser_4.name = 'foobarbazqux'

The .index.name attribute labels the index axis, which is especially useful when exporting data or creating pivot tables where you need to distinguish between the index and data columns.

ser_4.index.name = 'label'
ser_4

You can rename index labels in place by assigning a new list. The list length must match the number of elements. This is a common step when cleaning messy or abbreviated labels from external data sources.

ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4

DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. DataFrames have both row and column indices and are analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy. To obtain a copy, use the Series' copy() method.

Create a DataFrame:

data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1

You can control column order by passing a columns list. Pandas will arrange the DataFrame columns in the specified sequence, which is helpful when preparing data for export or display.

df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2

Including a column name that does not exist in the source dictionary causes pandas to fill that column with NaN. This behavior mirrors how SQL outer joins produce NULL for missing matches and is key to understanding how pandas handles schema mismatches.

df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3

Bracket notation df['column'] returns the column as a Series. This is the standard way to select a single column and is preferred over attribute access because it works for all column names, including those with spaces or names that conflict with DataFrame methods.

df_3['state']

Dot notation (df.year) is a convenient shorthand that works when the column name is a valid Python identifier and does not shadow a DataFrame method. For production code, bracket notation is safer.

df_3.year

Retrieve a row by integer position with .iloc[]. (Older editions of these notes used the .ix accessor, which has been removed from modern pandas – use .iloc[] for integer-position-based access and .loc[] for label-based access.)

df_3.iloc[0]

Assigning a list, array, or scalar to a column updates every row in that column. The length of the assigned array must match the number of rows. This is commonly used when computing derived features like ratios, log-transforms, or encoded categories.

df_3['unempl'] = np.arange(5)
df_3

When assigning a Series to a column, pandas aligns on the index. Rows whose index does not appear in the Series receive NaN. This is different from assigning a list or array, which must match the DataFrame length exactly. This index-alignment behavior is powerful for merging partial updates into a larger dataset.

unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3

Assigning to a column name that does not yet exist creates that column. This is a quick way to add computed features during exploratory analysis – no need to pre-declare the schema.

df_3['state_dup'] = df_3['state']
df_3

The del statement removes a column in place. For a non-mutating alternative, use df.drop('col', axis=1) which returns a new DataFrame and leaves the original unchanged.

del df_3['state_dup']
df_3
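A minimal sketch contrasting del with the non-mutating drop() (toy data for illustration):

```python
from pandas import DataFrame

df_tmp = DataFrame({'state': ['VA', 'MD'], 'pop': [5.0, 4.0]})

# drop() returns a new DataFrame; the original keeps its columns
df_no_pop = df_tmp.drop('pop', axis=1)
assert list(df_no_pop.columns) == ['state']
assert 'pop' in df_tmp.columns  # original unchanged
```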

Creating a DataFrame from nested dictionaries

When you pass a dict of dicts, the outer keys become column names and the inner keys become the row index. Pandas takes the union of all inner keys, filling NaN where a key is missing. This mirrors how JSON-structured API responses are converted into tabular form.

pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = DataFrame(pop)
df_4

The .T property transposes the DataFrame, swapping rows and columns. This is useful for quick visual inspection when you have many columns but few rows, or when an algorithm expects features as rows rather than columns.

df_4.T

You can also build a DataFrame from a dict of Series objects. Pandas will align the Series by their indices, inserting NaN where indices do not overlap – the same alignment logic that makes pandas arithmetic safe.

data_2 = {'VA' : df_4['VA'][1:],
          'MD' : df_4['MD'][2:]}
df_5 = DataFrame(data_2)
df_5

Naming the index with .index.name adds a descriptive label to the row axis. This label appears in printed output and is preserved when writing to CSV or SQL, making the exported data self-documenting.

df_5.index.name = 'year'
df_5

Similarly, .columns.name labels the column axis. When both index.name and columns.name are set, the DataFrame has fully labeled axes, which is helpful for understanding pivot tables and multi-level indices.

df_5.columns.name = 'state'
df_5

The .values property extracts the DataFrame’s data as a 2-D NumPy ndarray. This is the bridge between pandas and any library that works with raw arrays, including scikit-learn’s fit() and predict() methods.

df_5.values

When columns have mixed dtypes (e.g., strings and numbers), the resulting NumPy array uses the object dtype to accommodate all types. This is less memory-efficient, so in practice you should select only numeric columns before converting to arrays for numerical computation.

df_3.values
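One way to keep only the numeric columns before converting is select_dtypes(); a small sketch with toy data:

```python
import numpy as np
from pandas import DataFrame

df_mixed = DataFrame({'state': ['VA', 'MD'],
                      'pop': [5.0, 4.0],
                      'year': [2014, 2015]})

# restrict to numeric columns so .values yields a numeric array, not object
numeric = df_mixed.select_dtypes(include=[np.number])
assert numeric.values.dtype != object
assert list(numeric.columns) == ['pop', 'year']
```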

Reindexing

Reindexing creates a new DataFrame (or Series) whose rows and columns conform to a new set of labels. Any labels in the new index that were not in the original are filled with NaN by default. This is a core operation for aligning datasets that were collected at different time points or with different schemas.

Create a new object with the data conformed to a new index. Any missing values are set to NaN.

df_3

Calling reindex() with a new list of row labels returns a new DataFrame with rows rearranged (or added) to match. Rows that existed in the original but are absent from the new index are dropped; new labels receive NaN.

df_3.reindex(list(reversed(range(0, 6))))

The fill_value parameter lets you substitute a custom constant (such as 0) instead of NaN for missing entries. This is useful when you know that absence of data implies a specific default.

df_3.reindex(range(6), fill_value=0)

For ordered data like time series, the method parameter supports forward-fill ('ffill') and back-fill ('bfill'). Forward-fill propagates the last observed value into gaps, which is a standard approach for handling missing timestamps in financial or sensor data.

ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
ser_5.reindex(range(5), method='ffill')
ser_5.reindex(range(5), method='bfill')

You can also reindex columns by passing the columns parameter. This reorders, adds, or drops columns to match the specified list – useful when you need to enforce a consistent schema across multiple DataFrames before concatenation.

df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])

Reindexing both rows and columns simultaneously, with a fill_value for new row entries, gives you full control over the DataFrame’s shape and default values in one operation.

df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])

Reindexing rows and columns can also be written as a single expression. (Older editions did this with the .ix accessor, which has been removed from modern pandas – prefer reindex(), .loc[], or .iloc[] instead.)

df_6 = df_3.reindex(index=range(7), columns=['state', 'pop', 'unempl', 'year'])
df_6

Dropping Entries

Removing rows or columns is one of the most common data cleaning operations. The drop() method returns a new object with the specified labels removed, leaving the original untouched (unless inplace=True). In machine learning pipelines, you frequently drop irrelevant features or rows with too many missing values before training.

Drop rows by passing their index labels. By default, drop() operates on axis 0 (rows). The returned DataFrame excludes the specified rows while keeping all columns intact.

df_7 = df_6.drop([0, 1])
df_7

To drop columns, pass axis=1. This is the standard way to remove features you have determined to be irrelevant or redundant during exploratory data analysis.

df_7 = df_7.drop('unempl', axis=1)
df_7

Indexing, Selecting, and Filtering

Selecting subsets of data is arguably the most frequent operation in any data analysis workflow. Pandas offers multiple approaches: bracket notation for columns and boolean masks, .loc[] for label-based selection, and .iloc[] for integer-position selection. Mastering these patterns is essential for efficient data wrangling.

Series indexing mirrors NumPy array indexing but adds label-based access. You can select by integer position, by label, by boolean mask, or by a list of labels – all with the same bracket syntax.

ser_2

Selecting a single value by integer position or by label. Both return the same scalar when the integer index and the label map to the same element.

ser_2[0] == ser_2['a']

Slicing a Series by integer position works just like slicing a Python list. The start is inclusive and the stop is exclusive.

ser_2[1:4]

Passing a list of labels selects exactly those elements, in the order given. This is useful for reordering or subsetting a Series to match a specific requirement.

ser_2[['b', 'c', 'd']]

Boolean filtering returns only the elements where the condition evaluates to True. Combine multiple conditions with & (and), | (or), and ~ (not), wrapping each condition in parentheses.

ser_2[ser_2 > 0]
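A sketch of combined conditions (self-contained; names are illustrative):

```python
from pandas import Series

ser = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])

# wrap each condition in parentheses; & and | bind tighter than comparisons
selected = ser[(ser > 0) & (ser < 2)]
assert list(selected.index) == ['a', 'b']

# ~ negates a boolean mask
non_negative = ser[~(ser < 0)]
assert list(non_negative.index) == ['a', 'b', 'c']
```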

Label-based slicing with string labels is inclusive on both ends – 'a':'b' includes both 'a' and 'b'. This differs from integer-based slicing where the stop is exclusive, a common source of confusion for new pandas users.

ser_2['a':'b']

You can assign to a slice, and the values are broadcast to all selected elements. The label-based slice is again inclusive on both ends.

ser_2['a':'b'] = 0
ser_2

DataFrame indexing supports column selection, row slicing, and boolean filtering. The behavior depends on what you pass inside the brackets – a string or list of strings selects columns, while a slice or boolean array selects rows.

df_6

Passing a list of column names returns a DataFrame with only those columns. This is how you select your feature matrix X and target vector y when preparing data for a model.

df_6[['pop', 'unempl']]

A numeric slice inside brackets selects rows by integer position, similar to list slicing. The stop index is exclusive.

df_6[:2]

A boolean Series (produced by a comparison) selects only the rows where the condition is True. This is the DataFrame equivalent of SQL WHERE.

df_6[df_6['pop'] > 5]

Applying a scalar comparison to an entire DataFrame returns a boolean DataFrame of the same shape. Every cell shows whether that element passed the test.

df_6 > 5

Using a boolean DataFrame as a mask replaces elements that fail the test with NaN, keeping the shape intact. This is useful for examining outliers while preserving the original index structure.

df_6[df_6 > 5]

Label-based slicing with .loc[] selects rows inclusively on both ends, consistent with label-based slicing behavior. (Older editions used the deprecated .ix accessor here, which has been removed from modern pandas.)

df_6.loc[2:3]

You can combine row slicing with column selection in a single .loc[] call to extract a specific sub-region of the DataFrame.

df_6.loc[0:2, 'pop']

Combining boolean conditions with .loc[] lets you filter rows based on any column's values. This pattern is the pandas equivalent of a parameterized SQL query.

df_6.loc[df_6.unempl > 5.0]

Arithmetic and Data Alignment

Pandas arithmetic always aligns operands by their index and column labels before computing. When labels do not match, the result contains NaN for the unmatched positions. This behavior is deliberate: it prevents you from accidentally adding row 5 of one dataset to row 5 of another when they represent different entities. Understanding alignment is critical for merging data from multiple sources correctly.

Adding two Series with partially overlapping indices produces results only for matched labels. Unmatched labels from either side become NaN in the output – a union-style join.

np.random.seed(0)
ser_6 = Series(np.random.randn(5),
               index=['a', 'b', 'c', 'd', 'e'])
ser_6
np.random.seed(1)
ser_7 = Series(np.random.randn(5),
               index=['a', 'c', 'e', 'f', 'g'])
ser_7
ser_6 + ser_7

The .add() method with fill_value=0 treats missing entries as zero before performing addition, so unmatched labels contribute their actual value instead of producing NaN. Similar fill methods exist for .sub(), .mul(), and .div().

ser_6.add(ser_7, fill_value=0)
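The same fill_value idea applies to the other arithmetic methods; a small self-contained sketch with .sub():

```python
from pandas import Series

s1 = Series([1.0, 2.0], index=['a', 'b'])
s2 = Series([10.0, 20.0], index=['b', 'c'])

# absent labels are treated as 0 before subtracting, so no NaN appears
diff = s1.sub(s2, fill_value=0)
assert diff['a'] == 1.0    # 1.0 - 0
assert diff['b'] == -8.0   # 2.0 - 10.0
assert diff['c'] == -20.0  # 0 - 20.0
```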

DataFrame arithmetic works the same way but aligns on both row and column labels simultaneously. Only cells where both the row index and column name match produce a computed result; all other cells are NaN.

np.random.seed(0)
df_8 = DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['a', 'b', 'c'])
df_8
np.random.seed(1)
df_9 = DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['b', 'c', 'd'])
df_9
df_8 + df_9

Using .add() with fill_value=0 on DataFrames fills unmatched cells with zero before addition, producing a complete result without NaN for partial overlaps.

df_10 = df_8.add(df_9, fill_value=0)
df_10

Broadcasting between DataFrames and Series

When you subtract a Series from a DataFrame, pandas matches the Series index against the DataFrame columns and broadcasts the operation down every row. This is analogous to NumPy broadcasting but uses label alignment instead of shape rules.

ser_8 = df_10.iloc[0]
df_11 = df_10 - ser_8
df_11

When the Series index does not fully overlap with the DataFrame columns, unmatched positions produce NaN. Pandas takes the union of labels, ensuring no data is silently dropped.

ser_9 = Series(range(3), index=['a', 'd', 'e'])
ser_9
df_11 - ser_9

To broadcast along rows (matching on the row index instead of columns), use an arithmetic method like .sub() with axis=0. This pattern is common when you need to normalize each row by a row-specific value, such as subtracting a row mean.

df_10
ser_10 = Series([100, 200, 300])
ser_10
df_10.sub(ser_10, axis=0)

Function Application and Mapping

Beyond built-in arithmetic, pandas lets you apply arbitrary functions to rows, columns, or individual elements. This bridges the gap between vectorized operations (fast but limited to pre-built functions) and pure Python flexibility (slow but unlimited). The three key methods are apply() (row or column level), applymap() (element-wise on a DataFrame), and map() (element-wise on a Series).

NumPy ufuncs like np.abs() work directly on DataFrames, applying the operation to every element while preserving the index and column structure. Prefer ufuncs over apply or applymap when a vectorized version exists, since they run at C speed.

df_11 = np.abs(df_11)
df_11

apply() passes each column (default axis=0) as a Series to your function and collects the results. Here, the lambda computes the range (max minus min) of each column – a useful summary statistic for understanding feature scales before normalization.

func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)

With axis=1, apply() passes each row as a Series. This is useful for row-level computations such as calculating a composite score from multiple feature columns.

df_11.apply(func_1, axis=1)

When your function returns a Series, apply() returns a DataFrame. This lets you compute multiple summary statistics (here, min and max) for each column in a single pass.

func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)

applymap() applies a function to every single element of a DataFrame. It is the most flexible but also the slowest approach, since each element goes through a Python function call. Use it for formatting (like rounding for display) or transformations that have no vectorized equivalent. (In recent pandas, applymap() has been renamed to DataFrame.map(); the old name still works but emits a deprecation warning.)

func_3 = lambda x: '%.2f' % x
df_11.applymap(func_3)

On a Series, the equivalent element-wise method is map(). It is commonly used for value replacement (mapping categories to numbers) or applying custom formatting to a single column.

df_11['a'].map(func_3)

Sorting and Ranking

Sorting is essential for inspecting top/bottom records, preparing data for merge operations, and generating ordered reports. Pandas can sort by index labels or by column values, in ascending or descending order. Ranking assigns an ordinal position to each element, which is useful for creating percentile features or tie-breaking logic in recommendation systems.

ser_4

.sort_index() sorts a Series alphabetically (or numerically) by its index labels. This is useful after filtering or grouping operations that may disorder the index.

ser_4.sort_index()

.sort_values() sorts by the actual data values rather than the index. Missing values (NaN) are placed at the end by default.

ser_4.sort_values()
df_12 = DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12

For DataFrames, sort_index() sorts by the row index. Pass axis=1 to sort columns by their names instead.

df_12.sort_index()

Sorting columns in descending order with ascending=False is handy for quick exploratory analysis where you want the most important features or largest values first.

df_12.sort_index(axis=1, ascending=False)

sort_values(by=...) accepts a column name or list of column names. When multiple columns are provided, the sort is hierarchical – ties in the first column are broken by the second column, and so on.

df_12.sort_values(by=['d', 'c'])

Ranking assigns each value an ordinal position (1-based). By default, tied values receive the mean of the ranks they would occupy. For example, if positions 2, 3, and 4 are tied, each gets rank 3.0. Ranking is commonly used to create percentile-based features or to implement non-parametric statistical tests.

ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11
ser_11.rank()

With method='first', ties are broken by the order in which values appear in the data, so each element gets a unique rank. This is useful when you need a strict ordering with no ties.

ser_11.rank(method='first')

Combining ascending=False with method='max' gives each tied element the highest rank in its group. Different tie-breaking methods ('min', 'max', 'first', 'dense', 'average') suit different analytical needs.

ser_11.rank(ascending=False, method='max')

DataFrame ranking operates column-wise by default (each column ranked independently). With axis=1, ranking operates row-wise, comparing values across columns for each row – useful for determining which feature has the highest value per observation.

df_13 = DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                   'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                   'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13

Ranking each column independently. Each column’s values are compared only with other values in the same column.

df_13.rank()

Ranking across columns (axis=1) compares the values in each row. For row i, foo, bar, and baz are ranked relative to each other.

df_13.rank(axis=1)

Axis Indexes with Duplicate Values

While unique indices are best practice, pandas does not require them. Duplicate index labels are common in raw data – for example, multiple transactions on the same date. Selecting a duplicated label returns all matching rows as a DataFrame (or Series), rather than a single scalar.

When you create a Series with duplicate index labels, the .is_unique property returns False. Knowing this upfront prevents surprises when indexing later.

ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12
ser_12.index.is_unique

Indexing with a duplicated label returns a Series (multiple values) rather than a scalar. If the label is unique, a scalar is returned. Always check index.is_unique if your code assumes scalar returns.

ser_12['foo']

Similarly, selecting a duplicated label from a DataFrame returns multiple rows as a DataFrame rather than a single row Series.

df_14 = DataFrame(np.random.randn(5, 4),
                  index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
df_14.loc['bar']

Summarizing and Computing Descriptive Statistics

Descriptive statistics provide a quick understanding of your data’s central tendency, spread, and shape. Pandas methods like sum(), mean(), describe(), and std() automatically skip NaN values by default, so they work correctly even on incomplete datasets. Running describe() immediately after loading a new dataset is one of the most common first steps in exploratory data analysis.

By default, pandas descriptive statistics skip NaN values. This means sum() and mean() return meaningful results even when some cells are missing – unlike NumPy, which propagates NaN through aggregations unless you use np.nanmean() explicitly.

df_6
df_6.sum()

Passing axis=1 sums across columns for each row. This is useful for computing row-level totals, such as the total spending per customer across multiple product categories.

df_6.sum(axis=1)

Setting skipna=False forces pandas to propagate NaN through the computation. If any value in a row (or column) is NaN, the result is NaN. This is useful when missing data should invalidate the entire aggregation rather than be silently ignored.

df_6.sum(axis=1, skipna=False)
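A side-by-side sketch of the pandas and NumPy defaults (toy data):

```python
import numpy as np
from pandas import Series

s = Series([1.0, np.nan, 3.0])

# pandas skips NaN by default; raw NumPy propagates it unless told otherwise
assert s.mean() == 2.0
assert np.isnan(np.mean(s.values))
assert np.nanmean(s.values) == 2.0
```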

Cleaning Data

Data cleaning is where data scientists spend the majority of their time. The key operations are replacing values (correcting errors or standardizing labels), dropping rows or columns (removing irrelevant or corrupt data), and concatenating DataFrames (combining data from multiple sources into a single table).

from pandas import Series, DataFrame
import pandas as pd

Setting up a sample DataFrame with state abbreviations and population data to demonstrate common cleaning operations.

data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1

Replace

replace() with inplace=True modifies the DataFrame in place, substituting all occurrences of the old value with the new value across every column. This is commonly used to standardize inconsistent labels – for example, converting abbreviations to full names.

df_1.replace('VA', 'VIRGINIA', inplace=True)
df_1

Passing a nested dictionary to replace() lets you target specific columns. The outer key is the column name and the inner dict maps old values to new values. This is useful when a value like 'MD' should only be replaced in the 'state' column, not elsewhere.

df_1.replace({'state' : { 'MD' : 'MARYLAND' }}, inplace=True)
df_1

Drop

drop() with axis=1 removes the specified column and returns a copy. The original DataFrame is unaffected unless you use inplace=True. Dropping columns is a standard step when removing features that are identifiers, constants, or have been replaced by engineered alternatives.

df_2 = df_1.drop('population', axis=1)
df_2

Concatenate

pd.concat() stacks DataFrames vertically (by default) or horizontally (axis=1). It is the primary tool for combining data that shares the same schema – for example, appending monthly sales reports into one annual dataset. Pandas aligns columns by name, inserting NaN where columns do not match.

data_2 = {'state' : ['NY', 'NY', 'NY', 'FL', 'FL'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [6.0, 6.1, 6.2, 3.0, 3.1]}
df_3 = DataFrame(data_2)
df_3
df_4 = pd.concat([df_1, df_3])
df_4
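By default concat keeps each input's row labels, so the stacked result can contain duplicate index values; ignore_index=True rebuilds a fresh 0..n-1 index instead. A toy sketch:

```python
import pandas as pd
from pandas import DataFrame

a = DataFrame({'state': ['VA'], 'year': [2012]})
b = DataFrame({'state': ['NY'], 'year': [2013]})

stacked = pd.concat([a, b])                        # index: [0, 0]
renumbered = pd.concat([a, b], ignore_index=True)  # index: [0, 1]
assert list(stacked.index) == [0, 0]
assert list(renumbered.index) == [0, 1]
```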

Input and Output

Reading data into pandas and writing results back out are bookend operations in every analysis. Pandas supports dozens of formats – CSV, Excel, SQL, JSON, Parquet, HDF5 – making it the universal translator between data sources and your Python code.

from pandas import Series, DataFrame
import pandas as pd

Reading

pd.read_csv() is the most commonly used data loading function. By default it treats the first row as column headers and infers data types. Use sep='\t' for tab-separated files, encoding='latin1' for non-UTF-8 files, and parse_dates=['col'] to auto-parse date columns.

df_1 = pd.read_csv("../data/ozone.csv")
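The parse_dates behavior can be demonstrated without a file on disk by reading from an in-memory buffer (the column names here are hypothetical):

```python
import pandas as pd
from io import StringIO

csv_text = "date,value\n2015-01-01,1.5\n2015-01-02,2.5\n"
df_dates = pd.read_csv(StringIO(csv_text), parse_dates=['date'])

# the 'date' column is parsed into datetime64 rather than left as strings
assert df_dates['date'].dtype.kind == 'M'
assert len(df_dates) == 2
```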

describe() returns count, mean, standard deviation, min, max, and quartile values for every numeric column. It is typically the first function called after loading a new dataset, giving you a rapid sense of scale, distribution, and potential issues like zero variance or extreme outliers.

df_1.describe()

head() returns the first five rows by default (pass an integer for a different number). It is a lightweight way to inspect the structure and content of your data without printing the entire DataFrame.

df_1.head()

Writing

to_csv() writes the DataFrame to a CSV file. Setting index=False omits the row index (which is usually just 0, 1, 2, …) and header=False omits column names. The encoding parameter ensures the file is written in the specified character set, which matters for international datasets with non-ASCII characters.

df_1.to_csv('../data/ozone_copy.csv', 
            encoding='utf-8', 
            index=False, 
            header=False)

Listing the data directory verifies that the output file was created. In production workflows, you would typically check file size and row counts rather than relying on directory listings.

!ls -l ../data/