Run this notebook: Open in Colab Open in Kaggle

Basic Pandas Examples¶

This notebook will walk you through some very basic Pandas concepts. We will start with importing typical data science libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Series¶

Series is like a list or 1D-array, but with index. All operations are index-aligned.

a = pd.Series(range(1,10))
b = pd.Series(["I","like","to","use","Python","and","Pandas","very","much"],index=range(0,9))
print(a,b)

One of the frequent usage of series is time series. In time series, index has a special structure - typically a range of dates or datetimes. We can create such an index with pd.date_range.

Suppose we have a series that shows the amount of product bought every day, and we know that every sunday we also need to take one item for ourselves. Here is how to model that using series:

start_date = "Jan 1, 2020"
end_date = "Dec 31, 2020"
idx = pd.date_range(start_date,end_date)
print(f"Length of index is {len(idx)}")
items_sold = pd.Series(np.random.randint(25,50,size=len(idx)),index=idx)
items_sold.plot(figsize=(10,3))
plt.show()

additional_items = pd.Series(10,index=pd.date_range(start_date,end_date,freq="W"))
print(f"Additional items (10 item each week):\n{additional_items}")
total_items = items_sold+additional_items
print(f"Total items (sum of two series):\n{total_items}")

As you can see, we are having problems here, because in the weekly series non-mentioned days are considered to be missing (NaN), and adding NaN to a number gives us NaN. In order to get correct result, we need to specify fill_value when adding series:

total_items = items_sold.add(additional_items,fill_value=0)
print(total_items)
total_items.plot(figsize=(10,3))
plt.show()

monthly = total_items.resample("1M").mean()
ax = monthly.plot(kind='bar',figsize=(10,3))
ax.set_xticklabels([x.strftime("%b-%Y") for x in monthly.index], rotation=45)
plt.show()

DataFrame¶

A dataframe is essentially a collection of series with the same index. We can combine several series together into a dataframe. Given a and b series defined above:

df = pd.DataFrame([a,b])
df

We can also use Series as columns, and specify column names using dictionary:

df = pd.DataFrame({ 'A' : a, 'B' : b })
df

The same result can be achieved by transposing (and then renaming columns, to match the previous example):

pd.DataFrame([a,b]).T.rename(columns={ 0 : 'A', 1 : 'B' })

Selecting columns from DataFrame can be done like this:

print(f"Column A (series):\n{df['A']}")
print(f"Columns B and A (DataFrame):\n{df[['B','A']]}")

Selecting rows based on filter expression:

df[df['A']<5]

The way it works is that expression df['A']<5 returns a boolean series, which indicates whether expression is True or False for each elements of the series. When series is used as an index, it returns subset of rows in the DataFrame. Thus it is not possible to use arbitrary Python boolean expression, for example, writing df[df['A']>5 and df['A']<7] would be wrong. Instead, you should use special & operation on boolean series:

df[(df['A']>5) & (df['A']<7)]

Creating new computable columns. We can easily create new computable columns for our DataFrame by using intuitive expressions. The code below calculates divergence of A from its mean value.

df['DivA'] = df['A']-df['A'].mean()
df

What actually happens is we are computing a series, and then assigning this series to the left-hand-side, creating another column.

# WRONG: df['ADescr'] = "Low" if df['A'] < 5 else "Hi"
df['LenB'] = len(df['B']) # Wrong result

df['LenB'] = df['B'].apply(lambda x: len(x))
# or
df['LenB'] = df['B'].apply(len)
df

Selecting rows based on numbers can be done using iloc construct. For example, to select first 5 rows from the DataFrame:

df.iloc[:5]

Grouping is often used to get a result similar to pivot tables in Excel. Suppose that we want to compute mean value of column A for each given number of LenB. Then we can group our DataFrame by LenB, and call mean:

df.groupby(by='LenB').mean()

If we need to compute mean and the number of elements in the group, then we can use more complex aggregate function:

df.groupby(by='LenB') \
 .aggregate({ 'DivA' : len, 'A' : lambda x: x.mean() }) \
 .rename(columns={ 'DivA' : 'Count', 'A' : 'Mean'})

Printing and Plotting¶

Data Scientist often has to explore the data, thus it is important to be able to visualize it. When DataFrame is big, many times we want just to make sure we are doing everything correctly by printing out the first few rows. This can be done by calling df.head(). If you are running it from Jupyter Notebook, it will print out the DataFrame in a nice tabular form.

df.head()

We have also seen the usage of plot function to visualize some columns. While plot is very useful for many tasks, and supports many different graph types via kind= parameter, you can always use raw matplotlib library to plot something more complex. We will cover data visualization in detail in separate course lessons.

df['A'].plot()
plt.show()

df['A'].plot(kind='bar')
plt.show()

This overview covers most important concepts of Pandas, however, the library is very rich, and there is no limit to what you can do with it! Let’s now apply this knowledge for solving specific problem.