
Pandas Use Cases in R

We use the dplyr library to reproduce common pandas use cases in R. We start by loading the usual data science libraries.

options(warn=-1)
library(dplyr)
library(tidyverse)
library(lubridate)
library(zoo)
library(xts)
library('ggplot2')

Series

A Series is like a list or 1D array, but with an index. In pandas, all operations on series are index-aligned. R has no direct equivalent, so here we store each series as a one-column data frame and use row.names as the index.

a<- 1:9

b = c("I","like","to","use","Python","and","Pandas","very","much")

a1 = length(a)
b1 = length(b)

a = data.frame(a,row.names = c(1:a1))
print(a)

b = data.frame(b,row.names = c(1:b1))
print(b)
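Note that arithmetic on plain R vectors is positional rather than label-aligned. As a minimal sketch (the named vectors x and y are hypothetical and not part of the notebook), index alignment can be imitated by reordering one vector by the names of the other:

```r
# Two small "series" stored as named numeric vectors (hypothetical data)
x <- c(a = 1, b = 2, c = 3)
y <- c(c = 30, a = 10, b = 20)

# Plain arithmetic pairs elements by position and keeps the names of x
positional <- x + y

# Reordering y by the names of x pairs elements by label instead
aligned <- x + y[names(x)]

print(positional)
print(aligned)
```

This only works when every name in x also exists in y; a missing name would produce NA for that element.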

One of the most frequent uses of series is time series. In a time series, the index has a special structure, typically a range of dates or datetimes. The easiest way to create a time series is the ts function, but here we build one another way: we parse the start and end dates with the lubridate library and create an index of dates with the seq function.
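For comparison, here is a minimal sketch of the ts approach mentioned above. The monthly values are made up for illustration; ts stores a start time and a frequency rather than explicit dates, which is why it is convenient for regularly spaced data such as monthly figures.

```r
# Hypothetical monthly values for one year
monthly_values <- c(12, 15, 11, 18, 20, 22, 25, 24, 19, 16, 13, 10)

# frequency = 12 means 12 observations per year, starting in January 2020
monthly_ts <- ts(monthly_values, start = c(2020, 1), frequency = 12)
print(monthly_ts)

# window() extracts a sub-period, here June to August 2020
summer <- window(monthly_ts, start = c(2020, 6), end = c(2020, 8))
print(summer)
```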

Suppose we have a series that shows the amount of product bought every day, and we know that once a week we also need to take 10 additional items for ourselves. Here is how to model the daily sales using a series:

# We will use ggplot2 for visualizing the data
# The repr library lets us change the plot size in a notebook
library(repr)
options(repr.plot.width = 12,repr.plot.height=6)

start_date <- mdy("Jan 1, 2020")
end_date <- mdy("Dec 31, 2020")
idx = seq(start_date,end_date,by ='day')
print(paste("length of index is ",length(idx)))
size = length(idx)
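# Draw one random sales value per day instead of hard-coding 366
sales = runif(size,min=25,max=50)
# R vectors are 1-indexed, so the whole index is simply idx
sold_items <- data.frame(row.names=idx,sales)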
ggplot(sold_items,aes(x=idx,y=sales)) + geom_point(color = "firebrick", shape = "diamond", size = 2) +
    geom_line(color = "firebrick", size = .3)
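The seq function accepts other step sizes as well. The sketch below uses base R's as.Date in place of mdy so that it runs without lubridate, and the variable names are illustrative. It shows how many dates each step size yields for 2020 and on which weekday a weekly sequence lands.

```r
start_date <- as.Date("2020-01-01")
end_date <- as.Date("2020-12-31")

daily_idx <- seq(start_date, end_date, by = "day")
weekly_idx <- seq(start_date, end_date, by = "week")
monthly_idx <- seq(start_date, end_date, by = "month")

# 2020 is a leap year, so the daily index has 366 entries
print(c(days = length(daily_idx), weeks = length(weekly_idx), months = length(monthly_idx)))

# A weekly sequence keeps the weekday of its start date.
# "%u" is the ISO weekday number (1 = Monday, ..., 7 = Sunday).
print(unique(format(weekly_idx, "%u")))
```

Because 1 January 2020 is a Wednesday, a weekly sequence starting there contains only Wednesdays. To land on a different weekday, start the sequence on a date that falls on that day.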

We merge additional_items and sold_items so that we can find the total number of products. The first attempt at computing the total runs into a problem: days that are not in the weekly series are treated as missing (NA) after the merge, and adding NA to a number gives NA. To do the addition, we need to replace NA with 0.

index = seq(start_date,end_date,by = 'week')
sz = length(index)
# One entry of 10 additional items per weekly date (sz of them)
additional_product <- rep(10,sz)
additional_items  <- data.frame(row.names = index,additional_product)
additional_items
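# Merge the two data frames by row name (by = 0), keeping all days; [-1] drops the Row.names column
additional_item = merge(additional_items,sold_items, by = 0, all = TRUE)[-1]
# Days without an additional purchase are NA here, so their total is NA as well
total  = data.frame(row.names=idx,additional_item$additional_product + additional_item$sales)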
colnames(total) =  c('total')
total

# Replace NA with 0 so that the addition works for every day
additional_item[is.na(additional_item)] = 0
total  = data.frame(row.names=idx,additional_item$additional_product + additional_item$sales)
colnames(total) =  c('total')
total
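The same merge-then-fill pattern can be seen on a tiny example. This is a sketch with made-up numbers, and it merges on an explicit date column instead of row names, which is an alternative layout and not the one used above.

```r
# Four days of hypothetical sales and one additional purchase on the first day
daily <- data.frame(date = as.Date("2020-01-01") + 0:3,
                    sales = c(30, 40, 50, 60))
weekly <- data.frame(date = as.Date("2020-01-01"), extra = 10)

# all.x = TRUE keeps every day; days without a match get NA in 'extra'
merged <- merge(daily, weekly, by = "date", all.x = TRUE)
na_before <- sum(is.na(merged$extra))

# Adding NA would give NA, so replace the missing values with 0 first
merged$extra[is.na(merged$extra)] <- 0
merged$total <- merged$sales + merged$extra
print(merged)
```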

ggplot(total,aes(x=idx,y=total)) + geom_point(color = "firebrick", shape = "diamond", size = 2) +
    geom_line(color = "firebrick", linetype = "dotted", size = .3)

We want to analyse the total number of products on a monthly basis. Thus, we compute the mean of the daily totals for each month and draw a bar graph.

index = seq(start_date,end_date,by ='month')

x<- as.xts(total, dateFormat ="Date")
(monthly<-apply.monthly(x,mean))
ggplot(monthly, aes(x=index, y=total)) + 
  geom_bar(stat = "identity", width=5) 
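If xts is not available, a monthly mean can also be computed with base R. The following is only a sketch: the data frame daily and its values are hypothetical, chosen so that the expected means are easy to check by eye.

```r
# Hypothetical daily totals: 10 on every day in January, 20 afterwards
daily <- data.frame(date = seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day"))
daily$total <- ifelse(format(daily$date, "%m") == "01", 10, 20)

# Label each day with its year and month, then average within each label
daily$month <- format(daily$date, "%Y-%m")
monthly_mean <- aggregate(total ~ month, data = daily, FUN = mean)
print(monthly_mean)
```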

DataFrame

A dataframe is essentially a collection of series with the same index. We can combine several series together into a dataframe. For example, here we build a dataframe from the a and b series:

a = data.frame(a,row.names = c(1:a1))

b = data.frame(b,row.names = c(1:b1))

df<- data.frame(a,b)
df

We can also rename the columns by using the rename function

df = rename(df,
    A = a,
    B = b
)

df

We can also select a column in a dataframe using the select function

cat("Column A (series):\n")
select(df,'A')

We can extract the rows that meet a logical condition on a series

df[df$A<5,]

df[df$A>5 & df$A<7,]
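The same row selection can be written with dplyr's filter function. This is a sketch on a stand-alone copy of the data (df_demo is an illustrative name); conditions separated by commas are combined with a logical AND.

```r
suppressPackageStartupMessages(library(dplyr))

df_demo <- data.frame(
  A = 1:9,
  B = c("I", "like", "to", "use", "Python", "and", "Pandas", "very", "much")
)

# Rows where A is below 5
small <- filter(df_demo, A < 5)

# Rows where A is above 5 and below 7
between_5_and_7 <- filter(df_demo, A > 5, A < 7)

print(small)
print(between_5_and_7)
```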

Creating new columns.

The code below creates a series that holds the deviation of A from its mean value and adds it to the existing dataframe as a new column.

df$DivA <- df$A - mean(df$A)

df

Next we create a series that holds the string length of each value in column B and add it to the existing dataframe

df$LenB <- str_length(df$B)

df
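Both derived columns can also be added in a single step with dplyr's mutate function. The sketch below rebuilds the data under the illustrative name df_demo and uses base R's nchar in place of str_length, so that only dplyr is needed.

```r
suppressPackageStartupMessages(library(dplyr))

df_demo <- data.frame(
  A = 1:9,
  B = c("I", "like", "to", "use", "Python", "and", "Pandas", "very", "much")
)

# mutate() adds new columns computed from the existing ones
df_demo <- mutate(df_demo,
                  DivA = A - mean(A),  # deviation of A from its mean
                  LenB = nchar(B))     # number of characters in B
print(df_demo)
```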

Selecting rows based on their position (R counts rows from 1)

df[1:5,]
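R vectors and data frames are indexed from 1, unlike Python. A minimal sketch (the words vector is illustrative) of how an index of 0 behaves:

```r
words <- c("I", "like", "to", "use", "Python", "and", "Pandas")

# An index of 0 selects nothing, so 0:5 and 1:5 return the same elements
with_zero <- words[0:5]
from_one <- words[1:5]
print(identical(with_zero, from_one))

# Indexing with 0 alone returns an empty vector
print(length(words[0]))

# head() is a common shortcut for the first n elements or rows
print(head(words, 5))
```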

Grouping splits the rows into groups that share the same value in one or more columns. We then use the summarise function to compute a summary for each group.

Suppose that we want to compute the mean value of column A for each value of LenB. We can group our dataframe by LenB, compute the mean of A, and name the result a.

df1 = df %>% group_by(LenB) %>% summarise(a = mean(A))

df1

# Several summaries at once: the mean of A and the number of rows in each group
df2 = df %>% group_by(LenB) %>%
  summarise(MEAN = mean(A), count = n())

df2
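As a cross-check of the grouped summaries, the same numbers can be obtained with base R's tapply and table. The sketch rebuilds the data under the illustrative name df_demo so that it runs on its own.

```r
df_demo <- data.frame(
  A = 1:9,
  B = c("I", "like", "to", "use", "Python", "and", "Pandas", "very", "much")
)
df_demo$LenB <- nchar(df_demo$B)

# Mean of A within each distinct value of LenB
group_means <- tapply(df_demo$A, df_demo$LenB, mean)

# Number of rows within each distinct value of LenB
group_counts <- table(df_demo$LenB)

print(group_means)
print(group_counts)
```

For example, the words of length 4 ("like", "very", "much") have A values 2, 8 and 9, so their group mean is 19/3.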

Printing and Plotting

When we call head(df), it prints the first rows of the dataframe in tabular form.

The first step of any data science project is data cleaning and visualization, thus it is important to visualize the dataset and extract some useful information.

#dataset = read.csv("file name")

head(df)

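ggplot2 is a very good library because it makes it simple to create complex plots from data in a data frame.

It provides a more programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties. For quick plots, base R's plot and barplot functions, used in the code that follows, are also enough.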

plot(df$A,type = 'o',xlab = "no",ylab = "A")

barplot(df$A, ylab = 'A',xlab = 'no')
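The two base R plots above have close ggplot2 counterparts. The sketch below is one possible translation, not the only one; df_demo and its column no (the row number) are illustrative names.

```r
library(ggplot2)

df_demo <- data.frame(no = 1:9, A = 1:9)

# Line plot with points, similar to plot(..., type = 'o')
line_plot <- ggplot(df_demo, aes(x = no, y = A)) +
  geom_line() +
  geom_point()

# Bar plot, similar to barplot(); geom_col() uses the values in A as bar heights
bar_plot <- ggplot(df_demo, aes(x = no, y = A)) +
  geom_col()

print(line_plot)
print(bar_plot)
```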