Day 3 - Tidyverse

January, 2018

Introduction

Day 1 - Getting started
Day 2 - Functions & Spark
Day 3 - Tidyverse
Day 4 - Plotly
Day 5 - Shiny Introduction
Day 6 - Reactivity
Day 7 - Modules
Day 8 - Shiny Project

Day 3 - Tidyverse

Day 3 - Agenda

select
filter
arrange
mutate
summarise
group_by
%>% (pipe)

Data Science Toolchain

select

It’s not uncommon to get data sets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select(diamonds, cut, color, carat, price)
select(diamonds, x:z)
select(diamonds, -(x:z))
select(diamonds, starts_with("c"))
select(diamonds, ends_with("e"))
select(diamonds, contains("r"))

TIP: Move sorting variables to the start of the data frame and only keep the important variables. Variables can be renamed at the same time.

filter

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(diamonds, cut=="Ideal")
filter(diamonds, cut!="Ideal")
filter(diamonds, carat>=4) # <, >, ==, !=, <=, >=
filter(diamonds, cut=="Ideal" & carat>=4 )
filter(diamonds, cut=="Ideal" | carat>=4 )
filter(diamonds, cut %in% c("Ideal","Premium"))

sqrt(2)^2 == 2
near(sqrt(2)^2, 2)

arrange

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

arrange(diamonds, cut) #A-Z
arrange(diamonds, desc(cut)) #Z-A

arrange(diamonds, price) #Small to large
arrange(diamonds, desc(cprice)) #Large to small

arrange(diamonds, cut, desc(price)) #by two or more variables

mutate

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

Arithmetic: +, -, *, /, ^
Modular: %/% (integer division), %% (remainder), x == y * (x %/% y) + (x %% y)
Logs: log(), log2(), logn()
Offsets: lead(), lag()
Cumulatives: cumsum(), cumprod(), cummin(), cummax() See RcppRoll package for more.
Logical: <, <=, >, >=, !=, ==
Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
User defined: function(){} – should be a vectorised function

mutate

TIP: Arithmetic operators are useful in conjunction with aggregate functions, e.g. X/sum(X) gives the proportion, and Y-mean(Y) computes the difference from the mean.
TIP: Offsets allows you to compute running differences (e.g. x-lag(x)) or find when values change (X != lag(X)) They are most useful in conjunction with group_by(), but make sure to sort first using arrange().

mutate(
  diamonds,
  price_p_carat = price / carat,
  diff = price_p_carat - mean(price_p_carat),
  z_score = diff / sd(price_p_carat)
)

summarise

The last key verb is summarise(). It collapses a data to a single row. summarise() is not terribly useful unless we pair it with group_by().

TIP: There are many built in functions so don’t reinvent the wheel.
TIP: The result of a summary can be used directly in the next step to calculate other statistics.
WARNING: Remember when calculating statistics that the result is not always as you would expect, e.g. mean() returns the straight average not the weighted average.
WARNING: Always check the documentation before using built in functions to know what options there are and what the default options are. It is important to understand exactly what you are calculating.

summarise

summarise(
  diamonds,
  N = n(),
  sum = sum(price),
  ave1 = sum / N,
  SSD = sum( (price - mean(price)) ^2),
  SD = sqrt( SSD / (n() -1) )
)

## # A tibble: 1 x 5
##       N       sum  ave1          SSD    SD
##   <int>     <int> <dbl>        <dbl> <dbl>
## 1 53940 212135217  3933 858473135517  3989

group_by

summarise() is not terribly useful unless we pair it with group_by(). When you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

TIP: group_by() is useful when calculating statistics per group. These statistics can then be easily compared.

TIP: Complicated models can also be built and then run on a group-by-group basis.

WARNING: When using group_by() with summarise() the groups get unwound after the summarise(). That means if you group by Var1 and Var2 after doing a summary the data frame will only be grouped by Var1. Thus the order of the variables used in the group_by() matter.

group_by

diamonds_grouped <- group_by(diamonds,cut)
summarise(
  diamonds_grouped,
  N = n(),
  average = mean(price),
  SD = sd(price)
)

## # A tibble: 5 x 4
##   cut           N average    SD
##   <ord>     <int>   <dbl> <dbl>
## 1 Fair       1610    4359  3560
## 2 Good       4906    3929  3682
## 3 Very Good 12082    3982  3936
## 4 Premium   13791    4584  4349
## 5 Ideal     21551    3458  3808

%>% (pipe)

%>% is used to string functions together. This makes writing a set of logic clear and condensed.

diamonds%>%
  group_by(color, clarity)%>%
  summarise(n = n())%>%
  mutate(prop=n/sum(n))%>%
  plot_ly( x = ~color, y = ~prop, color= ~clarity,type = "bar",colors = pal_deloitte)%>%
  layout(barmode = "stack")

%>% (pipe)

Exercise

Using the data we simulated yesterday and calculate the transistion matrix for each segment using Spark. A transition rate is defined as: \[p_{ij}=Pr({X_{t+1}=j|X_{t}=i})\] \[p_{ij}=\frac{\sum_n balance_{n,t} \times I(X_{n,t+1}=j|X_{n,t}=i)}{\sum_n balance_{n,t} \times I(X_{n,t}=i)}\] HINT: Make sure that your rows sum up to one