- Day 1 - Getting started
- Day 2 - Functions & Spark
- Day 3 - Tidyverse
- Day 4 - Plotly
- Day 5 - Shiny Introduction
- Day 6 - Reactivity
- Day 7 - Modules
- Day 8 - Shiny Project
January, 2018
It’s not uncommon to get data sets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
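library(tidyverse)  # loads dplyr and ggplot2, which provides the diamonds data set

select(diamonds, cut, color, carat, price)  # keep the named columns
select(diamonds, x:z)                       # all columns from x to z
select(diamonds, -(x:z))                    # everything except x to z
select(diamonds, starts_with("c"))          # names starting with "c"
select(diamonds, ends_with("e"))            # names ending with "e"
select(diamonds, contains("r"))             # names containing "r"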
TIP: Move sorting variables to the start of the data frame and only keep the important variables. Variables can be renamed at the same time.
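The tip above can be sketched as follows. This is a minimal illustration, assuming dplyr and ggplot2 are installed; the new column name weight_carat is made up for the example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Put price and carat first, then keep every other column after them
reordered <- select(diamonds, price, carat, everything())

# Keep three columns and rename one of them in the same call
# (weight_carat is a hypothetical name chosen for this example)
renamed <- select(diamonds, cut, price, weight_carat = carat)

names(reordered)
names(renamed)
```

everything() is a selection helper that matches all remaining columns, so nothing is dropped in the first call.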
filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
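filter(diamonds, cut == "Ideal")
filter(diamonds, cut != "Ideal")
filter(diamonds, carat >= 4)                        # comparison operators: <, >, ==, !=, <=, >=
filter(diamonds, cut == "Ideal" & carat >= 4)       # both conditions must hold
filter(diamonds, cut == "Ideal" | carat >= 4)       # either condition may hold
filter(diamonds, cut %in% c("Ideal", "Premium"))    # cut is one of the listed values

sqrt(2)^2 == 2       # FALSE: floating-point rounding
near(sqrt(2)^2, 2)   # TRUE: compares within a tolerance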
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
arrange(diamonds, cut)               # in factor-level order: Fair to Ideal
arrange(diamonds, desc(cut))         # reverse order: Ideal to Fair
arrange(diamonds, price)             # small to large
arrange(diamonds, desc(price))       # large to small
arrange(diamonds, cut, desc(price))  # by two or more variables
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
TIP: Arithmetic operators are useful in conjunction with aggregate functions, e.g. X/sum(X) gives the proportion, and Y-mean(Y) computes the difference from the mean.
TIP: Offsets allow you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), but make sure to sort first using arrange().
mutate(
  diamonds,
  price_p_carat = price / carat,
  diff = price_p_carat - mean(price_p_carat),
  z_score = diff / sd(price_p_carat)
)
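The two tips on proportions and offsets can be sketched on a small made-up table. The data frame balances and its columns account, period and balance are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Toy data: two hypothetical accounts observed over three periods, deliberately unsorted
balances <- tibble(
  account = c("a", "a", "a", "b", "b", "b"),
  period  = c(2, 1, 3, 1, 3, 2),
  balance = c(120, 100, 90, 50, 80, 50)
)

sorted  <- arrange(balances, account, period)  # sort first so lag() sees rows in time order
grouped <- group_by(sorted, account)           # offsets and sums now work per account

result <- ungroup(mutate(
  grouped,
  share   = balance / sum(balance),   # proportion of the account total
  change  = balance - lag(balance),   # running difference within each account
  changed = balance != lag(balance)   # TRUE where the value differs from the previous row
))

result
```

The first row of each account has no previous value, so change and changed are NA there.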
The last key verb is summarise(). It collapses a data frame to a single row. summarise() is not terribly useful unless we pair it with group_by().
summarise(
  diamonds,
  N = n(),
  sum = sum(price),
  ave1 = sum / N,
  SSD = sum((price - mean(price))^2),
  SD = sqrt(SSD / (n() - 1))
)
## # A tibble: 1 x 5
##       N       sum  ave1          SSD    SD
##   <int>     <int> <dbl>        <dbl> <dbl>
## 1 53940 212135217  3933 858473135517  3989
Pairing summarise() with group_by() changes the unit of analysis from the whole data frame to individual groups: when you use the dplyr verbs on a grouped data frame, they are automatically applied “by group”.
TIP: group_by() is useful when calculating statistics per group. These statistics can then be easily compared.
TIP: Complicated models can also be built and then run on a group-by-group basis.
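One possible way to run a model group by group is sketched below. It assumes dplyr 0.8.1 or later, where group_modify() is available; the helper fit_one and the choice of a simple linear model of price on carat are illustrative assumptions, not part of the course material.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Hypothetical helper: fit price ~ carat on one group's rows and return its coefficients.
# group_modify() passes the group's rows as the first argument and its key as the second.
fit_one <- function(d, key) {
  fit <- lm(price ~ carat, data = d)
  tibble(intercept = coef(fit)[["(Intercept)"]], slope = coef(fit)[["carat"]])
}

by_cut <- group_modify(group_by(diamonds, cut), fit_one)
by_cut  # one row per cut, holding that group's intercept and slope
```

Because the result is an ordinary data frame with one row per group, the fitted coefficients can be compared across groups with the same verbs used elsewhere on this page.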
WARNING: When group_by() is used with summarise(), each summarise() peels off one level of grouping: the last grouping variable is dropped. If you group by Var1 and Var2, the data frame is grouped only by Var1 after the summary. The order of the variables in group_by() therefore matters.
diamonds_grouped <- group_by(diamonds, cut)
summarise(
  diamonds_grouped,
  N = n(),
  average = mean(price),
  SD = sd(price)
)
## # A tibble: 5 x 4
##   cut           N average    SD
##   <ord>     <int>   <dbl> <dbl>
## 1 Fair       1610    4359  3560
## 2 Good       4906    3929  3682
## 3 Very Good 12082    3982  3936
## 4 Premium   13791    4584  4349
## 5 Ideal     21551    3458  3808
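The warning about groups being unwound can be checked directly with group_vars(). The sketch below groups by two variables and inspects the grouping before and after summarise(); recent dplyr versions also print a message about the remaining grouping.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

by_two <- group_by(diamonds, cut, color)
counts <- summarise(by_two, n = n())

group_vars(by_two)  # "cut" "color"
group_vars(counts)  # "cut": the last grouping variable has been dropped

# counts is still grouped by cut, so sum(n) is the total per cut
shares <- mutate(counts, prop = n / sum(n))

# ungroup() removes the remaining grouping when overall totals are wanted
overall <- mutate(ungroup(counts), prop = n / sum(n))
```

In shares the proportions sum to one within each cut, while in overall they sum to one across the whole table.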
The pipe operator %>% strings functions together: it passes the result on its left as the first argument of the function on its right. This keeps a sequence of steps clear and condensed.
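library(plotly)

diamonds %>%
  group_by(color, clarity) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%   # still grouped by color, so prop is the share within each color
  plot_ly(
    x = ~color, y = ~prop, color = ~clarity,
    type = "bar",
    colors = pal_deloitte         # colour palette, assumed to be defined elsewhere in the course
  ) %>%
  layout(barmode = "stack")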
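To see what the pipe does, the sketch below writes the same three steps twice, once as nested calls and once with %>%. The filter on carat >= 1 is an arbitrary choice for the illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Nested calls have to be read from the inside out
nested <- summarise(group_by(filter(diamonds, carat >= 1), cut), average = mean(price))

# The same steps with the pipe read from top to bottom
piped <- diamonds %>%
  filter(carat >= 1) %>%
  group_by(cut) %>%
  summarise(average = mean(price))

all.equal(nested, piped)
```

Both forms produce the same data frame; the pipe only changes how the code is written.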
Using the data we simulated yesterday, calculate the transition matrix for each segment using Spark. A transition rate is defined as:
\[p_{ij}=\Pr(X_{t+1}=j \mid X_{t}=i)\]
and is estimated with balance weights as:
\[p_{ij}=\frac{\sum_n \text{balance}_{n,t} \times I(X_{n,t+1}=j,\ X_{n,t}=i)}{\sum_n \text{balance}_{n,t} \times I(X_{n,t}=i)}\]
where I(·) is the indicator function.
HINT: Make sure that the rows of the matrix sum to one.
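As a starting point, the balance-weighted formula can be sketched on a tiny local data frame. Everything here is an assumption made for illustration: the table panel, its columns id, t, state and balance, and the use of plain dplyr rather than Spark. Whether each verb translates unchanged to a Spark table is not checked here, and segments are left out.

```r
library(dplyr)

# Toy panel with hypothetical column names: id, t, state, balance
panel <- tibble(
  id      = c(1, 1, 1, 2, 2, 2),
  t       = c(1, 2, 3, 1, 2, 3),
  state   = c("A", "A", "B", "A", "B", "B"),
  balance = c(100, 100, 50, 200, 200, 100)
)

transitions <- panel %>%
  arrange(id, t) %>%
  group_by(id) %>%
  mutate(next_state = lead(state)) %>%   # state at t + 1 for the same id
  ungroup() %>%
  filter(!is.na(next_state)) %>%         # the last period has no t + 1
  group_by(state, next_state) %>%
  summarise(weight = sum(balance)) %>%   # numerator; result is still grouped by state
  mutate(p = weight / sum(weight)) %>%   # divide by the balance total of the starting state
  ungroup()

# Check from the hint: the rates out of each starting state sum to one
row_sums <- transitions %>%
  group_by(state) %>%
  summarise(total = sum(p))
```

In this toy data, state A has a balance of 100 staying in A and 300 moving to B, so the rates out of A are 0.25 and 0.75, and the only transition out of B is back to B.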