January, 2018

Introduction

  • Day 1 - Getting started
  • Day 2 - Functions & Spark
  • Day 3 - Tidyverse
  • Day 4 - Plotly
  • Day 5 - Shiny Introduction
  • Day 6 - Reactivity
  • Day 7 - Modules
  • Day 8 - Shiny Project

Day 2 - Functions & Spark

Day 2 - Agenda

  • Functions
  • Spark

Functions

Functions

Example 1 - Hello World

myFunction <- function() {
  print("Hello World")
}
myFunction()
## [1] "Hello World"

Functions

Example 2 - with inputs

myFunction <- function(a, b = 2) {
  total <- a + b
  return(total)
}

myFunction(1, 1)
## [1] 2
myFunction(1)
## [1] 3

Functions

Example 3 - using titanic data and glm function to fit a logistic regression

install.packages("titanic")
library(titanic)

fit<-glm( 
  data = titanic_train,
  formula = Survived ~ Sex + Age + Pclass,
  family = "binomial"
)
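
As a quick check of the fitted model (a minimal sketch; fit is the object created above), you can inspect the coefficients and look at the predicted survival probabilities:

summary(fit)                             # Estimated coefficients and significance
pred <- predict(fit, type = "response")  # Predicted probability of survival
head(pred)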

Functions

Example 4 - use 'rio' package to read and write (smallish) data from files

install.packages("rio")
data<-rio::import(file = "Data/titanic_train.csv",setclass = "tbl",integer64="double")
rio::export(x = titanic_train,file = "Data/titanic_train.csv")

Loops

In R there are several ways to repeat the same calculation many times (each approach is sketched briefly after the list):

  • for(variable in sequence ){ Do something }
  • while( condition ){ Do something }
  • apply family of functions (I mostly use lapply)
  • purrr package (part of the tidyverse)
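
A minimal sketch of each approach, adding up the numbers 1 to 5 (the task itself is only for illustration):

# for loop
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while loop
total <- 0
i <- 1
while (i <= 5) {
  total <- total + i
  i <- i + 1
}

# lapply returns a list, so unlist before summing
total <- sum(unlist(lapply(1:5, function(i) i)))

# purrr equivalent (map_dbl returns a numeric vector)
total <- sum(purrr::map_dbl(1:5, function(i) i))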

Spark & db

Data Science Toolchain

Download Spark

  1. Download Spark (https://spark.apache.org/downloads.html)
  2. Create a Spark directory on your C drive (C:/Spark)
  3. Unzip Spark into the Spark directory

Spark & db

When working with big data, use Spark. Spark is much faster than R alone and can handle very large datasets. Note that not all R functions work in Spark.

install.packages("sparklyr")
library(sparklyr)

spark_home_set("C:/Spark/spark-2.2.1-bin-hadoop2.7")
sc<-spark_connect(master="local") # Create a connection to spark

# Do all your analysis

spark_disconnect(sc)

Import

# Register a csv file as a Spark table without loading it into R
data <- spark_read_csv(
  sc,
  name = "titanic",
  path = "Data/titanic_train.csv",
  memory = FALSE,
  overwrite = TRUE
)

# Copy an R data frame into Spark
import_iris <- copy_to(sc, iris, "spark_iris", overwrite = TRUE)
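
A quick way to check that the imports worked (a minimal sketch using the two handles created above):

sdf_nrow(data)      # Number of rows in the Spark table
head(import_iris)   # Peek at the first few rows of the copied data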

Exercise 1

  1. Write a function (sim.pi) that takes one argument (iterations) with a default value of 1000
  2. Generate two vectors (x, y) of length iterations, uniformly distributed between (-1, 1)
  3. Test whether each coordinate pair falls inside the unit circle
    HINT: ifelse( x^2 + y^2 <=1, TRUE, FALSE)
  4. Count how many of the coordinate pairs fell inside the unit circle (note that in is a reserved word in R, so use a name like n_in)
  5. Return 4*n_in/iterations
  6. Congratulations, you estimated \(\pi\)!
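
One possible solution sketch (only sim.pi and iterations come from the exercise; the other names are our own):

sim.pi <- function(iterations = 1000) {
  # Random points in the square (-1, 1) x (-1, 1)
  x <- runif(iterations, min = -1, max = 1)
  y <- runif(iterations, min = -1, max = 1)

  # TRUE when a point falls inside the unit circle
  inside <- ifelse(x^2 + y^2 <= 1, TRUE, FALSE)

  # Fraction of points inside the circle, scaled to estimate pi
  4 * sum(inside) / iterations
}

sim.pi(100000)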

Exercise 2

  1. Remember the code we wrote on day one to simulate a loan. Turn that code into a function that accepts id (a numeric account identifier) and PD (the probability of default term structure).
  2. Use lapply to simulate 10 000 accounts
  3. Use dplyr::bind_rows to combine all the accounts into one dataset
  4. Use rio::export to save the data we just created. We will continue using it tomorrow when we start looking at data science in R.
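
A skeleton of the workflow (simulate_loan stands in for the Day 1 loan-simulation code; the PD values and output file name are placeholders):

# Placeholder: replace the body with the loan simulation code from day one
simulate_loan <- function(id, PD) {
  data.frame(id = id, month = seq_along(PD), PD = PD)
}

pd_curve <- rep(0.02, 12)   # Placeholder probability of default term structure

accounts <- lapply(1:10000, function(i) simulate_loan(id = i, PD = pd_curve))
loan_data <- dplyr::bind_rows(accounts)

rio::export(x = loan_data, file = "Data/simulated_loans.csv")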