January, 2018

Introduction

  • Day 1 - Getting started
  • Day 2 - Functions & Spark
  • Day 3 - Tidyverse
  • Day 4 - Plotly
  • Day 5 - Shiny Introduction
  • Day 6 - Reactivity
  • Day 7 - Modules
  • Day 8 - Shiny Project

Day 2 - Functions & Spark

Day 2 - Agenda

  • Functions
  • Spark

Functions

Functions

Example 1 - Hello World

myFunction <- function() {
  print("Hello World")
}
myFunction()
## [1] "Hello World"

Functions

Example 2 - with inputs

myFunction <- function(a, b = 2) {
  total <- a + b
  return(total)
}

myFunction(1, 1)
## [1] 2
myFunction(1)
## [1] 3

Functions

Example 3 - using titanic data and glm function to fit a logistic regression

install.packages("titanic")
library(titanic)

fit<-glm( 
  data = titanic_train,
  formula = Survived ~ Sex + Age + Pclass,
  family = "binomial"
)
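
As a quick check of the fitted model (a minimal sketch; fit is the object created above), you can inspect the coefficients and look at the predicted survival probabilities:

summary(fit)                             # Estimated coefficients and significance
pred <- predict(fit, type = "response")  # Predicted probability of survival
head(pred)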

Functions

Example 4 - use 'rio' package to read and write (smallish) data from files

install.packages("rio")
data<-rio::import(file = "Data/titanic_train.csv",setclass = "tbl",integer64="double")
rio::export(x = titanic_train,file = "Data/titanic_train.csv")

Loops

In R there are several ways to repeat the same calculation many times (each approach is sketched briefly after the list):

  • for(variable in sequence ){ Do something }
  • while( condition ){ Do something }
  • apply family of functions (I mostly use lapply)
  • purrr package (part of the tidyverse)
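
A minimal sketch of each approach, adding up the numbers 1 to 5 (the task itself is only for illustration):

# for loop
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while loop
total <- 0
i <- 1
while (i <= 5) {
  total <- total + i
  i <- i + 1
}

# lapply returns a list, so unlist before summing
total <- sum(unlist(lapply(1:5, function(i) i)))

# purrr equivalent (map_dbl returns a numeric vector)
total <- sum(purrr::map_dbl(1:5, function(i) i))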

Spark & db

Data Science Toolchain

Download Spark

  1. Download Spark (https://spark.apache.org/downloads.html)
  2. Create a Spark directory on your C drive (C:/Spark)
  3. Unzip Spark into the Spark directory

Spark & db

When working with big data, use Spark. Spark is much faster than R alone and can handle very large datasets. Note that not all R functions work in Spark.

install.packages("sparklyr")
library(sparklyr)

spark_home_set("C:/Spark/spark-2.2.1-bin-hadoop2.7")
sc<-spark_connect(master="local") # Create a connection to spark

# Do all your analysis

spark_disconnect(sc)

Import

# Register a csv file as a Spark table without loading it into R
data <- spark_read_csv(
  sc,
  name = "titanic",
  path = "Data/titanic_train.csv",
  memory = FALSE,
  overwrite = TRUE
)

# Copy an R data frame into Spark
import_iris <- copy_to(sc, iris, "spark_iris", overwrite = TRUE)
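
A quick way to check that the imports worked (a minimal sketch using the two handles created above):

sdf_nrow(data)      # Number of rows in the Spark table
head(import_iris)   # Peek at the first few rows of the copied data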

Exercise 1

  1. Write a function (sim.pi) that takes one argument (iterations) with a default value of 1000
  2. Generate two vectors (x, y) of length iterations, uniformly distributed between (-1, 1)
  3. Test whether each coordinate pair falls inside the unit circle
    HINT: ifelse( x^2 + y^2 <=1, TRUE, FALSE)
  4. Count how many of the coordinate pairs fell inside the unit circle (note that in is a reserved word in R, so use a name like n_in)
  5. Return 4*n_in/iterations
  6. Congratulations, you estimated \(\pi\)!
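
One possible solution sketch (only sim.pi and iterations come from the exercise; the other names are our own):

sim.pi <- function(iterations = 1000) {
  # Random points in the square (-1, 1) x (-1, 1)
  x <- runif(iterations, min = -1, max = 1)
  y <- runif(iterations, min = -1, max = 1)

  # TRUE when a point falls inside the unit circle
  inside <- ifelse(x^2 + y^2 <= 1, TRUE, FALSE)

  # Fraction of points inside the circle, scaled to estimate pi
  4 * sum(inside) / iterations
}

sim.pi(100000)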

Exercise 2

  1. Remember the code we wrote on day one to simulate a loan. Turn that code into a function that accepts id (a numeric account identifier) and PD (the probability of default term structure).
  2. Use lapply to simulate 10 000 accounts
  3. Use dplyr::bind_rows to combine all the accounts into one dataset
  4. Use rio::export to save the data we just created. We will continue using it tomorrow when we start looking at data science in R.
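
A skeleton of the workflow (simulate_loan stands in for the Day 1 loan-simulation code; the PD values and output file name are placeholders):

# Placeholder: replace the body with the loan simulation code from day one
simulate_loan <- function(id, PD) {
  data.frame(id = id, month = seq_along(PD), PD = PD)
}

pd_curve <- rep(0.02, 12)   # Placeholder probability of default term structure

accounts <- lapply(1:10000, function(i) simulate_loan(id = i, PD = pd_curve))
loan_data <- dplyr::bind_rows(accounts)

rio::export(x = loan_data, file = "Data/simulated_loans.csv")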