" />
class: center, middle, inverse, title-slide # Machine Learning 101 ## Supervised Learning in R ###
Sarah Romanes
<i class="fab fa-twitter faa-float animated "></i> @sarah_romanes
###
10-Oct-2018
<i class="fas fa-link faa-vertical animated " style=" color:white;"></i> bit.ly/rladies-sydney-ML-1
--- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Welcome!] ] ]] --- class: split-two white .column.bg-main1[.content.vmiddle.center[ # Overview <br> ### This two-part R-Ladies workshop is designed to give you a small taste of the large field that is known as .orange[**Machine Learning**]! Today, we will cover supervised learning techniques (explained later), and next week we will cover model performance assessment. <br> ### For a more in-depth explanation of the topics covered, please read the *free* textbook *An Introduction to Statistical Learning* (James, Witten, Hastie, and Tibshirani) [here](http://www-bcf.usc.edu/~gareth/ISL/). ]] .column.bg-main3[.content.vmiddle.center[ <center> <img src="images/ISLR.jpg", width="70%"> ]] --- class: middle center bg-main1 <img src="images/mlmeme.jpg", width="70%"> --- # .purple[What *is* Machine Learning?] <br> ### Machine learning is concerned with finding functions `\(Y=f(X)+\epsilon\)` that best **predict** outputs (responses), given data inputs (predictors). <br> <center> <img src="images/ml-process.png", width="50%"> </center> <br> ### Mathematically, Machine Learning problems are simply *optimisation* problems, which we will use .purple[
R] to help us solve! --- # .purple[Why do Machine Learning in R
?] <br> <center> <img src="images/python-r-other-2016-2017.jpg", width="70%"> </center> --- class: split-two white .column.bg-main1[.content[ <br> # Types of Learning <br> ## .orange[Supervised Learning] ### - We have knowledge of class labels or values. ### - Goal: train a model using known class labels to predict the class or value of a new data point. ## .orange[Unsupervised Learning] ### - No knowledge of output class or value – the data is unlabelled. ### - Goal: determine data patterns/groupings. ]] .column[.content.vmiddle.center[ <img src="images/learning.png", width="80%"> ##### .purple[Credit to Towards Data Science] ]] --- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Regression] ] ]] --- class: split-two white .column.bg-main1[.content[ <br> # A simple prediction problem ### Consider the simulated dataset `Income`, which looks at the relationship between `Education` (years) and `Income` (thousands). ```r data <- read.csv("data/Income.csv") head(data) ``` ``` ## Education Income ## 1 10.00000 26.65884 ## 2 10.40134 27.30644 ## 3 10.84281 22.13241 ## 4 11.24415 21.16984 ## 5 11.64548 15.19263 ## 6 12.08696 26.39895 ``` ### What shape is the relationship? Can we build a function to predict the value of a new data point? ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="504" /> ]] --- class: middle center bg-main1 ## Linear Regression to the Rescue! <img src="images/ols.png", width="70%"> --- class: pink-code # .purple[What line is optimal?] <br> ### We can use a .purple[**linear function**] to describe our data and .purple[**predict**] a new data point. The question is, what is the .purple[*optimal line*]? -- ### One way to do this is to find the slope and intercept that .purple[*minimise*] the sum of the squared residuals between the line and our data points: <center> <img src="images/rss.png", width="25%"> </center> ### We can visualise this optimisation process using a .purple[**Shiny App**] [here](http://43.240.99.178:3838/sample-apps/LinearRegression/). --
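### In symbols, the quantity being minimised here (the residual sum of squares over all `\(n\)` data points) can be written as `$$\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2$$` where `\(\beta_0\)` is the intercept and `\(\beta_1\)` is the slope.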
### .purple[R] performs this optimisation process for us when we call the `lm` function. ###### (PS) see an alternative derivation [here](https://sarahromanes.github.io/post/gganimate/)! --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in R
using the `lm` function as follows: <br> ```r *fit <- lm(data=data, Income ~ Education) ``` ]] .column.bg-main3[.content.vmiddle.center[ # This tells the `lm` function what data we are referring to. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in
R using the `lm` function as follows: <br> ```r fit <- lm(data=data, * Income ~ Education) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## This tells the `lm` function what variables we would like to regress. ### R expects the relationship in the form of `response~predictors`. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in R
using the `lm` function as follows: ```r fit <- lm(data=data, Income ~ Education) *summary(fit) ``` ```r Call: lm(formula = Income ~ Education, data = data) Residuals: Min 1Q Median 3Q Max -13.046 -2.293 0.472 3.288 10.110 Coefficients: Estimate Std. Error t value Pr(>|t|) *(Intercept) -39.4463 4.7248 -8.349 4.4e-09 *** Education 5.5995 0.2882 19.431 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.653 on 28 degrees of freedom Multiple R-squared: 0.931, Adjusted R-squared: 0.9285 F-statistic: 377.6 on 1 and 28 DF, p-value: < 2.2e-16 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## We can use the `summary` function to examine regression coefficients, and information about the residuals of our model. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in
R using the `lm` function as follows: ```r fit <- lm(data=data, Income ~ Education) *summary(fit) ``` ```r Call: lm(formula = Income ~ Education, data = data) Residuals: Min 1Q Median 3Q Max -13.046 -2.293 0.472 3.288 10.110 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -39.4463 4.7248 -8.349 4.4e-09 *** *Education 5.5995 0.2882 19.431 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.653 on 28 degrees of freedom Multiple R-squared: 0.931, Adjusted R-squared: 0.9285 F-statistic: 377.6 on 1 and 28 DF, p-value: < 2.2e-16 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## We can use the `summary` function to examine regression coefficients, and information about the residuals of our model. ]]
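--- class: pink-code # .purple[Aside: seeing the fitted line] <br> ### Before predicting, it can help to see the fitted line over the raw data. A minimal sketch in base R, assuming `data` and `fit` from the previous slides:

```r
# Sketch: overlay the fitted regression line on the raw data
# (assumes `data` and `fit` from the previous slides)
plot(Income ~ Education, data = data, pch = 19, col = "grey40")
abline(fit, col = "purple", lwd = 2)  # draws the line using coef(fit)
coef(fit)                             # the fitted intercept and slope
```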
--- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in R using the `lm` function as follows: ```r fit <- lm(data=data, Income ~ Education) New_Data <- data.frame(Education = c(15, 18)) *predict(fit, New_Data) ``` ``` ## 1 2 ## 44.54599 61.34444 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `predict` function, we can predict the `Income` of a new `Education` value (or values). ]] --- class: split-two white .column.bg-main1[.content[ <br> # Modelling Binary Outcomes <br> ### Consider the dataset `Spiders` (Suzuki et al., 2006), from a study which looked at the relationship between the `GrainSize` of sand and `Spiders` presence. ```r data <- read.csv("data/Spiders.csv") head(data,3) ``` ``` ## GrainSize Spiders ## 1 0.245 0 ## 2 0.247 0 ## 3 0.285 1 ``` ### Can we use `lm` to predict the class for a new data point? ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ]] --- class: split-two white .column.bg-main1[.content[ <br> ## To model **binary data**, we need to .orange[link] our **predictors** to our response using a *link function*. <br> ### Instead of predicting the outcome directly, we predict the probability of being in class 1, given the linear combination of predictors, as follows: $$ p(y=1|\beta_0 + \beta_1 x) = \sigma(\beta_0 + \beta_1 x) $$ For the logistic (`logit`) link $$ p(y=1|\beta_0 + \beta_1 x) = \frac{1}{1+ \exp( - (\beta_0 + \beta_1 x))} $$ For the probit (`probit`) link $$ p(y=1|\beta_0 + \beta_1 x) = \Phi(\beta_0 + \beta_1 x) $$ ]] .column.white[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ]]
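--- class: pink-code # .purple[Aside: what the logit link looks like] <br> ### A quick sketch of the logistic curve itself, showing how any value of the linear predictor gets squashed into the interval `\((0,1)\)` (plain base R, nothing from the previous slides is needed):

```r
# Sketch: the logistic (logit) link maps the linear predictor onto (0, 1)
eta <- seq(-6, 6, by = 0.1)  # a range of linear predictor values
p   <- 1 / (1 + exp(-eta))   # the logistic transformation from the previous slide
plot(eta, p, type = "l", xlab = "linear predictor", ylab = "p(y = 1)")
```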
--- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R using the `glm` function as follows: <br> ```r *fit <- glm(data=data, Spiders~GrainSize, family=binomial(link="logit")) ``` ]] .column.bg-main3[.content.vmiddle.center[ # This tells the `glm` function what data we are referring to. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R
using the `glm` function as follows: <br> ```r fit <- glm(data=data, * Spiders~GrainSize, family=binomial(link="logit")) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## This tells the `glm` function what variables we would like to regress. ### Just like the `lm` function, R expects the relationship in the form of `response~predictors`. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R
using the `glm` function as follows: <br> ```r fit <- glm(data=data, Spiders~GrainSize, * family=binomial(link="logit")) ``` <center> OR </center> ```r fit <- glm(data=data, Spiders~GrainSize, * family=binomial(link="probit")) ``` <center> OR </center> ```r fit <- glm(data=data, Spiders~GrainSize, * family=binomial(link="cloglog")) ``` <center> and more... </center> ]] .column.bg-main3[.content.vmiddle.center[ ## This tells the `glm` function how we would like to model our response. For **binary** response data, we use the `binomial` family. ## Further, there are many ways we can link our linear combination of predictors to the 0,1 space. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in
R using the `glm` function as follows: <br> ```r fit <- glm(data=data, Spiders~GrainSize, family=binomial(link="logit")) summary(fit) ``` ```r Call: glm(formula = Spiders ~ GrainSize, family = binomial(link = "logit"), data = data) Deviance Residuals: Min 1Q Median 3Q Max -1.7406 -1.0781 0.4837 0.9809 1.2582 Coefficients: Estimate Std. Error z value Pr(>|z|) *(Intercept) -1.648 1.354 -1.217 0.2237 *GrainSize 5.122 3.006 1.704 0.0884 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Similar to the `lm` function, we can use the `summary` function to examine regression coefficients. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R
using the `glm` function as follows: <br> ```r fit <- glm(data=data, Spiders~GrainSize, family=binomial(link="logit")) New_Data <- data.frame(GrainSize = c(0.4, 0.87)) *probs <- predict(fit, New_Data,type="response") probs ``` ``` ## 1 2 ## 0.5989270 0.9431134 ``` ```r *round(probs) ``` ``` ## 1 2 ## 1 1 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## And we can also use the `predict` function to estimate class membership probability, as well as use the `round` command to estimate class. ]] --- class: middle center bg-main1 ## Of course, probabilities close to 0.5 are hard to classify!! <img src="images/response.jpg", width="70%"> --- class: middle center bg-main1
<i class="fas fa-exclamation-triangle fa-7x faa-flash animated "></i>
# A warning for using GLMs! --- class: split-two white .column[.content[ ```r data.new <- read.csv("data/SpidersWarning.csv") ggplot(data.new, aes(x=GrainSize, y=Spiders)) + geom_point(col="red", size=3) ``` <img src="index_files/figure-html/unnamed-chunk-23-1.png" width="504" /> ]] .column.bg-main3[.content.vmiddle.center[ ## Suppose the scientist who collected their data thought the study would look more convincing if the data was **perfectly** separated, and sneakily modified their results. How would their glm look? ]] --- class: split-60 white .column[.content[ <br> ```r fit <- glm(data=data.new, Spiders~GrainSize, family=binomial(link="logit")) ``` ``` ## Warning: glm.fit: algorithm did not converge ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Warnings appear when we try to fit this glm... ]] --- class: split-60 white .column[.content[ <br> ```r fit <- glm(data=data.new, Spiders~GrainSize, family=binomial(link="logit")) ``` ``` ## Warning: glm.fit: algorithm did not converge ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` ```r *summary(fit) ``` ```r Call: glm(formula = Spiders ~ GrainSize, family = binomial(link = "logit"), data = data.new) Deviance Residuals: Min 1Q Median 3Q Max -8.087e-05 -2.100e-08 -2.100e-08 2.100e-08 7.488e-05 Coefficients: Estimate Std. Error z value Pr(>|z|) *(Intercept) -912.4 362618.2 -0.003 0.998 *GrainSize 1569.2 624478.6 0.003 0.998 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Warnings appear when we try to fit this glm... ## ...and, when we look at the `summary`, we can see our regression coefficients are blowing up, as well as the standard errors associated with them! ]] --- class: middle center bg-main1 <img src="images/warning.jpg", width="70%"> --- class: split-two white .column.bg-main1[.content[ <br> # Multiple Regression <br> ### Can we fit a model with more than one predictor? Of course we can! Consider the dataset `Exam`, where two exam scores are given for each student, and a `Label` represents whether they passed or failed the course. ```r data <- read.csv("data/Exam.csv", header=T) head(data,4) ``` ``` ## Exam1 Exam2 Label ## 1 34.62366 78.02469 0 ## 2 30.28671 43.89500 0 ## 3 35.84741 72.90220 0 ## 4 60.18260 86.30855 1 ``` ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-29-1.png" width="504" /> ]]
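--- class: pink-code # .purple[Aside: a quick look at the classes] <br> ### Before fitting, it can be worth checking how many students passed and failed. A minimal sketch, assuming the `Exam` data frame `data` loaded on the previous slide:

```r
# Sketch: class balance of the Exam data (0 = failed, 1 = passed)
table(data$Label)
```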
--- class: split-60 white .column.bg-main1[.content[ # We can fit the glm in R using the `glm` function as follows: <br> ```r fit <- glm(data=data, * Label ~ ., family=binomial(link="logit")) summary(fit) ``` ```r Call: glm(formula = Label ~ ., family = binomial(link = "logit"), data = data) Deviance Residuals: Min 1Q Median 3Q Max -2.19287 -0.18009 0.01577 0.19578 1.78527 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -25.16133 5.79836 -4.339 1.43e-05 *** Exam1 0.20623 0.04800 4.297 1.73e-05 *** Exam2 0.20147 0.04862 4.144 3.42e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` ]] .column.bg-main3[.content.vmiddle.center[ ### The formula `Label ~ . ` tells the `glm` function to regress against all (two) predictors, `Exam1` and `Exam2`. ]] --- class: split-two white .column.bg-main1[.content[ <br> # The Decision Boundary <br> ### We can plot the **decision boundary** on our scatterplot for `Exam1` vs `Exam2` by looking at the set of points for which our classifier predicts `$$p(y=1|\beta_0 + \beta_1 \text{Exam1} + \beta_2 \text{Exam2}) \geq 0.5$$` #### Given the `logit` link, this is true when `$$\beta_0 + \beta_1 \text{Exam1} + \beta_2 \text{Exam2} \geq 0$$` #### This is the **decision boundary** and can be rearranged as $$ \text{Exam2} \geq \frac{-\beta_0}{\beta_2} + \frac{-\beta_1}{\beta_2} \text{Exam1} $$ ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-32-1.png" width="504" /> ]] --- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Classification] ] ]] --- class: middle center bg-main1 <img src="images/food.jpg", width="65%"> --- class: split-two white .column.bg-main1[.content[ <br> # When we can't use GLMs <br> ### Consider the dataset `Microchips`, containing test scores `Test1` and `Test2`, and `Label` indicating whether or not the chip passed the test. ```r data <- read.csv("data/Microchips.csv") head(data,4) ``` ``` ## Test1 Test2 Label ## 1 0.051267 0.69956 1 ## 2 -0.092742 0.68494 1 ## 3 -0.213710 0.69225 1 ## 4 -0.375000 0.50219 1 ``` ### Why can't we use `glm` to predict the class for a new data point? ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-34-1.png" width="504" /> ]] --- # .purple[Different data types require different machine learning methods] <br> ## While we can indeed use logistic regression to .purple[**classify**] data points, this simply isn't feasible when: ### - We have high class separation in our data ### - We have a non-linear combination of predictors influencing our response (as in this case) <br> ## .purple[**So, what other options do we have?**] --- class: split-two white .column.bg-main1[.content[ <br> # The K-Nearest Neighbours Algorithm <br> ## .orange[**K Nearest Neighbours (KNN)**] is a non-parametric algorithm (it doesn't assume the data follows any particular shape). <br> ## In KNN, we vote on the class of a new data point by looking at the majority class of its .orange[**K**] nearest neighbours. ]] .column[.content.vmiddle.center[ <img src="images/knn2-Natasha-Latysheva.jpg", width="80%"> ##### .purple[Credit to Natasha Latysheva] ]]
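--- class: pink-code # .purple[Aside: the KNN vote by hand] <br> ### A minimal sketch of the voting idea above, done manually for a single made-up query point with `k = 3` (assuming the `Microchips` data frame `data` from the earlier slide; the point `(0, 0)` is purely for illustration):

```r
# Sketch: the KNN vote for one new point, done by hand with k = 3
# (assumes the Microchips data frame `data`; the query point is made up)
new_point <- c(Test1 = 0, Test2 = 0)
d <- sqrt((data$Test1 - new_point["Test1"])^2 +
          (data$Test2 - new_point["Test2"])^2)  # Euclidean distances
nearest <- order(d)[1:3]                        # the 3 closest training points
table(data$Label[nearest])                      # the majority class wins the vote
```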
--- class: split-two white .column.bg-main1[.content[ <br> # KNN in R <br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## First, we split our data into **predictors** (`X`) and response (`cl`, for *class*). ]] --- class: split-two white .column.bg-main1[.content[ <br> # KNN in R
<br> ### Next, we need to give the function new data points, as the `knn` function fits and predicts at the same time. We will use `expand.grid` to generate a grid of new points whose classes we will predict. ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) *new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), * by=0.1), * y=seq(min(X[,2]-0.5), max(X[,2]+0.5), * by=0.1)) ``` ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-37-1.png" width="504" /> ]]
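--- class: pink-code # .purple[Aside: what does the grid look like?] <br> ### A quick peek at what `expand.grid` produced, assuming `new.X` from the previous slide:

```r
# Sketch: the grid is every combination of the two sequences of x and y values
dim(new.X)       # number of grid points by 2 columns
head(new.X, 3)   # the first few grid points
```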
--- class: split-two white .column.bg-main1[.content[ <br> # KNN in R <br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), by=0.1), y=seq(min(X[,2]-0.5), max(X[,2]+0.5), by=0.1)) library(class) *classif <- knn(X, new.X, cl, k = 3, prob=TRUE) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Now we call the `knn` function, passing in our data points, our new data, and the classes of the original data points, as well as how many neighbours we want to consider. In this example, we consider `k=3`. ]] --- class: split-two white .column.bg-main1[.content[ <br> # KNN in R
<br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), by=0.1), y=seq(min(X[,2]-0.5), max(X[,2]+0.5), by=0.1)) library(class) classif <- knn(X, new.X, cl, k = 3, prob=TRUE) ``` <br> <center> We can plot the decision boundaries, as well as the classification of our new points, as follows. Detailed R code is in slide raw files. </center> ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-40-1.png" width="504" /> ]] --- class: split-two white .column.bg-main1[.content[ <br> # KNN in
R <br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), by=0.1), y=seq(min(X[,2]-0.5), max(X[,2]+0.5), by=0.1)) library(class) *classif <- knn(X, new.X, cl, k = 9, prob=TRUE) ``` <br> <center> We can also see how our decisions change based on the number of neighbours we consider! Here, we consider `k=9`. </center> ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-42-1.png" width="504" /> ]] --- class: middle center bg-main1 ## What other classifiers exist besides KNN? <img src="images/knn.jpg", width="60%"> --- class: split-two white .column[.content[ <br> # .purple[Decision Trees] <br> <img src="images/tree.jpg", width="90%"> <br> ]] .column.bg-main1[.content.vmiddle.center[ ### A Decision Tree determines the predictive value based on a series of questions and conditions. <br> ### When making Decision Trees, there are several factors we take into consideration. These include: which features do we make our decisions on? What is the threshold for classifying each question into a .orange[**yes**] or .orange[**no**] answer? <br> Image Source: [Egor Dezhic](https://becominghuman.ai) ]] --- class: pink-code # .purple[Decision Trees in R
using `rpart`] <br> ## Consider the `iris` dataset ```r head(iris) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ``` ## How can we construct a decision tree to determine the `Species` of `iris`? --- class: split-two white .column.bg-main1[.content[ <br> # Trees in
R ```r library(rpart) *tree <- rpart(Species ~ ., data = iris, method = "class") ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `rpart` function, we build a classification tree, using the formula `Species~.` to indicate that we wish to use all the variables to predict our class. ]]
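--- class: pink-code # .purple[Aside: printing the tree] <br> ### Before plotting, the fitted `rpart` object can simply be printed to see its splits as text. A minimal sketch, assuming `tree` from the previous slide:

```r
# Sketch: show the splits of the fitted classification tree as text
# (assumes `tree` from the previous slide)
print(tree)
```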
--- class: split-two white .column.bg-main1[.content[ <br> # Trees in R ```r library(rpart) tree <- rpart(Species ~ ., data = iris, method = "class") *library(rpart.plot) *rpart.plot(tree) ``` <br> ### Further, using the `rpart.plot` function from the `rpart.plot` library, we can visualise the results of our classification tree. ### Each node shows the predicted class, the predicted probability of each class, and the percentage of observations in each node. ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-46-1.png" width="504" height="70%" /> ]] --- class: split-two white .column.bg-main1[.content[ <br> # Trees in R
```r library(rpart) tree <- rpart(Species ~ ., data = iris, method = "class") library(rpart.plot) rpart.plot(tree) ``` ```r New_Data <-data.frame(Sepal.Length= 5.0, Sepal.Width = 3.9, Petal.Length = 1.4, Petal.Width = 0.3) *predict(tree, New_Data) ``` ``` ## setosa versicolor virginica ## 1 1 0 0 ``` <br> #### Using `predict`, we can predict the class of a new data point, much like in the previous examples. Alternatively - we could have predicted using the tree ourselves! ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-49-1.png" width="504" height="70%" /> ]] --- # A single tree is prone to overfitting <br> <center> <img src="images/overfit.jpg", width="70%"> </center> <br> --- class: middle center # .purple[Ensemble Learning: Bagging] <img src="images/bagging.png", width="90%"> --- class: split-two white .column.bg-main1[.content[ <br> # From bagging to Random Forest <br> ### With Random Forest we randomly select a subset of predictors to use for each tree. <br> ### The reason? Tree correlation. <br> ### If one or a few features are very strong predictors for the response variable, then these features will be selected in many of the trees, causing them to become correlated. .orange[This can reduce the accuracy of the ensemble.] ]] .column[.content.vmiddle.center[ <img src="images/randomforest.png", width="85%"> ###### .black[Source:] [GSS](http://www.globalsoftwaresupport.com/random-forest-classifier-bagging-machine-learning/) ]] --- class: split-two white .column.bg-main1[.content[ <br> # `randomForest` in
R ```r library(randomForest) *iris.rf <- randomForest(Species ~ ., iris) iris.rf ``` ``` ## ## Call: ## randomForest(formula = Species ~ ., data = iris) ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 2 ## ## OOB estimate of error rate: 4% ## Confusion matrix: ## setosa versicolor virginica class.error ## setosa 50 0 0 0.00 ## versicolor 0 47 3 0.06 ## virginica 0 3 47 0.06 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `randomForest` function, we can train our ensemble using the same formula we passed to `rpart`. ]] --- class: split-two white .column.bg-main1[.content[ <br> # `randomForest` in R
```r library(randomForest) iris.rf <- randomForest(Species ~ ., iris) *predict(iris.rf, New_Data) ``` ``` ## 1 ## setosa ## Levels: setosa versicolor virginica ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Again, we can use the same `New_Data` values in the `predict` function as before to reach the same prediction we did with `rpart`. ]] --- class: pink-code # .purple[A bigger example] ### Let's look at the `SRBCT` dataset, from the package `multiDA`. This data has a response variable `vy`, which is a `factor` of four different cancer subtypes, and gene expression data from 1586 genes. In total, there are 63 observations. ```r library(multiDA) table(SRBCT$vy) ``` ``` ## ## 1 2 3 4 ## 8 23 12 20 ``` ```r dim(SRBCT$mX) ``` ``` ## [1] 63 1586 ``` --- class: split-two white .column[.content[ ```r *tree.SRBCT <- rpart(SRBCT$vy ~ SRBCT$mX, method = "class") rpart.plot(tree.SRBCT) ``` <img src="index_files/figure-html/unnamed-chunk-53-1.png" width="504" /> ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `rpart` function along with `rpart.plot`, we can see the tree used to classify the `SRBCT` data. Note how simple it is! ### PS - note here the alternative formulation of the input to the `rpart` function. If we input `X` and `y` directly, we don't need to reference the data set. ]] --- class: split-two white .column[.content[ <br> ```r *SRBCT.rf <- randomForest(SRBCT$mX, SRBCT$vy) SRBCT.rf ``` ``` ## ## Call: ## randomForest(x = SRBCT$mX, y = SRBCT$vy) ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 39 ## ## OOB estimate of error rate: 0% ## Confusion matrix: ## 1 2 3 4 class.error ## 1 8 0 0 0 0 ## 2 0 23 0 0 0 ## 3 0 0 12 0 0 ## 4 0 0 0 20 0 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `randomForest` function we can also classify on the same data. ### PS, here I use yet another formulation of input into the `randomForest` function. Note here how `X` comes before `y`! ]] --- class: split-two white .column[.content[ <br> ```r set.seed(1) *SRBCT.rf <- randomForest(SRBCT$mX, as.numeric(SRBCT$vy)) SRBCT.rf ``` ```r Call: randomForest(x = SRBCT$mX, y = as.numeric(SRBCT$vy)) * Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 528 Mean of squared residuals: 0.1641464 % Var explained: 85.07 ``` ]] .column.bg-main3[.content.vmiddle.center[ ##
<i class="fas fa-exclamation-triangle fa-2x faa-flash animated "></i>
Warning: when using `randomForest`, make sure that your response is a **factor** if you want classification. Here, because `SRBCT$vy` was converted with `as.numeric`, `randomForest` instead tried to *regress* the predictors to predict a continuous response! ]] --- class: split-two white .column.bg-main1[.content[ <br> # There are many more machine learning algorithms that I haven't touched on today! Some include... <br> ### - Support Vector Machines (`e1071`) ### - Boosted Tree Models (`xgboost`) ### - Discriminant Analysis methods (`MASS`, `sparsediscrim`, `multiDA`, `pamr`) ### - Regularised GLMs (`glmnet`) ### - Neural Networks (`neuralnet`, `keras`) ]] .column[.content.vmiddle.center[ <img src="images/zoolander.jpeg", width="80%"> ###### [Jacob Shiff](https://medium.com/@jacobshiff/machine-learning-a-brief-primer-ebd1ea265f71) ]] --- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Thank you!] ] ]] --- class: bg-main1 # Big thanks go to: <br> ## - Emi Tanaka for developing the Kunoichi-themed slides, and also Yihui Xie for developing `xaringan`! ## - Mark Greenaway for helping to host my Shiny app. ## - The R-Ladies Sydney organisers - you guys rock! ##
<i class="fas fa-thumbs-up fa-5x faa-float animated "></i>
--- class: split-two white .column.bg-main1[.content.vmiddle.center[ # Stay tuned for .orange[Part 2] next Wednesday - 'Assessing Model Performance'! #
<i class="fas fa-cogs fa-5x faa-vertical animated "></i>
]] .column[.content.vmiddle.center[ <img src="images/machinewar.png", width="70%"> ###### [SMBC](https://www.smbc-comics.com/comic/rise-of-the-machines) ]]