" />
class: center, middle, inverse, title-slide # Machine Learning 101 ## Supervised Learning in R ###
Sarah Romanes
<i class="fab fa-twitter faa-float animated "></i> @sarah_romanes
###
10-Oct-2018
<i class="fas fa-link faa-vertical animated " style=" color:white;"></i> bit.ly/rladies-sydney-ML-1
--- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Welcome!] ] ]] --- class: split-two white .column.bg-main1[.content.vmiddle.center[ # Overview <br> ### This two-part R-Ladies workshop is designed to give you a small taste of the large field that is known as .orange[**Machine Learning**]! Today, we will cover supervised learning techniques (explained later), and next week we will cover model performance assessment. <br> ### For a more in-depth explanation of the topics covered, please read the *free* textbook *An Introduction to Statistical Learning* (James, Witten, Hastie, and Tibshirani) [here](http://www-bcf.usc.edu/~gareth/ISL/). ]] .column.bg-main3[.content.vmiddle.center[ <center> <img src="images/ISLR.jpg", width="70%"> ]] --- class: middle center bg-main1 <img src="images/mlmeme.jpg", width="70%"> --- # .purple[What *is* Machine Learning?] <br> ### Machine learning is concerned with finding functions `\(Y=f(X)+\epsilon\)` that best **predict** outputs (responses), given data inputs (predictors). <br> <center> <img src="images/ml-process.png", width="50%"> </center> <br> ### Mathematically, Machine Learning problems are simply *optimisation* problems, which we will use .purple[
R] to help us solve! --- # .purple[Why do Machine Learning in R
?] <br> <center> <img src="images/python-r-other-2016-2017.jpg", width="70%"> </center> --- class: split-two white .column.bg-main1[.content[ <br> # Types of Learning <br> ## .orange[Supervised Learning] ### - We have knowledge of class labels or values. ### - Goal: train a model using known class labels to predict the class or value of a new data point. ## .orange[Unsupervised Learning] ### - No knowledge of output class or value – the data is unlabelled. ### - Goal: determine data patterns/groupings. ]] .column[.content.vmiddle.center[ <img src="images/learning.png", width="80%"> ##### .purple[Credit to Towards Data Science] ]] --- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Regression] ] ]] --- class: split-two white .column.bg-main1[.content[ <br> # A simple prediction problem ### Consider the simulated dataset `Income`, which looks at the relationship between `Education` (years) and `Income` (thousands). ```r data <- read.csv("data/Income.csv") head(data) ``` ``` ## Education Income ## 1 10.00000 26.65884 ## 2 10.40134 27.30644 ## 3 10.84281 22.13241 ## 4 11.24415 21.16984 ## 5 11.64548 15.19263 ## 6 12.08696 26.39895 ``` ### What shape is the relationship? Can we build a function to predict the value of a new data point? ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="504" /> ]] --- class: middle center bg-main1 ## Linear Regression to the Rescue! <img src="images/ols.png", width="70%"> --- class: pink-code # .purple[What line is optimal?] <br> ### We can use a .purple[**linear function**] to describe our data and .purple[**predict**] a new data point. The question is, what is the .purple[*optimal line*]? -- ### One way to do this is to find the slope and intercept that .purple[*minimise*] the sum of the squared residuals between the line and our data points: <center> <img src="images/rss.png", width="25%"> </center> ### We can visualise this optimisation process using a .purple[**Shiny App**] [here](http://43.240.99.178:3838/sample-apps/LinearRegression/). --
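### In symbols, the quantity being minimised here (the residual sum of squares over all `\(n\)` data points) can be written as `$$\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2$$` where `\(\beta_0\)` is the intercept and `\(\beta_1\)` is the slope.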
### .purple[R] performs this optimisation process for us when we call the `lm` function. ###### (PS) see an alternative derivation [here](https://sarahromanes.github.io/post/gganimate/)! --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in R
using the `lm` function as follows: <br> ```r *fit <- lm(data=data, Income ~ Education) ``` ]] .column.bg-main3[.content.vmiddle.center[ # This tells the `lm` function what data we are referring to. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in
R using the `lm` function as follows: <br> ```r fit <- lm(data=data, * Income ~ Education) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## This tells the `lm` function what variables we would like to regress. ### R expects the relationship in the form of `response~predictors`. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in R
using the `lm` function as follows: ```r fit <- lm(data=data, Income ~ Education) *summary(fit) ``` ```r Call: lm(formula = Income ~ Education, data = data) Residuals: Min 1Q Median 3Q Max -13.046 -2.293 0.472 3.288 10.110 Coefficients: Estimate Std. Error t value Pr(>|t|) *(Intercept) -39.4463 4.7248 -8.349 4.4e-09 *** Education 5.5995 0.2882 19.431 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.653 on 28 degrees of freedom Multiple R-squared: 0.931, Adjusted R-squared: 0.9285 F-statistic: 377.6 on 1 and 28 DF, p-value: < 2.2e-16 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## We can use the `summary` function to examine regression coefficients, and information about the residuals of our model. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in
R using the `lm` function as follows: ```r fit <- lm(data=data, Income ~ Education) *summary(fit) ``` ```r Call: lm(formula = Income ~ Education, data = data) Residuals: Min 1Q Median 3Q Max -13.046 -2.293 0.472 3.288 10.110 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -39.4463 4.7248 -8.349 4.4e-09 *** *Education 5.5995 0.2882 19.431 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 5.653 on 28 degrees of freedom Multiple R-squared: 0.931, Adjusted R-squared: 0.9285 F-statistic: 377.6 on 1 and 28 DF, p-value: < 2.2e-16 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## We can use the `summary` function to examine regression coefficients, and information about the residuals of our model. ]]
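--- class: pink-code # .purple[Aside: seeing the fitted line] <br> ### Before predicting, it can help to see the fitted line over the raw data. A minimal sketch in base R, assuming `data` and `fit` from the previous slides:

```r
# Sketch: overlay the fitted regression line on the raw data
# (assumes `data` and `fit` from the previous slides)
plot(Income ~ Education, data = data, pch = 19, col = "grey40")
abline(fit, col = "purple", lwd = 2)  # draws the line using coef(fit)
coef(fit)                             # the fitted intercept and slope
```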
--- class: split-60 white .column.bg-main1[.content[ # We can fit a linear model in R using the `lm` function as follows: ```r fit <- lm(data=data, Income ~ Education) New_Data <- data.frame(Education = c(15, 18)) *predict(fit, New_Data) ``` ``` ## 1 2 ## 44.54599 61.34444 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `predict` function, we can predict the `Income` of a new `Education` value (or values). ]] --- class: split-two white .column.bg-main1[.content[ <br> # Modelling Binary Outcomes <br> ### Consider the dataset `Spiders` (Suzuki et al., 2006), from a study which looked at the relationship between the `GrainSize` of sand and `Spiders` presence. ```r data <- read.csv("data/Spiders.csv") head(data,3) ``` ``` ## GrainSize Spiders ## 1 0.245 0 ## 2 0.247 0 ## 3 0.285 1 ``` ### Can we use `lm` to predict the class for a new data point? ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ]] --- class: split-two white .column.bg-main1[.content[ <br> ## To model **binary data**, we need to .orange[link] our **predictors** to our response using a *link function*. <br> ### Instead of predicting the outcome directly, we predict the probability of being in class 1, given the linear combination of predictors, as follows: $$ p(y=1|\beta_0 + \beta_1 x) = \sigma(\beta_0 + \beta_1 x) $$ For the logistic (`logit`) link $$ p(y=1|\beta_0 + \beta_1 x) = \frac{1}{1+ \exp( - (\beta_0 + \beta_1 x))} $$ For the probit (`probit`) link $$ p(y=1|\beta_0 + \beta_1 x) = \Phi(\beta_0 + \beta_1 x) $$ ]] .column.white[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ]]
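--- class: pink-code # .purple[Aside: what the logit link looks like] <br> ### A quick sketch of the logistic curve itself, showing how any value of the linear predictor gets squashed into the interval `\((0,1)\)` (plain base R, nothing from the previous slides is needed):

```r
# Sketch: the logistic (logit) link maps the linear predictor onto (0, 1)
eta <- seq(-6, 6, by = 0.1)  # a range of linear predictor values
p   <- 1 / (1 + exp(-eta))   # the logistic transformation from the previous slide
plot(eta, p, type = "l", xlab = "linear predictor", ylab = "p(y = 1)")
```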
--- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R using the `glm` function as follows: <br> ```r *fit <- glm(data=data, Spiders~GrainSize, family=binomial(link="logit")) ``` ]] .column.bg-main3[.content.vmiddle.center[ # This tells the `glm` function what data we are referring to. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R
using the `glm` function as follows: <br> ```r fit <- glm(data=data, * Spiders~GrainSize, family=binomial(link="logit")) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## This tells the `glm` function what variables we would like to regress. ### Just like the `lm` function, R expects the relationship in the form of `response~predictors`. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R
using the `glm` function as follows: <br> ```r fit <- glm(data=data, Spiders~GrainSize, * family=binomial(link="logit")) ``` <center> OR </center> ```r fit <- glm(data=data, Spiders~GrainSize, * family=binomial(link="probit")) ``` <center> OR </center> ```r fit <- glm(data=data, Spiders~GrainSize, * family=binomial(link="cloglog")) ``` <center> and more... </center> ]] .column.bg-main3[.content.vmiddle.center[ ## This tells the `glm` function how we would like to model our response. For **binary** response data, we use the `binomial` family. ## Further, there are many ways we can link our linear combination of predictors to the 0,1 space. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in
R using the `glm` function as follows: <br> ```r fit <- glm(data=data, Spiders~GrainSize, family=binomial(link="logit")) summary(fit) ``` ```r Call: glm(formula = Spiders ~ GrainSize, family = binomial(link = "logit"), data = data) Deviance Residuals: Min 1Q Median 3Q Max -1.7406 -1.0781 0.4837 0.9809 1.2582 Coefficients: Estimate Std. Error z value Pr(>|z|) *(Intercept) -1.648 1.354 -1.217 0.2237 *GrainSize 5.122 3.006 1.704 0.0884 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Similar to the `lm` function, we can use the `summary` function to examine regression coefficients. ]] --- class: split-60 white .column.bg-main1[.content[ # We can fit a glm in R
using the `glm` function as follows: <br> ```r fit <- glm(data=data, Spiders~GrainSize, family=binomial(link="logit")) New_Data <- data.frame(GrainSize = c(0.4, 0.87)) *probs <- predict(fit, New_Data,type="response") probs ``` ``` ## 1 2 ## 0.5989270 0.9431134 ``` ```r *round(probs) ``` ``` ## 1 2 ## 1 1 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## And we can also use the `predict` function to estimate class membership probability, as well as use the `round` command to estimate class. ]] --- class: middle center bg-main1 ## Of course, probabilities close to 0.5 are hard to classify!! <img src="images/response.jpg", width="70%"> --- class: middle center bg-main1
<i class="fas fa-exclamation-triangle fa-7x faa-flash animated "></i>
# A warning for using GLMs! --- class: split-two white .column[.content[ ```r data.new <- read.csv("data/SpidersWarning.csv") ggplot(data.new, aes(x=GrainSize, y=Spiders)) + geom_point(col="red", size=3) ``` <img src="index_files/figure-html/unnamed-chunk-23-1.png" width="504" /> ]] .column.bg-main3[.content.vmiddle.center[ ## Suppose the scientist who collected their data thought the study would look more convincing if the data was **perfectly** separated, and sneakily modified their results. How would their glm look? ]] --- class: split-60 white .column[.content[ <br> ```r fit <- glm(data=data.new, Spiders~GrainSize, family=binomial(link="logit")) ``` ``` ## Warning: glm.fit: algorithm did not converge ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Warnings appear when we try to fit this glm... ]] --- class: split-60 white .column[.content[ <br> ```r fit <- glm(data=data.new, Spiders~GrainSize, family=binomial(link="logit")) ``` ``` ## Warning: glm.fit: algorithm did not converge ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` ```r *summary(fit) ``` ```r Call: glm(formula = Spiders ~ GrainSize, family = binomial(link = "logit"), data = data.new) Deviance Residuals: Min 1Q Median 3Q Max -8.087e-05 -2.100e-08 -2.100e-08 2.100e-08 7.488e-05 Coefficients: Estimate Std. Error z value Pr(>|z|) *(Intercept) -912.4 362618.2 -0.003 0.998 *GrainSize 1569.2 624478.6 0.003 0.998 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Warnings appear when we try to fit this glm... ## ...and, when we look at the `summary`, we can see our regression coefficients are blowing up, as well as the standard errors associated with them! ]] --- class: middle center bg-main1 <img src="images/warning.jpg", width="70%"> --- class: split-two white .column.bg-main1[.content[ <br> # Multiple Regression <br> ### Can we fit a model with more than one predictor? Of course we can! Consider the dataset `Exam`, where two exam scores are given for each student, and a `Label` represents whether they passed or failed the course. ```r data <- read.csv("data/Exam.csv", header=T) head(data,4) ``` ``` ## Exam1 Exam2 Label ## 1 34.62366 78.02469 0 ## 2 30.28671 43.89500 0 ## 3 35.84741 72.90220 0 ## 4 60.18260 86.30855 1 ``` ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-29-1.png" width="504" /> ]]
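--- class: pink-code # .purple[Aside: a quick look at the classes] <br> ### Before fitting, it can be worth checking how many students passed and failed. A minimal sketch, assuming the `Exam` data frame `data` loaded on the previous slide:

```r
# Sketch: class balance of the Exam data (0 = failed, 1 = passed)
table(data$Label)
```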
--- class: split-60 white .column.bg-main1[.content[ # We can fit the glm in R using the `glm` function as follows: <br> ```r fit <- glm(data=data, * Label ~ ., family=binomial(link="logit")) summary(fit) ``` ```r Call: glm(formula = Label ~ ., family = binomial(link = "logit"), data = data) Deviance Residuals: Min 1Q Median 3Q Max -2.19287 -0.18009 0.01577 0.19578 1.78527 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -25.16133 5.79836 -4.339 1.43e-05 *** Exam1 0.20623 0.04800 4.297 1.73e-05 *** Exam2 0.20147 0.04862 4.144 3.42e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` ]] .column.bg-main3[.content.vmiddle.center[ ### The formula `Label ~ . ` tells the `glm` function to regress against all (two) predictors, `Exam1` and `Exam2`. ]] --- class: split-two white .column.bg-main1[.content[ <br> # The Decision Boundary <br> ### We can plot the **decision boundary** on our scatterplot for `Exam1` vs `Exam2` by looking at the set of points for which our classifier predicts `$$p(y=1|\beta_0 + \beta_1 \text{Exam1} + \beta_2 \text{Exam2}) \geq 0.5$$` #### Given the `logit` link, this is true when `$$\beta_0 + \beta_1 \text{Exam1} + \beta_2 \text{Exam2} \geq 0$$` #### This is the **decision boundary** and can be rearranged as $$ \text{Exam2} \geq \frac{-\beta_0}{\beta_2} + \frac{-\beta_1}{\beta_2} \text{Exam1} $$ ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-32-1.png" width="504" /> ]] --- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Classification] ] ]] --- class: middle center bg-main1 <img src="images/food.jpg", width="65%"> --- class: split-two white .column.bg-main1[.content[ <br> # When we can't use GLMs <br> ### Consider the dataset `Microchips`, containing test scores `Test1` and `Test2`, and `Label` indicating whether or not the chip passed the test. ```r data <- read.csv("data/Microchips.csv") head(data,4) ``` ``` ## Test1 Test2 Label ## 1 0.051267 0.69956 1 ## 2 -0.092742 0.68494 1 ## 3 -0.213710 0.69225 1 ## 4 -0.375000 0.50219 1 ``` ### Why can't we use `glm` to predict the class for a new data point? ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-34-1.png" width="504" /> ]] --- # .purple[Different data types require different machine learning methods] <br> ## While we can indeed use logistic regression to .purple[**classify**] data points, this simply isn't feasible when: ### - We have high class separation in our data ### - We have a non-linear combination of predictors influencing our response (as in this case) <br> ## .purple[**So, what other options do we have?**] --- class: split-two white .column.bg-main1[.content[ <br> # The K-Nearest Neighbours Algorithm <br> ## .orange[**K Nearest Neighbours (KNN)**] is a non-parametric algorithm (it doesn't assume the data follows any particular shape). <br> ## In KNN, we vote on the class of a new data point by looking at the majority class of its .orange[**K**] nearest neighbours. ]] .column[.content.vmiddle.center[ <img src="images/knn2-Natasha-Latysheva.jpg", width="80%"> ##### .purple[Credit to Natasha Latysheva] ]]
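--- class: pink-code # .purple[Aside: the KNN vote by hand] <br> ### A minimal sketch of the voting idea above, done manually for a single made-up query point with `k = 3` (assuming the `Microchips` data frame `data` from the earlier slide; the point `(0, 0)` is purely for illustration):

```r
# Sketch: the KNN vote for one new point, done by hand with k = 3
# (assumes the Microchips data frame `data`; the query point is made up)
new_point <- c(Test1 = 0, Test2 = 0)
d <- sqrt((data$Test1 - new_point["Test1"])^2 +
          (data$Test2 - new_point["Test2"])^2)  # Euclidean distances
nearest <- order(d)[1:3]                        # the 3 closest training points
table(data$Label[nearest])                      # the majority class wins the vote
```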
--- class: split-two white .column.bg-main1[.content[ <br> # KNN in R <br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## First, we split our data into **predictors** (`X`) and response (`cl`, for *class*). ]] --- class: split-two white .column.bg-main1[.content[ <br> # KNN in R
<br> ### Next, we need to give the function new data points, as the `knn` function fits and predicts at the same time. We will use `expand.grid` to generate a grid of new points whose classes we will predict. ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) *new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), * by=0.1), * y=seq(min(X[,2]-0.5), max(X[,2]+0.5), * by=0.1)) ``` ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-37-1.png" width="504" /> ]]
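--- class: pink-code # .purple[Aside: what does the grid look like?] <br> ### A quick peek at what `expand.grid` produced, assuming `new.X` from the previous slide:

```r
# Sketch: the grid is every combination of the two sequences of x and y values
dim(new.X)       # number of grid points by 2 columns
head(new.X, 3)   # the first few grid points
```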
--- class: split-two white .column.bg-main1[.content[ <br> # KNN in R <br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), by=0.1), y=seq(min(X[,2]-0.5), max(X[,2]+0.5), by=0.1)) library(class) *classif <- knn(X, new.X, cl, k = 3, prob=TRUE) ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Now we call the `knn` function, passing in our data points, our new data, and the classes of the original data points, as well as how many neighbours we want to consider. In this example, we consider `k=3`. ]] --- class: split-two white .column.bg-main1[.content[ <br> # KNN in R
<br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), by=0.1), y=seq(min(X[,2]-0.5), max(X[,2]+0.5), by=0.1)) library(class) classif <- knn(X, new.X, cl, k = 3, prob=TRUE) ``` <br> <center> We can plot the decision boundaries, as well as the classification of our new points, as follows. Detailed R code is in slide raw files. </center> ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-40-1.png" width="504" /> ]] --- class: split-two white .column.bg-main1[.content[ <br> # KNN in
R <br> ```r X <- data[,1:2] cl <- as.factor(data[,3]) library(MASS) new.X <- expand.grid(x=seq(min(X[,1]-0.5), max(X[,1]+0.5), by=0.1), y=seq(min(X[,2]-0.5), max(X[,2]+0.5), by=0.1)) library(class) *classif <- knn(X, new.X, cl, k = 9, prob=TRUE) ``` <br> <center> We can also see how our decisions change based on the number of neighbours we consider! Here, we consider `k=9`. </center> ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-42-1.png" width="504" /> ]] --- class: middle center bg-main1 ## What other classifiers exist besides KNN? <img src="images/knn.jpg", width="60%"> --- class: split-two white .column[.content[ <br> # .purple[Decision Trees] <br> <img src="images/tree.jpg", width="90%"> <br> ]] .column.bg-main1[.content.vmiddle.center[ ### A Decision Tree determines the predictive value based on a series of questions and conditions. <br> ### When making Decision Trees, there are several factors we take into consideration. These include: which features do we make our decisions on? What is the threshold for classifying each question into a .orange[**yes**] or .orange[**no**] answer? <br> Image Source: [Egor Dezhic](https://becominghuman.ai) ]] --- class: pink-code # .purple[Decision Trees in R
using `rpart`] <br> ## Consider the `iris` dataset ```r head(iris) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ``` ## How can we construct a decision tree to determine the `Species` of `iris`? --- class: split-two white .column.bg-main1[.content[ <br> # Trees in
R ```r library(rpart) *tree <- rpart(Species ~ ., data = iris, method = "class") ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `rpart` function, we build a classification tree, using the formula `Species~.` to indicate that we wish to use all the variables to predict our class. ]]
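--- class: pink-code # .purple[Aside: printing the tree] <br> ### Before plotting, the fitted `rpart` object can simply be printed to see its splits as text. A minimal sketch, assuming `tree` from the previous slide:

```r
# Sketch: show the splits of the fitted classification tree as text
# (assumes `tree` from the previous slide)
print(tree)
```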
--- class: split-two white .column.bg-main1[.content[ <br> # Trees in R ```r library(rpart) tree <- rpart(Species ~ ., data = iris, method = "class") *library(rpart.plot) *rpart.plot(tree) ``` <br> ### Further, using the `rpart.plot` function from the `rpart.plot` library, we can visualise the results of our classification tree. ### Each node shows the predicted class, the predicted probability of each class, and the percentage of observations in each node. ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-46-1.png" width="504" height="70%" /> ]] --- class: split-two white .column.bg-main1[.content[ <br> # Trees in R
```r library(rpart) tree <- rpart(Species ~ ., data = iris, method = "class") library(rpart.plot) rpart.plot(tree) ``` ```r New_Data <-data.frame(Sepal.Length= 5.0, Sepal.Width = 3.9, Petal.Length = 1.4, Petal.Width = 0.3) *predict(tree, New_Data) ``` ``` ## setosa versicolor virginica ## 1 1 0 0 ``` <br> #### Using `predict`, we can predict the class of a new data point, much like in the previous examples. Alternatively - we could have predicted using the tree ourselves! ]] .column[.content.vmiddle.center[ <img src="index_files/figure-html/unnamed-chunk-49-1.png" width="504" height="70%" /> ]] --- # A single tree is prone to overfitting <br> <center> <img src="images/overfit.jpg", width="70%"> </center> <br> --- class: middle center # .purple[Ensemble Learning: Bagging] <img src="images/bagging.png", width="90%"> --- class: split-two white .column.bg-main1[.content[ <br> # From bagging to Random Forest <br> ### With Random Forest we randomly select a subset of predictors to use for each tree. <br> ### The reason? Tree correlation. <br> ### If one or a few features are very strong predictors for the response variable, then these features will be selected in many of the trees, causing them to become correlated. .orange[This can reduce the accuracy of the ensemble.] ]] .column[.content.vmiddle.center[ <img src="images/randomforest.png", width="85%"> ###### .black[Source:] [GSS](http://www.globalsoftwaresupport.com/random-forest-classifier-bagging-machine-learning/) ]] --- class: split-two white .column.bg-main1[.content[ <br> # `randomForest` in
R ```r library(randomForest) *iris.rf <- randomForest(Species ~ ., iris) iris.rf ``` ``` ## ## Call: ## randomForest(formula = Species ~ ., data = iris) ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 2 ## ## OOB estimate of error rate: 4% ## Confusion matrix: ## setosa versicolor virginica class.error ## setosa 50 0 0 0.00 ## versicolor 0 47 3 0.06 ## virginica 0 3 47 0.06 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `randomForest` function, we can train our ensemble using the same formula we passed to `rpart`. ]] --- class: split-two white .column.bg-main1[.content[ <br> # `randomForest` in R
```r library(randomForest) iris.rf <- randomForest(Species ~ ., iris) *predict(iris.rf, New_Data) ``` ``` ## 1 ## setosa ## Levels: setosa versicolor virginica ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Again, we can use the same `New_Data` values in the `predict` function as before to reach the same prediction we did with `rpart`. ]] --- class: pink-code # .purple[A bigger example] ### Let's look at the `SRBCT` dataset, from the package `multiDA`. This data has a response variable `vy`, which is a `factor` of four different cancer subtypes, and gene expression data from 1586 genes. In total, there are 63 observations. ```r library(multiDA) table(SRBCT$vy) ``` ``` ## ## 1 2 3 4 ## 8 23 12 20 ``` ```r dim(SRBCT$mX) ``` ``` ## [1] 63 1586 ``` --- class: split-two white .column[.content[ ```r *tree.SRBCT <- rpart(SRBCT$vy ~ SRBCT$mX, method = "class") rpart.plot(tree.SRBCT) ``` <img src="index_files/figure-html/unnamed-chunk-53-1.png" width="504" /> ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `rpart` function along with `rpart.plot`, we can see the tree used to classify the `SRBCT` data. Note how simple it is! ### PS - note here the alternative formulation of the input to the `rpart` function. If we input `X` and `y` directly, we don't need to reference the data set. ]] --- class: split-two white .column[.content[ <br> ```r *SRBCT.rf <- randomForest(SRBCT$mX, SRBCT$vy) SRBCT.rf ``` ``` ## ## Call: ## randomForest(x = SRBCT$mX, y = SRBCT$vy) ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 39 ## ## OOB estimate of error rate: 0% ## Confusion matrix: ## 1 2 3 4 class.error ## 1 8 0 0 0 0 ## 2 0 23 0 0 0 ## 3 0 0 12 0 0 ## 4 0 0 0 20 0 ``` ]] .column.bg-main3[.content.vmiddle.center[ ## Using the `randomForest` function we can also classify on the same data. ### PS, here I use yet another formulation of input into the `randomForest` function. Note here how `X` comes before `y`! ]] --- class: split-two white .column[.content[ <br> ```r set.seed(1) *SRBCT.rf <- randomForest(SRBCT$mX, as.numeric(SRBCT$vy)) SRBCT.rf ``` ```r Call: randomForest(x = SRBCT$mX, y = as.numeric(SRBCT$vy)) * Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 528 Mean of squared residuals: 0.1641464 % Var explained: 85.07 ``` ]] .column.bg-main3[.content.vmiddle.center[ ##
<i class="fas fa-exclamation-triangle fa-2x faa-flash animated "></i>
Warning: when using `randomForest`, make sure that your response is a **factor** if you want classification. Here, because `SRBCT$vy` was converted with `as.numeric`, `randomForest` instead tried to *regress* the predictors to predict a continuous response! ]] --- class: split-two white .column.bg-main1[.content[ <br> # There are many more machine learning algorithms that I haven't touched on today! Some include... <br> ### - Support Vector Machines (`e1071`) ### - Boosted Tree Models (`xgboost`) ### - Discriminant Analysis methods (`MASS`, `sparsediscrim`, `multiDA`, `pamr`) ### - Regularised GLMs (`glmnet`) ### - Neural Networks (`neuralnet`, `keras`) ]] .column[.content.vmiddle.center[ <img src="images/zoolander.jpeg", width="80%"> ###### [Jacob Shiff](https://medium.com/@jacobshiff/machine-learning-a-brief-primer-ebd1ea265f71) ]] --- layout: false class: bg-main3 split-30 hide-slide-number .column[ ] .column.slide-in-right[.content.vmiddle[ .sliderbox.shade_main.pad1[ .font5[Thank you!] ] ]] --- class: bg-main1 # Big thanks go to: <br> ## - Emi Tanaka for developing the Kunoichi-themed slides, and also Yihui Xie for developing `xaringan`! ## - Mark Greenaway for helping to host my Shiny app. ## - The R-Ladies Sydney organisers - you guys rock! ##
<i class="fas fa-thumbs-up fa-5x faa-float animated "></i>
--- class: split-two white .column.bg-main1[.content.vmiddle.center[ # Stay tuned for .orange[Part 2] next Wednesday - 'Assessing Model Performance'! #
<i class="fas fa-cogs fa-5x faa-vertical animated "></i>
]] .column[.content.vmiddle.center[ <img src="images/machinewar.png", width="70%"> ###### [SMBC](https://www.smbc-comics.com/comic/rise-of-the-machines) ]]