This post covers multiple linear regression in the context of machine learning. Linear regression is one of the simplest and most widely used approaches for supervised learning, and this tutorial will walk you through how to use it. I am also fairly new to machine learning, but I'm very interested in the area given the predictive power it offers, so let's hope I can help you. During the tutorial, we will build several multiple linear regression models, evaluate them, choose the most accurate one, and assess how well it generalizes to new data.

Throughout the post we will be using data extracted from the excellent book 'Data Mining for Business Analytics: Concepts, Techniques, and Applications in R'. The dataset is called Airfares and it dates from the 1990s. The data frame has several variables, as you will soon see. The question behind this dataset is how the deregulation of airlines in the 1970s, and consequently the entrance of new low-price carriers such as Southwest Airlines, influenced airfare prices. The business problem is to predict the average airfare on a new flight route. So, imagine that you are working for an airline that intends to open a new route in the USA and has data on airport characteristics, economic and demographic information about the cities where flights arrive and depart, flight distances, and average airfares. The company we work for wants to open a new route, but doesn't know what average price it should charge. Your goal is to find the model that best predicts the average airfare.

Loading and Exploring our Data Frame

Now, let’s first load the packages to be used in our analysis.

library(here) # set directory
library(tidyverse) # data wrangling and visualization
library(caret) # machine learning techniques
library(corrr) # visualize correlations
library(ggcorrplot) # visualize correlations
library(lmtest) # check homoscedasticity
library(MASS) # create stepwise regression models
library(car) # analyse homoscedasticity
library(yardstick) # check regression models metrics
library(broom) # tidy statistical models
options(scipen = 999) # disable scientific notation

Next, we open our dataset and start to explore it.

# open file
airfares <- read_csv(here::here("airfares.csv")) %>% 
  rename_all(str_to_lower) # all variables' names to lower case

# explore structure of the data frame
glimpse(airfares)
## Observations: 638
## Variables: 18
## $ s_code   <chr> "*", "*", "*", "ORD", "MDW", "*", "*", "*", "*", "*",...
## $ s_city   <chr> "Dallas/Fort Worth   TX", "Atlanta             GA", "...
## $ e_code   <chr> "*", "*", "*", "*", "*", "*", "*", "*", "*", "*", "*"...
## $ e_city   <chr> "Amarillo            TX", "Baltimore/Wash Intl MD", "...
## $ coupon   <dbl> 1.00, 1.06, 1.06, 1.06, 1.06, 1.01, 1.28, 1.15, 1.33,...
## $ new      <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3,...
## $ vacation <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "No"...
## $ sw       <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", ...
## $ hi       <dbl> 5291.99, 5419.16, 9185.28, 2657.35, 2657.35, 3408.11,...
## $ s_income <dbl> 28637, 26993, 30124, 29260, 29260, 26046, 28637, 2675...
## $ e_income <dbl> 21112, 29838, 29838, 29838, 29838, 29838, 29838, 2983...
## $ s_pop    <dbl> 3036732, 3532657, 5787293, 7830332, 7830332, 2230955,...
## $ e_pop    <dbl> 205711, 7145897, 7145897, 7145897, 7145897, 7145897, ...
## $ slot     <chr> "Free", "Free", "Free", "Controlled", "Free", "Free",...
## $ gate     <chr> "Free", "Free", "Free", "Free", "Free", "Free", "Free...
## $ distance <dbl> 312, 576, 364, 612, 612, 309, 1220, 921, 1249, 964, 2...
## $ pax      <dbl> 7864, 8820, 6452, 25144, 25144, 13386, 4625, 5512, 78...
## $ fare     <dbl> 64.11, 174.47, 207.76, 85.47, 85.47, 56.76, 228.00, 1...

A glimpse() of the data shows that we have 18 variables. In the following table, copied from the book 'Data Mining for Business Analytics', you can see a more extensive description of each variable.

Let’s further explore our dataset by using the summary() function to check our data.

# explore dataset
summary(airfares)
##     s_code             s_city             e_code         
##  Length:638         Length:638         Length:638        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##     e_city              coupon           new          vacation        
##  Length:638         Min.   :1.000   Min.   :0.000   Length:638        
##  Class :character   1st Qu.:1.040   1st Qu.:3.000   Class :character  
##  Mode  :character   Median :1.150   Median :3.000   Mode  :character  
##                     Mean   :1.202   Mean   :2.754                     
##                     3rd Qu.:1.298   3rd Qu.:3.000                     
##                     Max.   :1.940   Max.   :3.000                     
##       sw                  hi           s_income        e_income    
##  Length:638         Min.   : 1230   Min.   :14600   Min.   :14600  
##  Class :character   1st Qu.: 3090   1st Qu.:24706   1st Qu.:23903  
##  Mode  :character   Median : 4208   Median :28637   Median :26409  
##                     Mean   : 4442   Mean   :27760   Mean   :27664  
##                     3rd Qu.: 5481   3rd Qu.:29694   3rd Qu.:31981  
##                     Max.   :10000   Max.   :38813   Max.   :38813  
##      s_pop             e_pop             slot               gate          
##  Min.   :  29838   Min.   : 111745   Length:638         Length:638        
##  1st Qu.:1862106   1st Qu.:1228816   Class :character   Class :character  
##  Median :3532657   Median :2195215   Mode  :character   Mode  :character  
##  Mean   :4557004   Mean   :3194503                                        
##  3rd Qu.:7830332   3rd Qu.:4549784                                        
##  Max.   :9056076   Max.   :9056076                                        
##     distance           pax             fare       
##  Min.   : 114.0   Min.   : 1504   Min.   : 42.47  
##  1st Qu.: 455.0   1st Qu.: 5328   1st Qu.:106.29  
##  Median : 850.0   Median : 7792   Median :144.60  
##  Mean   : 975.7   Mean   :12782   Mean   :160.88  
##  3rd Qu.:1306.2   3rd Qu.:14090   3rd Qu.:209.35  
##  Max.   :2764.0   Max.   :73892   Max.   :402.02

Moreover, we will check if there are any missing values.

sum(is.na(airfares)) # check if there are missing values
## [1] 0

We don’t have any missing values.

Now we can visualize the distribution of our dependent variable named fare.

# visualize our dependent variable
ggplot(airfares, aes(fare)) +
  geom_histogram(binwidth = 10)

Next, we will remove the first four variables from our data frame (s_code, s_city, e_code, and e_city), since they identify the specific airports and cities and will not be used as predictors.

# remove variables 
airfares <- airfares %>%
  dplyr::select(-c(1:4)) # remove variables 1 to 4. These correspond to variables (s_code, s_city, e_code, and e_city)


glimpse(airfares)
## Observations: 638
## Variables: 14
## $ coupon   <dbl> 1.00, 1.06, 1.06, 1.06, 1.06, 1.01, 1.28, 1.15, 1.33,...
## $ new      <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3,...
## $ vacation <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "No"...
## $ sw       <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", ...
## $ hi       <dbl> 5291.99, 5419.16, 9185.28, 2657.35, 2657.35, 3408.11,...
## $ s_income <dbl> 28637, 26993, 30124, 29260, 29260, 26046, 28637, 2675...
## $ e_income <dbl> 21112, 29838, 29838, 29838, 29838, 29838, 29838, 2983...
## $ s_pop    <dbl> 3036732, 3532657, 5787293, 7830332, 7830332, 2230955,...
## $ e_pop    <dbl> 205711, 7145897, 7145897, 7145897, 7145897, 7145897, ...
## $ slot     <chr> "Free", "Free", "Free", "Controlled", "Free", "Free",...
## $ gate     <chr> "Free", "Free", "Free", "Free", "Free", "Free", "Free...
## $ distance <dbl> 312, 576, 364, 612, 612, 309, 1220, 921, 1249, 964, 2...
## $ pax      <dbl> 7864, 8820, 6452, 25144, 25144, 13386, 4625, 5512, 78...
## $ fare     <dbl> 64.11, 174.47, 207.76, 85.47, 85.47, 56.76, 228.00, 1...

Now, we will keep exploring the data and check for correlations between the numeric variables of our data frame.

We will use the package ggcorrplot for this.

# check for correlations

# create correlations dataframe
cor_df <- airfares %>% 
  select_if(is.numeric) %>%
  cor()

# visualize correlations with ggcorrplot
ggcorrplot(cor_df, hc.order = TRUE, type = "lower", 
           lab = TRUE)

From these correlations, we can verify that fare has a fairly high positive correlation with the variables coupon and distance. Note: coupon and distance are also highly correlated with each other, which can be a problem given the no-multicollinearity assumption of regression.
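
As a quick check on that note, we can look at the pairwise correlation between coupon and distance directly:

# check the coupon/distance correlation mentioned above
cor(airfares$coupon, airfares$distance)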

With the network_plot() function from the corrr package, we can see that, using a cutoff of 0.3 (in either the positive or negative direction), only distance, coupon, and e_income are correlated with the variable fare.

# network plot
cor_df %>%
  network_plot(min_cor = 0.3) # check network plot of correlations above 0.3 in the positive and negative direction

Creation of Regression Models

In the following step, we will split the data into a train and a test dataset, so that we can fit our model on the training data and then apply it to the test data to validate it and see whether it generalizes to new data. We will use the createDataPartition() function from the caret package to make this partition.

# partition the data
set.seed(1234)
partition <- createDataPartition(airfares$fare, p = 0.7, list = FALSE) # 70% of the rows go to the train data frame and 30% to the test data frame

# create train and test dataframes
train_airfares <- airfares[partition, ]

test_airfares <- airfares[-partition, ]

Let us now create our first regression model with our training set. We will name it model0.

# creation of model0
model0 <- lm(fare ~ ., data = train_airfares)
summary(model0)
## 
## Call:
## lm(formula = fare ~ ., data = train_airfares)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.556  -19.352   -1.793   20.303  106.427 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)   9.9488140255  31.5652926156   0.315             0.752775    
## coupon       15.0645997501  13.6701254718   1.102             0.271067    
## new          -4.4309062496   2.1911643532  -2.022             0.043770 *  
## vacationYes -36.0936214515   4.0706698108  -8.867 < 0.0000000000000002 ***
## swYes       -39.3376061587   4.2770307496  -9.197 < 0.0000000000000002 ***
## hi            0.0066408541   0.0011529652   5.760        0.00000001594 ***
## s_income      0.0013104057   0.0006152507   2.130             0.033743 *  
## e_income      0.0016628700   0.0004437508   3.747             0.000203 ***
## s_pop         0.0000037053   0.0000007535   4.918        0.00000124446 ***
## e_pop         0.0000048570   0.0000008752   5.550        0.00000004978 ***
## slotFree    -16.1624509148   4.4742546626  -3.612             0.000339 ***
## gateFree    -24.3946178248   4.7186388124  -5.170        0.00000035795 ***
## distance      0.0682020180   0.0040489197  16.844 < 0.0000000000000002 ***
## pax          -0.0009939716   0.0001674994  -5.934        0.00000000604 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.97 on 435 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.7968 
## F-statistic: 136.1 on 13 and 435 DF,  p-value: < 0.00000000000000022

We can see that all variables except coupon significantly predict the mean airfare. As mentioned before, this dataset is concerned with the effect of the entrance of new low-price players, such as Southwest Airlines, on ticket prices. The regression shows that the presence of Southwest Airlines (swYes) is associated with a decrease of about $39.34 in the mean fare.
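
If you prefer a tidier view of these coefficients, the broom package (already loaded above) can return them as a data frame; a minimal sketch, adding 95% confidence intervals:

# tidy coefficient table for model0, with 95% confidence intervals
tidy(model0, conf.int = TRUE)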

As an example, we can check some assumptions of the regression in our model. Note: we will not do this for the models created afterwards; it is just to give you an idea of how to test some of the regression assumptions. First, we can check for multicollinearity, that is, whether there is a high linear association between the predictor variables of our model. To check it, we can use the vif() function.

# test assumption of multicollinearity
car::vif(model0)
##   coupon      new vacation       sw       hi s_income e_income    s_pop 
## 3.054571 1.037796 1.331906 1.520914 1.530992 1.837109 1.567817 2.049647 
##    e_pop     slot     gate distance      pax 
## 2.299528 1.601839 1.305077 2.691463 2.035736

All values are below 5, suggesting that multicollinearity is not a problem in our model.
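
If you wanted to flag problematic predictors programmatically, here is a small sketch using the common (and somewhat arbitrary) cutoff of 5:

# flag any predictor whose VIF exceeds 5 (returns an empty vector here)
vifs <- car::vif(model0)
vifs[vifs > 5]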

Second, we can check whether our residuals are homoscedastic. We will use the augment() function from the broom package to get the fitted values and residuals of our model. After that, we will visualize how the residuals are distributed.

# check the residuals for heteroscedasticity
aug <- augment(model0)

# visualize the residuals
aug %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("Residuals vs Fitted")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It looks like we have heteroscedasticity: the residuals are not distributed uniformly around zero. Using the bptest() function from the lmtest package, we can confirm this.

lmtest::bptest(model0)
## 
##  studentized Breusch-Pagan test
## 
## data:  model0
## BP = 51.155, df = 13, p-value = 0.000001886

The p-value is below 0.05, so the assumption of homoscedasticity is not satisfied. This result may suggest an incomplete model, but for the purposes of this post we will not give it too much attention. Nonetheless, you should know how to check the regression assumptions.
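
One common remedy for heteroscedasticity, which we will not pursue further in this post, is to model a transformed response. A minimal sketch, assuming a log transformation of fare is a sensible choice here:

# refit model0 with a log-transformed dependent variable (illustration only)
model0_log <- lm(log(fare) ~ coupon + new + vacation + sw + hi + s_income +
                   e_income + s_pop + e_pop + slot + gate + distance + pax,
                 data = train_airfares)
lmtest::bptest(model0_log) # re-run the Breusch-Pagan test on the transformed model

Keep in mind that predictions from such a model are on the log scale and would need to be back-transformed.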

Putting aside the regression assumptions, let's go back to fitting our training data. Now, we will use a stepwise regression method in order to find the model that minimizes the AIC. We can use the stepAIC() function from the MASS package.

# create stepwise regression model
step_model <- stepAIC(model0, direction = "both")
## Start:  AIC=3179.78
## fare ~ coupon + new + vacation + sw + hi + s_income + e_income + 
##     s_pop + e_pop + slot + gate + distance + pax
## 
##            Df Sum of Sq    RSS    AIC
## - coupon    1      1402 503512 3179.0
## <none>                  502111 3179.8
## - new       1      4720 506831 3182.0
## - s_income  1      5236 507347 3182.4
## - slot      1     15062 517173 3191.1
## - e_income  1     16209 518319 3192.0
## - s_pop     1     27913 530024 3202.1
## - gate      1     30851 532961 3204.6
## - e_pop     1     35553 537663 3208.5
## - hi        1     38294 540404 3210.8
## - pax       1     40647 542758 3212.7
## - vacation  1     90748 592859 3252.4
## - sw        1     97643 599754 3257.6
## - distance  1    327511 829622 3403.2
## 
## Step:  AIC=3179.03
## fare ~ new + vacation + sw + hi + s_income + e_income + s_pop + 
##     e_pop + slot + gate + distance + pax
## 
##            Df Sum of Sq     RSS    AIC
## <none>                   503512 3179.0
## + coupon    1      1402  502111 3179.8
## - s_income  1      4829  508342 3181.3
## - new       1      4921  508433 3181.4
## - e_income  1     15884  519396 3191.0
## - slot      1     16329  519842 3191.4
## - s_pop     1     27075  530587 3200.5
## - gate      1     31081  534594 3203.9
## - e_pop     1     35944  539456 3208.0
## - hi        1     37169  540682 3209.0
## - pax       1     55269  558781 3223.8
## - vacation  1     93001  596514 3253.1
## - sw        1    101235  604748 3259.3
## - distance  1    653326 1156839 3550.5

Now we can check which variables were dropped in the stepwise regression:

step_model
## 
## Call:
## lm(formula = fare ~ new + vacation + sw + hi + s_income + e_income + 
##     s_pop + e_pop + slot + gate + distance + pax, data = train_airfares)
## 
## Coefficients:
##   (Intercept)            new    vacationYes          swYes             hi  
##  30.873320641   -4.521016297  -36.433723009  -39.832892403    0.006293746  
##      s_income       e_income          s_pop          e_pop       slotFree  
##   0.001254125    0.001645024    0.000003637    0.000004882  -16.720450234  
##      gateFree       distance            pax  
## -24.482100200    0.071206980   -0.001066315

The stepwise procedure dropped only the variable coupon; the selected model still keeps new, which was only marginally significant in model0. We will drop new as well and call the resulting model model1.

model1 <- lm(fare ~ vacation + sw + hi + s_income + e_income + s_pop + e_pop + slot + gate + distance + pax, data = train_airfares)

summary(model1)
## 
## Call:
## lm(formula = fare ~ vacation + sw + hi + s_income + e_income + 
##     s_pop + e_pop + slot + gate + distance + pax, data = train_airfares)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.494 -20.583  -1.079  19.404 107.004 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)  21.0906196576  24.8650602431   0.848             0.396789    
## vacationYes -36.0005783086   4.0696256253  -8.846 < 0.0000000000000002 ***
## swYes       -39.7430471198   4.2700177449  -9.307 < 0.0000000000000002 ***
## hi            0.0061071890   0.0011098100   5.503      0.0000000637592 ***
## s_income      0.0012155887   0.0006152769   1.976             0.048820 *  
## e_income      0.0016050610   0.0004447919   3.609             0.000344 ***
## s_pop         0.0000036527   0.0000007538   4.845      0.0000017573978 ***
## e_pop         0.0000049468   0.0000008778   5.636      0.0000000313070 ***
## slotFree    -16.1206963093   4.4536261111  -3.620             0.000330 ***
## gateFree    -24.2961731658   4.7358432038  -5.130      0.0000004358992 ***
## distance      0.0705369455   0.0029872119  23.613 < 0.0000000000000002 ***
## pax          -0.0010711740   0.0001546931  -6.925      0.0000000000157 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.11 on 437 degrees of freedom
## Multiple R-squared:  0.8002, Adjusted R-squared:  0.7952 
## F-statistic: 159.1 on 11 and 437 DF,  p-value: < 0.00000000000000022

All variables significantly predict fare.

So, we now have two models - model0 and model1 - and next we will create two more - model2 and model3. Afterwards, we will assess the performance metrics of each of the four models.

For model2, we will remove the variables s_pop, e_pop, slot, and gate.

# create model2
model2 <- lm(fare ~ vacation + sw + hi + s_income + e_income + distance + pax, data = train_airfares)

summary(model2)
## 
## Call:
## lm(formula = fare ~ vacation + sw + hi + s_income + e_income + 
##     distance + pax, data = train_airfares)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.318  -23.866   -1.843   23.735  106.494 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  -0.1679937  24.1969685  -0.007              0.99446    
## vacationYes -49.7678724   4.1505747 -11.991 < 0.0000000000000002 ***
## swYes       -53.9011515   4.4273744 -12.175 < 0.0000000000000002 ***
## hi            0.0049267   0.0011901   4.140           0.00004164 ***
## s_income      0.0015086   0.0005732   2.632              0.00878 ** 
## e_income      0.0022815   0.0004674   4.882           0.00000147 ***
## distance      0.0717944   0.0031951  22.470 < 0.0000000000000002 ***
## pax          -0.0004592   0.0001491  -3.079              0.00220 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.95 on 441 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7464 
## F-statistic: 189.4 on 7 and 441 DF,  p-value: < 0.00000000000000022

For model3, we will keep only the variables vacation, sw, distance, and pax.

# create model3
model3 <- lm(fare ~ vacation + sw + distance + pax, data = train_airfares)

summary(model3)
## 
## Call:
## lm(formula = fare ~ vacation + sw + distance + pax, data = train_airfares)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.473  -25.127   -3.374   23.979  124.562 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept) 133.0357263   4.7533427  27.988 <0.0000000000000002 ***
## vacationYes -55.7762784   4.1795977 -13.345 <0.0000000000000002 ***
## swYes       -64.5156905   4.2763749 -15.087 <0.0000000000000002 ***
## distance      0.0680762   0.0030424  22.376 <0.0000000000000002 ***
## pax          -0.0003274   0.0001399  -2.341              0.0197 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 444 degrees of freedom
## Multiple R-squared:  0.722,  Adjusted R-squared:  0.7195 
## F-statistic: 288.2 on 4 and 444 DF,  p-value: < 0.00000000000000022

Assessment of the Regression Models

So, we've built our models, but we haven't yet assessed which one performs better. In the following assessment, we will focus on two key performance metrics: the root mean squared error (RMSE) and the R-squared. I will show you how to assess the models with the caret and yardstick packages.

# Assessing model Performance
# model0
lm0 <- train(fare ~ ., method = "lm", data = train_airfares,
                trControl = trainControl(method = "none"))

# model1
lm1 <- train(fare ~ vacation + sw + hi + s_income + e_income + 
               s_pop + e_pop + slot + gate + distance + pax, method = "lm", data = train_airfares,
                 trControl = trainControl(method = "none"))

# model2
lm2 <- train(fare ~ vacation + sw + hi + s_income + e_income + distance + pax, method = "lm", data = train_airfares,
                 trControl = trainControl(method = "none"))

# model3
lm3 <- train(fare ~ vacation + sw + distance + pax, method = "lm", data = train_airfares,
                 trControl = trainControl(method = "none"))


# create dataframe with the 4 models
agg_data <- train_airfares %>%
  mutate(regression0 = predict(lm0, train_airfares),
         regression1 = predict(lm1, train_airfares),
         regression2 = predict(lm2, train_airfares),
         regression3 = predict(lm3, train_airfares))

# use the function metrics from the yardstick package to assess the models
# metrics model0
metrics(agg_data, truth = fare, estimate = regression0)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      33.4  
## 2 rsq     standard       0.803
## 3 mae     standard      26.0
# metrics model1
metrics(agg_data, truth = fare, estimate = regression1)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      33.7  
## 2 rsq     standard       0.800
## 3 mae     standard      26.2
# metrics model2
metrics(agg_data, truth = fare, estimate = regression2)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      37.6  
## 2 rsq     standard       0.750
## 3 mae     standard      29.6
# metrics model3
metrics(agg_data, truth = fare, estimate = regression3)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      39.7  
## 2 rsq     standard       0.722
## 3 mae     standard      31.0
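
Just to make the rmse values above concrete: RMSE is simply the square root of the mean squared difference between the observed and predicted fares. A minimal by-hand check for model0, which should match the yardstick value of roughly 33.4:

# compute model0's RMSE by hand from its definition
agg_data %>%
  summarise(rmse_by_hand = sqrt(mean((fare - regression0)^2)))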

Thus, the metrics of these models show that model0 and model1 are the best ones: they have a lower RMSE and a higher R-squared. Between these two, we should choose model1, as it includes fewer variables. Nonetheless, we are still left with two steps.

First, let's check these four models visually.

# visualized assessment of the models
agg_data %>%
  gather(type_of_lm, output, regression0:regression3) %>%
  ggplot(aes(fare, output, color = type_of_lm)) +
  geom_point(size = 1.5, alpha = 0.5) +
  facet_wrap(~type_of_lm) +
  geom_abline(lty = 2, color = "gray50") +
  geom_smooth(method = "lm")

It supports what we said before: models 0 and 1 are the best ones. The regression line of both models is closer to the grey dashed line than that of models 2 and 3.

Generalization of the Regression Models to new Data

The last step is a critical one in machine learning.

Firstly, we have created and fitted the models.

Secondly, we have assessed the models, and it seems model0 and model1 are the best fit. However, we still do not know how the models compare when making predictions on new data. So, in this last step we need to check how our models perform on the test data.

# generalization to new data
# create dataframe with the 4 models
tests_agg_data <- test_airfares %>%
  mutate(regression0t = predict(lm0, test_airfares),
         regression1t = predict(lm1, test_airfares),
         regression2t = predict(lm2, test_airfares),
         regression3t = predict(lm3, test_airfares))

# use the yardstick package to check metrics
# metrics model0
metrics(tests_agg_data, truth = fare, estimate = regression0t)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      39.6  
## 2 rsq     standard       0.739
## 3 mae     standard      32.1
# metrics model1
metrics(tests_agg_data, truth = fare, estimate = regression1t)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      39.2  
## 2 rsq     standard       0.746
## 3 mae     standard      31.6
# metrics model2
metrics(tests_agg_data, truth = fare, estimate = regression2t)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      40.6  
## 2 rsq     standard       0.726
## 3 mae     standard      32.5
# metrics model3
metrics(tests_agg_data, truth = fare, estimate = regression3t)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      43.6  
## 2 rsq     standard       0.688
## 3 mae     standard      34.6

We should also visualize our models.

# visualized assessment of the models
tests_agg_data %>%
  gather(type_of_lm, output, regression0t:regression3t) %>%
  ggplot(aes(fare, output, color = type_of_lm)) +
  geom_point(size = 1.5, alpha = 0.5) +
  facet_wrap(~type_of_lm) +
  geom_abline(lty = 2, color = "gray50") +
  geom_smooth(method = "lm")

The generalization shows that model0 and model1 still have a higher R-squared and a lower RMSE than the other models. However, on the test data the RMSE is higher and the R-squared is lower than on the train data, which means our models are slightly overfit.
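
As a side note, a more robust way to estimate out-of-sample performance (not used in this post) would be k-fold cross-validation, which caret supports directly. A minimal sketch for model1's formula, assuming 10 folds:

# 10-fold cross-validation of model1's formula with caret (illustration only)
set.seed(1234)
lm1_cv <- train(fare ~ vacation + sw + hi + s_income + e_income +
                  s_pop + e_pop + slot + gate + distance + pax,
                method = "lm", data = train_airfares,
                trControl = trainControl(method = "cv", number = 10))
lm1_cv$results # cross-validated RMSE, R-squared and MAE

This would give an estimate of predictive performance that does not depend on a single train/test split.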

Choosing the Model

In sum, we have created, assessed, and compared our models. Now, we need to choose one. Given its lower RMSE and higher R-squared compared to model2 and model3, as well as its smaller number of variables compared to model0, we should choose model1 to make predictions.

Now that we have our final model, model1, let us imagine that we need to predict the average fare on a route with these characteristics: vacation = yes, sw = yes, hi = 3980, s_income = $35,000, e_income = $45,344, s_pop = 3,975,003, e_pop = 6,327,987, slot = free, gate = free, distance = 2410 miles, pax = 14200.

First, we should create a new data frame with this data.

# making predictions
new_df <- data.frame(vacation = "Yes", 
                     sw = "Yes",
                     hi = 3980,
                     s_income = 35000,
                     e_income = 45344,
                     s_pop = 3975003,
                     e_pop = 6327987,
                     slot = "Free",
                     gate = "Free",
                     distance = 2410,
                     pax = 14200)

Next, we should use the predict() function with two arguments: model1 and the new data frame.

predict(model1, newdata = new_df)
##        1 
## 245.1689

Our model predicts that a route with these characteristics will have an average fare of about $245.17.
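
If we also want a sense of the uncertainty around this point estimate, predict() for lm models can return a prediction interval as well:

# point estimate plus a 95% prediction interval for the new route
predict(model1, newdata = new_df, interval = "prediction")

Given model1's residual standard error of about 34, expect this interval to be fairly wide.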

That's the amazing thing about machine learning: its predictive ability constantly persuades me to learn more and more about it. Of course, we should not take it at face value. Our model has limitations, and a more advanced algorithm would likely give us more predictive power.

I hope you have enjoyed this post. I'm still taking my first steps in machine learning, so if I have made any mistake please feel free to point it out. Thanks a lot and keep coding!