Creating a Model to Predict if a Bank Customer Accepts Personal Loans
In this post, we will fit a multiple logistic regression model to predict the probability of a bank customer accepting a personal loan, based on several variables described later. Logistic regression is a supervised learning algorithm where the dependent variable is qualitative; in this case, it corresponds to the acceptance or rejection of a personal loan. This tutorial will build several logistic regression models and assess them.
The data called UniversalBank comes again from the handbook ‘Data Mining for Business Analytics: Concepts, Techniques, and Applications in R’. The bank’s business goal is to find the best combination of variables that can increase the probability of loan acceptance.
Data Exploration
First, we must load our libraries.
library(here)
library(tidyverse) # data wrangling, data visualization
library(broom) # tidy statistics
library(caret) # apply machine learning algorithms
library(janitor) # tidy dataframes
library(MASS) # in this case it's used for the stepwise regression
library(readxl) # open excel files
options(scipen = 999) # number formatting to not include scientific notation
Afterwards, we load our data frame and explore it. We also need to transform several of our variables into factors so that our logistic regression models interpret them correctly.
# open dataset
bank <- read_excel(here::here("UniversalBank.xlsx"),
skip = 3,
sheet = 2) %>%
clean_names() %>%
mutate_at(vars(education, personal_loan, securities_account, cd_account,
online, credit_card), as.factor)
glimpse(bank)
## Observations: 5,000
## Variables: 14
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ age <dbl> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 65,...
## $ experience <dbl> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5, ...
## $ income <dbl> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180, 1...
## $ zip_code <dbl> 91107, 90089, 94720, 94112, 91330, 92121, 9...
## $ family <dbl> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4, 1...
## $ cc_avg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0.6...
## $ education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 1...
## $ mortgage <dbl> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0, ...
## $ personal_loan <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ securities_account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ cd_account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0...
## $ credit_card <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
Here we can see each variable's description in more detail:
Let’s continue our data exploration. We can summarize our data:
# explore dataset
summary(bank)
## id age experience income
## Min. : 1 Min. :23.00 Min. :-3.0 Min. : 8.00
## 1st Qu.:1251 1st Qu.:35.00 1st Qu.:10.0 1st Qu.: 39.00
## Median :2500 Median :45.00 Median :20.0 Median : 64.00
## Mean :2500 Mean :45.34 Mean :20.1 Mean : 73.77
## 3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0 3rd Qu.: 98.00
## Max. :5000 Max. :67.00 Max. :43.0 Max. :224.00
## zip_code family cc_avg education
## Min. : 9307 Min. :1.000 Min. : 0.000 1:2096
## 1st Qu.:91911 1st Qu.:1.000 1st Qu.: 0.700 2:1403
## Median :93437 Median :2.000 Median : 1.500 3:1501
## Mean :93153 Mean :2.396 Mean : 1.938
## 3rd Qu.:94608 3rd Qu.:3.000 3rd Qu.: 2.500
## Max. :96651 Max. :4.000 Max. :10.000
## mortgage personal_loan securities_account cd_account online
## Min. : 0.0 0:4520 0:4478 0:4698 0:2016
## 1st Qu.: 0.0 1: 480 1: 522 1: 302 1:2984
## Median : 0.0
## Mean : 56.5
## 3rd Qu.:101.0
## Max. :635.0
## credit_card
## 0:3530
## 1:1470
##
##
##
##
It’s also important to check for any missing values.
# check missing values
sum(is.na(bank))
## [1] 0
Now that we have confirmed no values are missing, we can start with the visual exploration:
# boxplot
bank %>%
ggplot(aes(x = personal_loan, y = age, colour = personal_loan)) +
geom_boxplot()
# histogram
bank %>%
ggplot(aes(income, fill = personal_loan)) +
geom_histogram(colour = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks like the median age of customers who accepted the loan is similar to the median age of those who did not. However, the histogram shows that far fewer customers accepted the loan than rejected it.
This class imbalance (many more customers rejecting the loan than accepting it) is something we need to take into account when building our model, and we will come back to it later in this post.
For now, we can compute this imbalance in a more concrete way.
bank %>%
count(personal_loan) %>%
mutate(prop = n / sum(n))
## # A tibble: 2 x 3
## personal_loan n prop
## <fct> <int> <dbl>
## 1 0 4520 0.904
## 2 1 480 0.096
Thus, only around 10% of our customers accepted the loan.
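To make this imbalance visible at a glance, we could also draw a simple bar chart with the same ggplot2 syntax used above (an optional sketch, not part of the original exploration):
# bar chart of the class balance (illustrative sketch)
bank %>%
ggplot(aes(x = personal_loan, fill = personal_loan)) +
geom_bar()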
Building Logistic Regression Models
In this section, we will build our train and test datasets. The goal is to fit a model on the training data and then apply it to the test data to check how well it generalizes to new observations. To create the train and test datasets, we can use the createDataPartition() function from the caret package. But before that, we should tidy our data frame.
bank_tidy <- bank %>%
dplyr::select(-id, -zip_code) # delete columns id and zipcode - not relevant for the logistic regression model
glimpse(bank_tidy) # overview of the tidied dataframe
## Observations: 5,000
## Variables: 12
## $ age <dbl> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 65,...
## $ experience <dbl> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5, ...
## $ income <dbl> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180, 1...
## $ family <dbl> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4, 1...
## $ cc_avg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0.6...
## $ education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 1...
## $ mortgage <dbl> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0, ...
## $ personal_loan <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ securities_account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ cd_account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0...
## $ credit_card <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
bank_tidy$personal_loan <- factor(bank_tidy$personal_loan, levels=c(0,1), labels=c("No","Yes")) # label our dependent variable
Now we can create the train and test datasets:
set.seed(1234)
partition <- createDataPartition(bank_tidy$personal_loan,
p = 0.7,
list = FALSE)
train <- bank_tidy[partition, ]
test <- bank_tidy[-partition, ]
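Because createDataPartition() samples within the levels of the outcome, the roughly 90/10 class split should be preserved in both partitions; a quick check (a small sketch, not part of the original code) could look like this:
# check that the stratified split preserved the class proportions
train %>%
count(personal_loan) %>%
mutate(prop = n / sum(n))
test %>%
count(personal_loan) %>%
mutate(prop = n / sum(n))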
Now, we will create our first logistic regression model called model0:
model0 <- glm(personal_loan ~., data = train,
family = "binomial")
summary(model0)
##
## Call:
## glm(formula = personal_loan ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0710 -0.1969 -0.0750 -0.0243 3.8229
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.6886570 2.1951529 -5.325 0.000000101087806
## age -0.0450697 0.0817190 -0.552 0.581277
## experience 0.0551820 0.0810369 0.681 0.495904
## income 0.0567793 0.0033629 16.884 < 0.0000000000000002
## family 0.6535239 0.0907246 7.203 0.000000000000587
## cc_avg 0.2176102 0.0514232 4.232 0.000023187946205
## education2 3.6701132 0.3076904 11.928 < 0.0000000000000002
## education3 3.6990124 0.3052904 12.116 < 0.0000000000000002
## mortgage 0.0006775 0.0007203 0.941 0.346916
## securities_account1 -0.8186157 0.3396475 -2.410 0.015944
## cd_account1 3.6962953 0.4023738 9.186 < 0.0000000000000002
## online1 -0.7498566 0.1940857 -3.864 0.000112
## credit_card1 -1.1885202 0.2605629 -4.561 0.000005082427535
##
## (Intercept) ***
## age
## experience
## income ***
## family ***
## cc_avg ***
## education2 ***
## education3 ***
## mortgage
## securities_account1 *
## cd_account1 ***
## online1 ***
## credit_card1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2213.43 on 3499 degrees of freedom
## Residual deviance: 847.63 on 3487 degrees of freedom
## AIC: 873.63
##
## Number of Fisher Scoring iterations: 8
The model shows that all variables except age, professional experience, and mortgage significantly predict a bank customer's loan acceptance. The coefficient estimates describe the relationship between the predictors and the dependent variable on a log-odds scale. For instance, an increase of one family member is associated with an increase of 0.654 units in the log odds of accepting a personal loan.
This can be interpreted in a more concrete manner by computing the odds.
exp(coef(model0))
## (Intercept) age experience
## 0.000008388432 0.955930861623 1.056732872249
## income family cc_avg
## 1.058422211865 1.922302973299 1.243102411023
## education2 education3 mortgage
## 39.256351249957 40.407379004231 1.000677707014
## securities_account1 cd_account1 online1
## 0.441041769988 40.297738262092 0.472434286781
## credit_card1
## 0.304671776448
This shows that an increase of one family member increases the odds of a bank customer accepting a personal loan by a factor of 1.922.
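If we also want confidence intervals around these odds ratios in a tidy table, broom can provide them (a complementary sketch; the point estimates should match the exp(coef()) output above):
# odds ratios with 95% confidence intervals (sketch)
broom::tidy(model0, exponentiate = TRUE, conf.int = TRUE) %>%
dplyr::select(term, estimate, conf.low, conf.high)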
Let’s now check whether we have outliers in our data set. If we find them, we will not delete them; it’s just something to keep in mind when doing this type of analysis. Note: we will do this analysis only for model0. First, we use the augment() function from the broom package to get the residuals of our model. Then, we use ggplot2 to visualize possible outliers.
#- check for some assumptions
# Residual Assessment
tidy_mod0 <- broom::augment(model0) %>%
mutate(index = 1:n())
tidy_mod0 %>%
ggplot(aes(x = index, y = .std.resid, colour = personal_loan)) +
geom_point(alpha = 0.4)
It looks like we have some cases with standardized residuals above 3 in absolute value. Let’s check how many there are and which cases they correspond to.
tidy_mod0 %>%
filter(abs(.std.resid) > 3) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 8
plot(model0, which = 4, id.n = 8)
In total, we have 8 cases exceeding 3, and the Cook’s distance plot above labels the corresponding case numbers.
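If we want to see exactly which rows these are, we can pull them out of the augmented data (a small sketch; index is the row counter we created above):
# list the flagged cases with their standardized residuals and Cook's distance
tidy_mod0 %>%
filter(abs(.std.resid) > 3) %>%
dplyr::select(index, personal_loan, .std.resid, .cooksd)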
Now, we should move on and focus on fitting our training data. To create model1, we will use stepwise regression with the stepAIC() function from the MASS package.
# model1
step_log <- stepAIC(model0, direction = "both")
## Start: AIC=873.63
## personal_loan ~ age + experience + income + family + cc_avg +
## education + mortgage + securities_account + cd_account +
## online + credit_card
##
## Df Deviance AIC
## - age 1 847.94 871.94
## - experience 1 848.10 872.10
## - mortgage 1 848.51 872.51
## <none> 847.63 873.63
## - securities_account 1 853.99 877.99
## - online 1 862.82 886.82
## - cc_avg 1 866.06 890.06
## - credit_card 1 871.61 895.61
## - family 1 905.35 929.35
## - cd_account 1 948.44 972.44
## - education 2 1110.12 1132.12
## - income 1 1353.89 1377.89
##
## Step: AIC=871.94
## personal_loan ~ experience + income + family + cc_avg + education +
## mortgage + securities_account + cd_account + online + credit_card
##
## Df Deviance AIC
## - mortgage 1 848.82 870.82
## - experience 1 849.79 871.79
## <none> 847.94 871.94
## + age 1 847.63 873.63
## - securities_account 1 854.20 876.20
## - online 1 863.10 885.10
## - cc_avg 1 866.47 888.47
## - credit_card 1 871.87 893.87
## - family 1 905.49 927.49
## - cd_account 1 949.00 971.00
## - education 2 1113.95 1133.95
## - income 1 1362.70 1384.70
##
## Step: AIC=870.82
## personal_loan ~ experience + income + family + cc_avg + education +
## securities_account + cd_account + online + credit_card
##
## Df Deviance AIC
## - experience 1 850.63 870.63
## <none> 848.82 870.82
## + mortgage 1 847.94 871.94
## + age 1 848.51 872.51
## - securities_account 1 855.11 875.11
## - online 1 863.90 883.90
## - cc_avg 1 866.69 886.69
## - credit_card 1 873.04 893.04
## - family 1 906.55 926.55
## - cd_account 1 950.10 970.10
## - education 2 1114.58 1132.58
## - income 1 1375.72 1395.72
##
## Step: AIC=870.63
## personal_loan ~ income + family + cc_avg + education + securities_account +
## cd_account + online + credit_card
##
## Df Deviance AIC
## <none> 850.63 870.63
## + experience 1 848.82 870.82
## + age 1 848.98 870.98
## + mortgage 1 849.79 871.79
## - securities_account 1 856.79 874.79
## - online 1 865.53 883.53
## - cc_avg 1 867.64 885.64
## - credit_card 1 874.19 892.19
## - family 1 907.85 925.85
## - cd_account 1 951.40 969.40
## - education 2 1115.49 1131.49
## - income 1 1376.23 1394.23
We can see which variables were kept for our model.
step_log
##
## Call: glm(formula = personal_loan ~ income + family + cc_avg + education +
## securities_account + cd_account + online + credit_card, family = "binomial",
## data = train)
##
## Coefficients:
## (Intercept) income family
## -12.53157 0.05694 0.65103
## cc_avg education2 education3
## 0.20736 3.63049 3.62850
## securities_account1 cd_account1 online1
## -0.80475 3.69255 -0.74104
## credit_card1
## -1.17166
##
## Degrees of Freedom: 3499 Total (i.e. Null); 3490 Residual
## Null Deviance: 2213
## Residual Deviance: 850.6 AIC: 870.6
Now, it’s time to create our model1.
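Since stepAIC() returns the glm object refitted with the selected formula, step_log is already this model; the one-line shortcut below is equivalent to the explicit refit that follows, which we keep so the chosen formula is visible in the code.
# equivalent shortcut (sketch): reuse the object returned by stepAIC()
model1 <- step_log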
model1 <- glm(personal_loan ~ income + family + cc_avg + education +
securities_account + cd_account + online + credit_card, family = "binomial",
data = train)
summary(model1)
##
## Call:
## glm(formula = personal_loan ~ income + family + cc_avg + education +
## securities_account + cd_account + online + credit_card, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0644 -0.2006 -0.0757 -0.0248 3.8089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.531566 0.649065 -19.307 < 0.0000000000000002 ***
## income 0.056935 0.003338 17.057 < 0.0000000000000002 ***
## family 0.651028 0.090765 7.173 0.000000000000735 ***
## cc_avg 0.207358 0.051010 4.065 0.000048021236068 ***
## education2 3.630494 0.305323 11.891 < 0.0000000000000002 ***
## education3 3.628497 0.299617 12.110 < 0.0000000000000002 ***
## securities_account1 -0.804754 0.338964 -2.374 0.017589 *
## cd_account1 3.692554 0.401937 9.187 < 0.0000000000000002 ***
## online1 -0.741044 0.193565 -3.828 0.000129 ***
## credit_card1 -1.171660 0.258974 -4.524 0.000006061455975 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2213.43 on 3499 degrees of freedom
## Residual deviance: 850.63 on 3490 degrees of freedom
## AIC: 870.63
##
## Number of Fisher Scoring iterations: 8
All variables in this model predict loan acceptance. We will create one last model, called model2, using the most important predictors present in model1. To identify the most relevant predictors, we will use the varImp() function from the caret package.
caret::varImp(model1) %>%
tibble::rownames_to_column("variable") %>%
arrange(desc(Overall))
## variable Overall
## 1 income 17.057007
## 2 education3 12.110460
## 3 education2 11.890650
## 4 cd_account1 9.186890
## 5 family 7.172696
## 6 credit_card1 4.524234
## 7 cc_avg 4.065055
## 8 online1 3.828393
## 9 securities_account1 2.374156
We will keep the most important variables, dropping securities_account (the weakest predictor), and create model2.
# model2
model2 <- glm(personal_loan ~ income + family + education + cd_account + credit_card + cc_avg +
online, family = "binomial",
data = train)
summary(model2)
##
## Call:
## glm(formula = personal_loan ~ income + family + education + cd_account +
## credit_card + cc_avg + online, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9380 -0.2015 -0.0754 -0.0249 3.8286
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.67077 0.64658 -19.596 < 0.0000000000000002 ***
## income 0.05731 0.00333 17.211 < 0.0000000000000002 ***
## family 0.65229 0.09047 7.210 0.000000000000559 ***
## education2 3.65868 0.30516 11.989 < 0.0000000000000002 ***
## education3 3.62688 0.29823 12.161 < 0.0000000000000002 ***
## cd_account1 3.23195 0.34597 9.342 < 0.0000000000000002 ***
## credit_card1 -1.06774 0.25258 -4.227 0.000023642291067 ***
## cc_avg 0.20572 0.05079 4.051 0.000051093604328 ***
## online1 -0.70415 0.19220 -3.664 0.000249 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2213.43 on 3499 degrees of freedom
## Residual deviance: 856.79 on 3491 degrees of freedom
## AIC: 874.79
##
## Number of Fisher Scoring iterations: 8
Assessment of the Logistic Regression Models
Now, with the help of the caret package, we have to assess which model performs better. We will use three key performance metrics: accuracy, PPV (positive predictive value), and NPV (negative predictive value). Accuracy is the number of true positives (TP) plus true negatives (TN) divided by the total TP + TN + false positives (FP) + false negatives (FN). PPV is the number of cases correctly identified as positive (TP) divided by TP + FP. NPV is the number of cases correctly identified as negative (TN) divided by TN + FN.
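Expressed as code, these definitions are just ratios of the four confusion matrix cells; here is a minimal illustrative sketch (the helper metrics_from_counts() is not part of the original workflow), checked against the model0 training counts reported further below:
# illustrative sketch: accuracy, PPV, and NPV from the four confusion matrix cells
metrics_from_counts <- function(tp, tn, fp, fn) {
c(accuracy = (tp + tn) / (tp + tn + fp + fn), # overall proportion of correct predictions
ppv = tp / (tp + fp), # proportion of predicted positives that are truly positive
npv = tn / (tn + fn)) # proportion of predicted negatives that are truly negative
}
metrics_from_counts(tp = 298, tn = 2843, fp = 321, fn = 38) # counts from model0 on the training set (see below)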
It’s now time to train our models and use the confusionMatrix() function to compute their metrics. Note: for each model, a sampling method called upsampling will be used due to the imbalance present in our dependent variable.
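Upsampling resamples the minority class ("Yes") with replacement until the two classes are balanced in the training data. Roughly what trainControl(sampling = "up") asks caret to do internally can be illustrated with caret's upSample() helper (a standalone sketch, not used in the models below):
# illustrative sketch: balance the classes by resampling the minority class
up_train <- caret::upSample(x = dplyr::select(train, -personal_loan),
y = train$personal_loan,
yname = "personal_loan")
table(up_train$personal_loan) # both classes now have the same number of rows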
- model0
# model0
glm0 <- train(personal_loan ~., method = "glm",
family = "binomial",
data = train,
trControl = trainControl(method = "none",
sampling = "up")) # upsampling used because of the class imbalance
confusionMatrix(predict(glm0,
train), train$personal_loan, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2843 38
## Yes 321 298
##
## Accuracy : 0.8974
## 95% CI : (0.8869, 0.9073)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 0.9103
##
## Kappa : 0.5707
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.88690
## Specificity : 0.89855
## Pos Pred Value : 0.48142
## Neg Pred Value : 0.98681
## Prevalence : 0.09600
## Detection Rate : 0.08514
## Detection Prevalence : 0.17686
## Balanced Accuracy : 0.89273
##
## 'Positive' Class : Yes
##
- model1
glm1 <- train(personal_loan ~ income + family + cc_avg + education +
securities_account + cd_account + online + credit_card, method = "glm",
family = "binomial",
data = train,
trControl = trainControl(method = "none",
sampling = "up"))
confusionMatrix(predict(glm1,
train), train$personal_loan, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2843 39
## Yes 321 297
##
## Accuracy : 0.8971
## 95% CI : (0.8866, 0.907)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 0.919
##
## Kappa : 0.569
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.88393
## Specificity : 0.89855
## Pos Pred Value : 0.48058
## Neg Pred Value : 0.98647
## Prevalence : 0.09600
## Detection Rate : 0.08486
## Detection Prevalence : 0.17657
## Balanced Accuracy : 0.89124
##
## 'Positive' Class : Yes
##
- model2
glm2 <- train(personal_loan ~ income + family + education + cd_account +
credit_card + cc_avg + online, method = "glm",
family = "binomial",
data = train,
trControl = trainControl(method = "none",
sampling = "up"))
confusionMatrix(predict(glm2,
train), train$personal_loan, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2852 37
## Yes 312 299
##
## Accuracy : 0.9003
## 95% CI : (0.8899, 0.91)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 0.7816
##
## Kappa : 0.5794
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.88988
## Specificity : 0.90139
## Pos Pred Value : 0.48936
## Neg Pred Value : 0.98719
## Prevalence : 0.09600
## Detection Rate : 0.08543
## Detection Prevalence : 0.17457
## Balanced Accuracy : 0.89564
##
## 'Positive' Class : Yes
##
Looking at the three confusion matrices, model2 has the highest accuracy, PPV, and NPV, although model0 and model1 are also highly accurate. However, we still have to check how these models generalize to new data.
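If we wanted this comparison in a single table rather than three printouts, we could store the confusionMatrix() results and extract the relevant entries (a sketch using a hypothetical helper compare_models()):
# sketch: collect accuracy, PPV, and NPV for the three models in one table
compare_models <- function(fit, data) {
cm <- confusionMatrix(predict(fit, data), data$personal_loan, positive = "Yes")
c(accuracy = unname(cm$overall["Accuracy"]),
ppv = unname(cm$byClass["Pos Pred Value"]),
npv = unname(cm$byClass["Neg Pred Value"]))
}
sapply(list(model0 = glm0, model1 = glm1, model2 = glm2), compare_models, data = train)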
Generalization of the Logistic Regression Models to new Data
Now, we should check how the models generalize to new data.
- model0
# model0
confusionMatrix(predict(glm0,
test), test$personal_loan, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1227 9
## Yes 129 135
##
## Accuracy : 0.908
## 95% CI : (0.8922, 0.9221)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 0.3181
##
## Kappa : 0.6138
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.9375
## Specificity : 0.9049
## Pos Pred Value : 0.5114
## Neg Pred Value : 0.9927
## Prevalence : 0.0960
## Detection Rate : 0.0900
## Detection Prevalence : 0.1760
## Balanced Accuracy : 0.9212
##
## 'Positive' Class : Yes
##
- model1
# model1
confusionMatrix(predict(glm1,
test), test$personal_loan, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1226 9
## Yes 130 135
##
## Accuracy : 0.9073
## 95% CI : (0.8915, 0.9215)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 0.3503
##
## Kappa : 0.6119
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.9375
## Specificity : 0.9041
## Pos Pred Value : 0.5094
## Neg Pred Value : 0.9927
## Prevalence : 0.0960
## Detection Rate : 0.0900
## Detection Prevalence : 0.1767
## Balanced Accuracy : 0.9208
##
## 'Positive' Class : Yes
##
- model2
# model2
confusionMatrix(predict(glm2,
test), test$personal_loan, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1226 11
## Yes 130 133
##
## Accuracy : 0.906
## 95% CI : (0.8901, 0.9203)
## No Information Rate : 0.904
## P-Value [Acc > NIR] : 0.4177
##
## Kappa : 0.6045
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.92361
## Specificity : 0.90413
## Pos Pred Value : 0.50570
## Neg Pred Value : 0.99111
## Prevalence : 0.09600
## Detection Rate : 0.08867
## Detection Prevalence : 0.17533
## Balanced Accuracy : 0.91387
##
## 'Positive' Class : Yes
##
All models maintain or slightly improve their metrics on the test dataset, so all of them generalize well to new data.
Nonetheless, to keep our model simpler, we will choose model2. Now that we have our final model, it’s time for some predictions. Let us imagine we have two bank customers who share the same values on every variable of model2 except one: income. Customer A has an annual income of 100 thousand dollars, while customer B has an annual income of 45 thousand dollars. We can use the predict() function to compare their probabilities of accepting a loan based on this income difference.
predict(model2, data.frame(income = c(100, 45),
family = c(3, 3),
cc_avg = c(0.8, 0.8),
education = c("3", "3"),
securities_account = c("1", "1"),
cd_account = c("1", "1"),
credit_card = c("1", "1"),
online = c("1", "1")),
type = "response")
## 1 2
## 0.56672287 0.05297574
As a result, the probability of accepting a loan is 56.67% for customer A, while for customer B it is only 5.30%.
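If we needed a hard yes/no decision rather than a probability, we could apply a cut-off to these predictions (0.5 here purely as an illustration; with imbalanced classes a different threshold might be preferable):
# sketch: turn the predicted probabilities into a yes/no decision at a 0.5 cut-off
new_customers <- data.frame(income = c(100, 45),
family = c(3, 3),
cc_avg = c(0.8, 0.8),
education = c("3", "3"),
securities_account = c("1", "1"),
cd_account = c("1", "1"),
credit_card = c("1", "1"),
online = c("1", "1"))
probs <- predict(model2, new_customers, type = "response")
ifelse(probs > 0.5, "Yes", "No")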
As mentioned in a previous post, a more advanced algorithm could have given us more predictive power. So, we should always keep that in mind. Hope you liked this post. Thanks again and feel free to contact me!