In this post I will use two of the most popular clustering methods, hierarchical clustering and k-means clustering, to analyse a data frame of financial variables for a set of pharmaceutical companies. Clustering is an unsupervised learning technique in which we segment the data to identify meaningful groups of observations with similar characteristics. In our case, the goal is to find these groups within the pharmaceutical companies data. As in the previous posts, we will start by loading the packages required for our analysis.

Note: This data is not up-to-date and some of these companies are no longer active.

library(tidyverse) # group of packages to wrangle and visualize data
library(cluster) # cluster analysis
library(factoextra) # visualize clusters and principal components
library(dendextend) # visualize dendrograms
library(here) # build file paths relative to the project root
library(ggrepel) # repel overlapping text labels
library(clustree) # visualize clusters
library(FactoMineR) # explore multivariate data
library(ggcorrplot) # visualize correlations
library(clValid) # compute cluster metrics
library(broom) # tidy algorithm outputs
library(umap) # dimension reduction
library(tidyquant) # used here for the theme and colors in the cluster visualizations

The next step is to load our file and glimpse it.

# load the file
pharmaceuticals <- read_csv(here::here("pharmaceuticals.csv"))

# explore the data
glimpse(pharmaceuticals)
## Observations: 21
## Variables: 14
## $ Symbol                <chr> "ABT", "AGN", "AHM", "AZN", "AVE", "BAY"...
## $ Name                  <chr> "Abbott Laboratories", "Allergan, Inc.",...
## $ Market_Cap            <dbl> 68.44, 7.58, 6.30, 67.63, 47.16, 16.90, ...
## $ Beta                  <dbl> 0.32, 0.41, 0.46, 0.52, 0.32, 1.11, 0.50...
## $ PE_Ratio              <dbl> 24.7, 82.5, 20.7, 21.5, 20.1, 27.9, 13.9...
## $ ROE                   <dbl> 26.4, 12.9, 14.9, 27.4, 21.8, 3.9, 34.8,...
## $ ROA                   <dbl> 11.8, 5.5, 7.8, 15.4, 7.5, 1.4, 15.1, 4....
## $ Asset_Turnover        <dbl> 0.7, 0.9, 0.9, 0.9, 0.6, 0.6, 0.9, 0.6, ...
## $ Leverage              <dbl> 0.42, 0.60, 0.27, 0.00, 0.34, 0.00, 0.57...
## $ Rev_Growth            <dbl> 7.54, 9.16, 7.05, 15.00, 26.81, -3.17, 2...
## $ Net_Profit_Margin     <dbl> 16.1, 5.5, 11.2, 18.0, 12.9, 2.6, 20.6, ...
## $ Median_Recommendation <chr> "Moderate Buy", "Moderate Buy", "Strong ...
## $ Location              <chr> "US", "CANADA", "UK", "UK", "FRANCE", "G...
## $ Exchange              <chr> "NYSE", "NYSE", "NYSE", "NYSE", "NYSE", ...

We have 14 variables related to the financial information of 21 pharmaceutical companies.

Data Exploration

Before we start with our clustering analysis, we should explore our data. First, we will check the correlations between our numeric variables.

Note: for this clustering analysis we will only use the numeric variables.

# create correlation matrix
pharmaceuticals_cor <- pharmaceuticals %>%
  select_if(is.numeric) %>%
  cor()

# visualize correlations
ggcorrplot(pharmaceuticals_cor, 
           outline.color = "grey50", 
           lab = TRUE,
           hc.order = TRUE,
           type = "full") 

We can identify strong positive correlations between Return on Assets (ROA), Return on Equity (ROE), Net Profit Margin, and Market Cap, which might indicate that these variables form a cluster.
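To make this easier to inspect, here is a quick sketch that flattens the correlation matrix into a tidy table and lists the strongest pairs (each pair appears twice because the matrix is symmetric):

# rank the pairwise correlations by absolute value
as.data.frame(as.table(pharmaceuticals_cor)) %>%
  filter(Var1 != Var2) %>%     # drop the diagonal
  arrange(desc(abs(Freq))) %>% # Freq holds the correlation value
  head(10)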

Another way to recognize the likely number of clusters before we do our clustering analysis is to use a Principal Component Analysis (PCA). Let’s use the PCA function from the FactoMineR package and the fviz_screeplot function from the factoextra package to check how much variance is explained by each dimension.

# use a data frame only with numeric values and scale the variables because they were measured in different scales
pharmaceuticals_tbl <- na.omit(pharmaceuticals) %>%
  dplyr::select(-c(1, 12, 13, 14)) %>% 
  column_to_rownames(var = "Name") %>%
  scale(.) %>% # standardize the values 
  as.data.frame() # convert to data frame
## Warning: Setting row names on a tibble is deprecated.
# use PCA to check how many dimensions we have
new_pca <- PCA(pharmaceuticals_tbl)

# check eigenvalues and percentage of variance
new_pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  3.8080296             42.3114401                          42.31144
## comp 2  1.7028349             18.9203881                          61.23183
## comp 3  1.1435807             12.7064523                          73.93828
## comp 4  0.8157384              9.0637604                          83.00204
## comp 5  0.7071235              7.8569272                          90.85897
## comp 6  0.4538979              5.0433098                          95.90228
## comp 7  0.2337408              2.5971197                          98.49940
## comp 8  0.1154565              1.2828502                          99.78225
## comp 9  0.0195977              0.2177522                         100.00000
# visualization of how much variance each dimension explains
fviz_screeplot(new_pca, addlabels = TRUE)

So, we have 4 Principal Components that capture more than 80% of the variance. The first Principal Component explains 42.3% and the second 18.92%. Regarding the eigenvalues, we have 3 components with eigenvalues above 1.

We can also check the contribution of the top 5 variables to each of the first 4 Principal Components using the fviz_contrib function from the factoextra package.

# get each variable PCA results
var <- get_pca_var(new_pca)

# each variable contribution to PC1 - top 5
fviz_contrib(new_pca, choice = "var", axes = 1, top = 5)

# each variable contribution to PC2 - top 5
fviz_contrib(new_pca, choice = "var", axes = 2, top = 5)

# each variable contribution to PC3 - top 5
fviz_contrib(new_pca, choice = "var", axes = 3, top = 5)

# each variable contribution to PC4 - top 5
fviz_contrib(new_pca, choice = "var", axes = 4, top = 5)

In the first component, Return on Assets, Market Cap, Return on Equity, and Net Profit Margin are the variables with the highest contribution. For the second component, Price/Earnings Ratio and Asset Turnover are the highest contributors. In the third component, Leverage and Revenue Growth have the highest contribution. Finally, in the fourth component, the variables with the highest contribution are Revenue Growth and Beta.

We can also visualize the contribution of the variables to the first two components on a single pair of axes.

# visualization of the first two components and the contributions of each variable
fviz_pca_var(new_pca, col.var="contrib",
             gradient.cols = c("red", "green", "blue"),
             repel = TRUE 
             ) + 
  labs( title = "Variables - PCA")

From this visualization, it’s clear that Return on Assets, Return on Equity, Net Profit Margin, Market Cap, and Asset Turnover tend to increase together and are the highest contributors to the first component.

Following the exploration of our data, let’s start our clustering analysis.

Hierarchical Cluster Analysis

Hierarchical clustering is an approach to identify groups within a data frame without having to pre-specify the number of clusters to be extracted.

As we have already scaled our data and removed any missing values, we can start our clustering process. First, we should measure the similarity between our observations using a Euclidean distance measure - other measures, such as the Manhattan distance, can also be used. These distances can be visualized with the fviz_dist function from the factoextra package.

# compute distance measure
dt <- dist(pharmaceuticals_tbl, method = "euclidean")


# visualize distance
fviz_dist(dt, gradient = list(low = "red", mid = "white", high = "blue"))
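If we wanted to compare distance measures, a Manhattan distance matrix can be computed the same way (a quick sketch; the object name is just illustrative):

# alternative: Manhattan (city-block) distance, for comparison
dt_manhattan <- dist(pharmaceuticals_tbl, method = "manhattan")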

Choosing a linkage method is the next step. It can be: complete, which considers the maximal distance between clusters; single, which takes into account the smallest distance between clusters; average, which considers the average distance between clusters; and Ward’s method, which merges the pair of clusters whose fusion yields the minimum increase in total within-cluster variance. We can write a small function around the agnes function - also an agglomerative hierarchical clustering method - to check which linkage method gives the highest agglomerative coefficient, that is, the strongest clustering structure.

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")


# function to check the best (i.e., highest value) linkage method
ac <- function(x) {
  agnes(dt, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.5600652 0.4600348 0.6990833 0.7943164

Thus, the ward method gives us the highest agglomerative coefficient. From now on, we will use this linkage method in our hierarchical clustering analysis.

# hierarchical clustering 
set.seed(88)
hclust_1 <- hclust(dt, method = "ward.D2") # ward.D2 corresponds to the ward method in the hclust function

# plot hierarchical clustering
plot(hclust_1, cex = 0.6)

Now, we should determine the optimal number of clusters - this process will also be used to determine the number of clusters in the following k-means algorithm. We will use two methods: elbow and average silhouette.

# elbow method
fviz_nbclust(pharmaceuticals_tbl, FUNcluster = hcut, method = "wss")

# silhouette method
fviz_nbclust(pharmaceuticals_tbl, FUNcluster = hcut, method = "silhouette")

These visualizations suggest two clusters. Let’s use the cutree function to cut our dendrogram into 2 clusters.

# cutree function
cl_1 <- cutree(hclust_1, k = 2)

# table function to check the number of pharmaceutical companies in each cluster
table(cl_1)
## cl_1
##  1  2 
## 11 10

Consequently, 11 companies are assigned to cluster 1 and 10 to cluster 2.
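If we want to see exactly which companies landed in each cluster, a quick sketch (cutree keeps the row names of our data frame as the names of the assignment vector):

# list the companies in each cluster
split(names(cl_1), cl_1)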

More importantly, we can visualize how the pharmaceutical companies are distributed in each cluster.

plot(hclust_1, cex = 0.6)
rect.hclust(hclust_1, k = 2, border= 2:5)

We can also visualize our two clusters in a two-dimensional graph.

# fviz_cluster function to visualize the clusters
fviz_cluster(list(data = pharmaceuticals_tbl, cluster = cl_1), repel = TRUE) +
  theme_minimal()

Now that we know which pharmaceutical companies belong to each cluster, we need to examine the most relevant characteristics of each one and understand what each cluster is really about. For this, we add the cluster assignments produced by cutree as a new variable to our data frame. Afterwards, we aggregate our data by cluster and inspect the mean of each variable.

# create cluster variable
pharmaceuticals_tbl$cluster <- cl_1

# aggregate by cluster our variables
pharmaceuticals_tbl %>%
  group_by(cluster) %>%
  summarise_all(mean)
## # A tibble: 2 x 10
##   cluster Market_Cap   Beta PE_Ratio    ROE    ROA Asset_Turnover Leverage
##     <int>      <dbl>  <dbl>    <dbl>  <dbl>  <dbl>          <dbl>    <dbl>
## 1       1      0.673 -0.359   -0.276  0.657  0.834          0.461   -0.333
## 2       2     -0.741  0.395    0.304 -0.722 -0.918         -0.507    0.366
## # ... with 2 more variables: Rev_Growth <dbl>, Net_Profit_Margin <dbl>

Cluster 1 is characterized by pharmaceutical companies that are more profitable, have a higher market value, and are more efficient. Cluster 2 has companies that borrow more money (higher Leverage) and show more volatility (higher Beta), and whose revenues grew more, but not their profits.

In a few words, cluster 1 can be described as the profitable, low-risk-investment type of pharmaceutical company, and cluster 2 as the non-profitable, high-risk-investment type.

K-Means Cluster Analysis

Now, we will start with K-means clustering. In this clustering method, we have to determine the number of clusters before running the analysis. For that, we can again use the elbow and silhouette methods.

Note: Other methods to determine the optimal number of clusters will be used later.

# rebuild the scaled, numeric-only data frame - this also drops the cluster column added during the hierarchical analysis
pharmaceuticals_tbl <- na.omit(pharmaceuticals) %>%
  dplyr::select(-c(1, 12, 13, 14)) %>% 
  column_to_rownames(var = "Name") %>%
  scale(.) %>% # standardize the values 
  as.data.frame() # convert to data frame
## Warning: Setting row names on a tibble is deprecated.
# elbow method
fviz_nbclust(pharmaceuticals_tbl, FUNcluster = kmeans, method = "wss")

# silhouette method
fviz_nbclust(pharmaceuticals_tbl, FUNcluster = kmeans, method = "silhouette")

From the elbow method we could choose 2, 3, or maybe 4 clusters, whereas the silhouette method indicates that 2 is the optimal number.
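Another way to examine candidate values of k is the clustree package we loaded at the start, which visualizes how cluster memberships change as k increases. A minimal sketch, assuming we store each k-means solution in columns named k1 to k5 (the naming and the range of k are illustrative choices):

# run k-means for k = 1 to 5 and store the assignments
set.seed(88)
kmeans_runs <- pharmaceuticals_tbl
for (k in 1:5) {
  kmeans_runs[[paste0("k", k)]] <- kmeans(pharmaceuticals_tbl, centers = k,
                                          nstart = 50)$cluster
}

# visualize how companies move between clusters as k grows
clustree(kmeans_runs, prefix = "k")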

Now, we can build our K-means cluster with two clusters.

set.seed(88)
k_cluster2 <- kmeans(pharmaceuticals_tbl, centers = 2, nstart = 50,
                    iter.max = 10) # k equals 2 clusters

table(k_cluster2$cluster)
## 
##  1  2 
## 10 11

As we have seen in the hierarchical clustering, we have 10 pharmaceutical companies in one cluster and 11 in the other. Next, we can check some metrics of compactness and separation with the glance function from the broom package and the dunn function from the clValid package.

# check total within and between sum of squares
glance(k_cluster2)
## # A tibble: 1 x 4
##   totss tot.withinss betweenss  iter
##   <dbl>        <dbl>     <dbl> <int>
## 1   180         119.      61.4     1
# dunn index
dunn_k2 <- dunn(clusters = k_cluster2$cluster, Data = pharmaceuticals_tbl)
dunn_k2
## [1] 0.2546142

We have a total within sum of squares (WSS) of 118.57 and a between sum of squares (BSS) of 61.43. The WSS is a measure of compactness, where a lower value is better, whereas the BSS measures separation, where a higher value reflects a better outcome. The Dunn index is 0.25; a higher value is better, as it means the clusters are more separated and/or more compact.
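As an additional internal metric, here is a minimal sketch of the average silhouette width, computed with the silhouette function from the cluster package and the Euclidean distances we calculated earlier (values closer to 1 indicate compact, well-separated clusters):

# average silhouette width for the 2-cluster solution
sil_k2 <- silhouette(k_cluster2$cluster, dt)
mean(sil_k2[, "sil_width"])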

We can now check whether our performance metrics improve with 3 clusters. If so, we will segment our pharmaceutical companies with k equal to 3. First, we build our K-means cluster.

set.seed(88)
k_cluster3 <- kmeans(pharmaceuticals_tbl, centers = 3, nstart = 50,
                     iter.max = 10) # centers equals 3 clusters



table(k_cluster3$cluster)
## 
##  1  2  3 
##  6  4 11

The table of this k-means clustering shows 6 pharmaceutical companies in cluster 1, 4 in cluster 2, and 11 in cluster 3.

Now with the functions glance and dunn we will take a look at the performance metrics.

# check WSS and BSS
glance(k_cluster3)
## # A tibble: 1 x 4
##   totss tot.withinss betweenss  iter
##   <dbl>        <dbl>     <dbl> <int>
## 1   180         96.0      84.0     3
tidy(k_cluster3)
## # A tibble: 3 x 12
##       x1     x2     x3     x4     x5     x6     x7     x8     x9  size
## *  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <int>
## 1 -0.826  0.478 -0.370 -0.563 -0.851 -0.999  0.850  0.916 -0.332     6
## 2 -0.613  0.270  1.31  -0.961 -1.02   0.231 -0.359 -0.576 -1.38      4
## 3  0.673 -0.359 -0.276  0.657  0.834  0.461 -0.333 -0.290  0.682    11
## # ... with 2 more variables: withinss <dbl>, cluster <fct>
# check dunn index
dunn_k3 <- dunn(clusters = k_cluster3$cluster, Data = pharmaceuticals_tbl)
dunn_k3
## [1] 0.3076927

The WSS shows a lower value and the BSS a higher value compared with the K-means clustering with 2 clusters. The Dunn index is also higher, which means that with k equal to 3 the clusters are more separated and/or more compact. In sum, the performance metrics of our k-means algorithm seem better with 3 clusters than with 2.
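A small sketch putting the two solutions side by side, reading the metrics straight off the kmeans objects and the Dunn indices computed above:

# compare the 2- and 3-cluster solutions
tibble(k = c(2, 3),
       tot_withinss = c(k_cluster2$tot.withinss, k_cluster3$tot.withinss),
       betweenss    = c(k_cluster2$betweenss, k_cluster3$betweenss),
       dunn         = c(dunn_k2, dunn_k3))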

Moving on with our analysis, we will now use a different approach from the fviz_cluster function used above to map our clusters in 2D. In this case, we will use the UMAP (Uniform Manifold Approximation and Projection) algorithm, a dimension reduction algorithm similar in spirit to PCA. UMAP reduces the variables in a data frame to two columns/dimensions corresponding to x and y coordinates, which capture the variability of our data in a two-dimensional space. So, it becomes possible to project the position of each pharmaceutical company in each cluster onto a two-dimensional graph.

First, we will apply the umap function to our data frame. The next step is to extract the layout element from the UMAP object to create a data frame with the coordinates of the respective pharmaceutical companies.

# umap our data frame
umap_pharma <- pharmaceuticals_tbl %>%
  umap()

# create umap dataframe
umap_obj <- umap_pharma$layout %>%
  as.data.frame() %>%
  rownames_to_column(var = "Pharma")

umap_obj
##                                Pharma         V1          V2
## 1                 Abbott Laboratories -0.5599971  0.46196012
## 2                      Allergan, Inc.  0.9289384  1.43291560
## 3                        Amersham plc  0.3354541  0.34965708
## 4                     AstraZeneca PLC -0.5293549 -0.56448968
## 5                             Aventis  1.5377670 -0.09310394
## 6                            Bayer AG  0.9640006  0.72732919
## 7        Bristol-Myers Squibb Company -1.1071940 -0.01250234
## 8                        Chattem, Inc  0.5180765  1.05588066
## 9               Elan Corporation, plc  0.9458157 -0.47147116
## 10              Eli Lilly and Company -1.1430508  0.76003736
## 11                GlaxoSmithKline plc -1.4867707 -1.15857252
## 12                   IVAX Corporation  0.8873017  0.18199628
## 13                  Johnson & Johnson -1.1974953 -0.74347125
## 14 Medicis Pharmaceutical Corporation  1.2166261 -0.93939240
## 15                  Merck & Co., Inc. -0.7444749 -1.11176241
## 16                        Novartis AG -0.7579310  1.06683888
## 17                         Pfizer Inc -1.4173004 -1.53152437
## 18              Pharmacia Corporation  1.6203149  0.77533872
## 19        Schering-Plough Corporation -0.1964202  0.04833752
## 20       Watson Pharmaceuticals, Inc.  1.8203646 -0.47752432
## 21                              Wyeth -1.6346701  0.24352297

We can also visualize the position of each pharmaceutical company in this UMAP 2D projection.

# visualize umap dataframe
umap_obj %>%
  ggplot(aes(V1, V2)) +
  geom_point() +
  geom_label_repel(aes(label = Pharma))
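Keep in mind that UMAP is stochastic, so the coordinates will differ between runs. As a hedged sketch, the umap package lets us fix the seed and tweak its main parameters through a config object (the values below are illustrative, not tuned):

# reproducible UMAP with custom settings
custom_config <- umap.defaults
custom_config$n_neighbors <- 10   # size of the local neighborhood
custom_config$random_state <- 88  # fixed seed for reproducibility
umap_pharma_fixed <- umap(pharmaceuticals_tbl, config = custom_config)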

Yet our goal is to map these pharmaceutical companies together with their cluster assignments. We will use the augment function from the broom package to attach each cluster to the respective pharmaceutical company. Right after, we will join the K-means data frame with the UMAP object.

# use augment to assign the clusters to our pharmaceutical companies
kmeans_tbl <- augment(k_cluster3, pharmaceuticals_tbl) %>% 
  dplyr::select(pharma = .rownames, .cluster)

# join the kmeans data frame with the umap object
kmeans_umap <- kmeans_tbl %>% 
  left_join(umap_obj, by = c("pharma" = "Pharma"))

kmeans_umap
## # A tibble: 21 x 4
##    pharma                       .cluster     V1      V2
##    <chr>                        <fct>     <dbl>   <dbl>
##  1 Abbott Laboratories          3        -0.560  0.462 
##  2 Allergan, Inc.               2         0.929  1.43  
##  3 Amersham plc                 2         0.335  0.350 
##  4 AstraZeneca PLC              3        -0.529 -0.564 
##  5 Aventis                      1         1.54  -0.0931
##  6 Bayer AG                     2         0.964  0.727 
##  7 Bristol-Myers Squibb Company 3        -1.11  -0.0125
##  8 Chattem, Inc                 1         0.518  1.06  
##  9 Elan Corporation, plc        1         0.946 -0.471 
## 10 Eli Lilly and Company        3        -1.14   0.760 
## # ... with 11 more rows

Now, we can visualize the clusters in a 2D projection.

kmeans_umap %>% 
  mutate(label_pharma = str_glue("Company: {pharma}
                                 Cluster: {.cluster}")) %>%
  ggplot(aes(V1, V2, color = .cluster)) +
  geom_point() +
  geom_label_repel(aes(label = label_pharma), size = 2.5) +
  guides(color = FALSE) +
  theme_minimal() +
  scale_color_tq() +
  labs(title = "Pharmaceutical Companies Segmentation",
       subtitle = "K-Means Cluster Algorithm with UMAP Projection")

As we did in the hierarchical clustering analysis, we need to check which characteristics are associated with each cluster to find out what these clusters really mean. We will again use the augment function to create a data frame from our K-means results and examine how the variables profile in each cluster.

k_cluster3 %>%
  augment(pharmaceuticals_tbl) %>%
  dplyr::select(-.rownames) %>%
  group_by(.cluster) %>% 
  summarise_all(mean)
## # A tibble: 3 x 10
##   .cluster Market_Cap   Beta PE_Ratio    ROE    ROA Asset_Turnover Leverage
##   <fct>         <dbl>  <dbl>    <dbl>  <dbl>  <dbl>          <dbl>    <dbl>
## 1 1            -0.826  0.478   -0.370 -0.563 -0.851         -0.999    0.850
## 2 2            -0.613  0.270    1.31  -0.961 -1.02           0.231   -0.359
## 3 3             0.673 -0.359   -0.276  0.657  0.834          0.461   -0.333
## # ... with 2 more variables: Rev_Growth <dbl>, Net_Profit_Margin <dbl>

Cluster 3 shows a pattern similar to Cluster 1 from the hierarchical clustering: it comprises highly profitable companies representing a low-risk investment. Cluster 2 is composed of unprofitable companies with overpriced stocks showing high volatility (high Beta); additionally, these companies do not seem to borrow a lot of money (low Leverage). Cluster 1 includes unprofitable, high-volatility companies like those in Cluster 2, but even considering their revenue growth and high levels of borrowing, their stocks look underpriced.

Let’s create a tibble with the tribble function, containing the cluster number and the respective characteristics.

# create tibble with the characteristics of the 3 clusters
cluster_tibble <- tibble::tribble(~.cluster, ~cluster.label,
                                  1, "Non Profitable/High Risk Investment/Underpriced Stocks",
                                  2, "Non Profitable/High Risk Investment/Overpriced Stocks",
                                  3, "Profitable/Low Risk Investment")


# make .cluster variable a factor
cluster_tibble <- cluster_tibble %>%
  mutate(.cluster = as.factor(as.character(.cluster)))

Now, we can visualize the clusters with the respective description of their characteristics.

# clusters visualization
kmeans_umap %>%
  left_join(cluster_tibble) %>%
  mutate(label_pharma = str_glue("Company: {pharma}
                                 Cluster: {.cluster}
                                 {cluster.label}")) %>%
  ggplot(aes(V1, V2, color = .cluster)) +
  geom_point() +
  geom_label_repel(aes(label = label_pharma), size = 2) +
  guides(color = FALSE) +
  theme_tq() +
  scale_color_tq() +
  labs(title = "Pharmaceutical Companies Segmentation",
       subtitle = "UMAP 2D Projection with the K-Means Cluster Algorithm")
## Joining, by = ".cluster"

Conclusion

I hope you have enjoyed this introduction to clustering analysis. Despite the plethora of clustering algorithms available, the goal was to show the power of this unsupervised technique to segment any kind of business. See you in the following posts. In the meantime, keep coding!