A quick tour of mclust

Luca Scrucca

21 Nov 2017

Introduction

mclust is a contributed R package for model-based clustering, classification, and density estimation based on finite normal mixture modelling. It provides functions for parameter estimation via the expectation-maximization (EM) algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation, and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis. Additional functions are available for displaying and visualizing fitted models along with clustering, classification, and density estimation results.

This document gives a quick tour of the functionality of mclust (version 5.4). It was written in R Markdown, using the knitr package for production. See help(package = "mclust") for further details, and citation("mclust") for references to cite when using the package.

library(mclust)
##     __  ___________    __  _____________
##    /  |/  / ____/ /   / / / / ___/_  __/
##   / /|_/ / /   / /   / / / /\__ \ / /   
##  / /  / / /___/ /___/ /_/ /___/ // /    
## /_/  /_/\____/_____/\____//____//_/    version 5.4
## Type 'citation("mclust")' for citing this R package in publications.

Clustering

data(diabetes)
class <- diabetes$class
table(class)
## class
## Chemical   Normal    Overt 
##       36       76       33
X <- diabetes[,-1]
head(X)
##   glucose insulin sspg
## 1      80     356  124
## 2      97     289  117
## 3     105     319  143
## 4      90     356  199
## 5      90     323  240
## 6      86     381  157
clPairs(X, class)


BIC <- mclustBIC(X)
plot(BIC)

summary(BIC)
## Best BIC values:
##              VVV,3       VVV,4       EVE,6
## BIC      -4751.316 -4784.32213 -4785.24591
## BIC diff     0.000   -33.00573   -33.92951

The best model according to BIC is VVV (ellipsoidal, with varying volume, shape, and orientation) with 3 components; it is fitted next by passing the precomputed BIC object to Mclust():

mod1 <- Mclust(X, x = BIC)
summary(mod1, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -2303.496 145 29 -4751.316 -4770.169
## 
## Clustering table:
##  1  2  3 
## 81 36 28 
## 
## Mixing probabilities:
##         1         2         3 
## 0.5368974 0.2650129 0.1980897 
## 
## Means:
##              [,1]     [,2]       [,3]
## glucose  90.96239 104.5335  229.42136
## insulin 357.79083 494.8259 1098.25990
## sspg    163.74858 309.5583   81.60001
## 
## Variances:
## [,,1]
##          glucose    insulin       sspg
## glucose 57.18044   75.83206   14.73199
## insulin 75.83206 2101.76553  322.82294
## sspg    14.73199  322.82294 2416.99074
## [,,2]
##           glucose   insulin       sspg
## glucose  185.0290  1282.340  -509.7313
## insulin 1282.3398 14039.283 -2559.0251
## sspg    -509.7313 -2559.025 23835.7278
## [,,3]
##           glucose   insulin       sspg
## glucose  5529.250  20389.09  -2486.208
## insulin 20389.088  83132.48 -10393.004
## sspg    -2486.208 -10393.00   2217.533

plot(mod1, what = "classification")

table(class, mod1$classification)
##           
## class       1  2  3
##   Chemical  9 26  1
##   Normal   72  4  0
##   Overt     0  6 27
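
The agreement between the known diabetes classes and the estimated clustering can also be summarised in a single number with the adjusted Rand index (ARI), which mclust provides through adjustedRandIndex():

```r
# Adjusted Rand index between the known classes and the estimated clustering:
# 1 indicates perfect agreement, values near 0 indicate chance-level agreement
adjustedRandIndex(class, mod1$classification)
```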

par(mfrow = c(2,2))
plot(mod1, what = "uncertainty", dimens = c(2,1), main = "")
plot(mod1, what = "uncertainty", dimens = c(3,1), main = "")
plot(mod1, what = "uncertainty", dimens = c(2,3), main = "")
par(mfrow = c(1,1))


ICL <- mclustICL(X)
summary(ICL)
## Best ICL values:
##              VVV,3       EVE,6       EVE,7
## ICL      -4770.169 -4797.38232 -4797.50566
## ICL diff     0.000   -27.21342   -27.33677
plot(ICL)


LRT <- mclustBootstrapLRT(X, modelName = "VVV")
LRT
## Bootstrap sequential LRT for the number of mixture components
## -------------------------------------------------------------
## Model        = VVV 
## Replications = 999 
##               LRTS bootstrap p-value
## 1 vs 2   361.16739             0.001
## 2 vs 3   123.49685             0.001
## 3 vs 4    16.76161             0.482

The sequential tests reject 1 component in favour of 2, and 2 in favour of 3, but fail to reject 3 in favour of 4 (p = 0.482), again supporting a three-component model.

Classification

EDDA

data(iris)
class <- iris$Species
table(class)
## class
##     setosa versicolor  virginica 
##         50         50         50
X <- iris[,1:4]
head(X)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
mod2 <- MclustDA(X, class, modelType = "EDDA")
summary(mod2)
## ------------------------------------------------
## Gaussian finite mixture model for classification 
## ------------------------------------------------
## 
## EDDA model summary:
## 
##  log.likelihood   n df       BIC
##       -187.7097 150 36 -555.8024
##             
## Classes       n Model G
##   setosa     50   VEV 1
##   versicolor 50   VEV 1
##   virginica  50   VEV 1
## 
## Training classification summary:
## 
##             Predicted
## Class        setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          0        50
## 
## Training error = 0.02
plot(mod2, what = "scatterplot")

plot(mod2, what = "classification")

MclustDA

data(banknote)
class <- banknote$Status
table(class)
## class
## counterfeit     genuine 
##         100         100
X <- banknote[,-1]
head(X)
##   Length  Left Right Bottom  Top Diagonal
## 1  214.8 131.0 131.1    9.0  9.7    141.0
## 2  214.6 129.7 129.7    8.1  9.5    141.7
## 3  214.8 129.7 129.7    8.7  9.6    142.2
## 4  214.8 129.7 129.6    7.5 10.4    142.0
## 5  215.0 129.6 129.7   10.4  7.7    141.8
## 6  215.7 130.8 130.5    9.0 10.1    141.4
mod3 <- MclustDA(X, class)
summary(mod3)
## ------------------------------------------------
## Gaussian finite mixture model for classification 
## ------------------------------------------------
## 
## MclustDA model summary:
## 
##  log.likelihood   n df       BIC
##       -646.0801 200 66 -1641.849
##              
## Classes         n Model G
##   counterfeit 100   EVE 2
##   genuine     100   XXX 1
## 
## Training classification summary:
## 
##              Predicted
## Class         counterfeit genuine
##   counterfeit         100       0
##   genuine               0     100
## 
## Training error = 0
plot(mod3, what = "scatterplot")

plot(mod3, what = "classification")

Cross-validation error
The 10-fold cross-validated misclassification error and its standard error are reported below; since the folds are formed at random, results may vary slightly between runs unless a random seed is set first.

unlist(cvMclustDA(mod2, nfold = 10)[2:3])
##     error        se 
## 0.0200000 0.0101835
unlist(cvMclustDA(mod3, nfold = 10)[2:3])
## error    se 
##     0     0
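
A fitted MclustDA (or EDDA) model can also classify new observations through the predict() method, which returns the predicted classes together with a matrix of posterior class probabilities. A minimal sketch, here simply re-predicting the banknote training data for mod3:

```r
pred <- predict(mod3, newdata = X)
head(pred$classification)        # predicted class labels
head(pred$z)                     # posterior class probabilities, one column per class
table(class, pred$classification)
```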

Density estimation

Univariate

data(acidity)
mod4 <- densityMclust(acidity)
summary(mod4)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling 
## -------------------------------------------------------
## 
## Mclust E (univariate, equal variance) model with 2 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -185.9493 155  4 -392.0723 -398.5554
## 
## Clustering table:
##  1  2 
## 98 57
plot(mod4, what = "BIC")

plot(mod4, what = "density", data = acidity, breaks = 15)

plot(mod4, what = "diagnostic", type = "cdf")

plot(mod4, what = "diagnostic", type = "qq")

Multivariate

data(faithful)
mod5 <- densityMclust(faithful)
summary(mod5)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling 
## -------------------------------------------------------
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -1126.326 272 11 -2314.316 -2357.824
## 
## Clustering table:
##   1   2   3 
##  40  97 135
plot(mod5, what = "BIC")

plot(mod5, what = "density")

plot(mod5, what = "density", type = "image", col = "dodgerblue3", grid = 100)

plot(mod5, what = "density", type = "persp")
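
The estimated density can also be evaluated at new data points using the predict() method. A brief sketch, using two hypothetical (eruptions, waiting) pairs:

```r
# Hypothetical points at which to evaluate the fitted mixture density
newdat <- data.frame(eruptions = c(2, 4.5), waiting = c(55, 80))
predict(mod5, newdata = newdat)  # estimated density at the new points
```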

Bootstrap inference

boot1 <- MclustBootstrap(mod1, nboot = 999, type = "bs")
summary(boot1, what = "se")
## ----------------------------------------------------------
## Resampling standard errors 
## ----------------------------------------------------------
## Model                      = VVV 
## Num. of mixture components = 3 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## 
## Mixing probabilities:
##          1          2          3 
## 0.05419824 0.05076478 0.03439980 
## 
## Means:
##                 1         2         3
## glucose 0.9971095  3.495092 16.968865
## insulin 7.5755190 28.509588 66.873839
## sspg    7.6801499 31.104490  9.850315
## 
## Variances:
## [,,1]
##          glucose  insulin      sspg
## glucose 10.98617  52.3023  55.67768
## insulin 52.30230 514.5611 423.81468
## sspg    55.67768 423.8147 638.96053
## [,,2]
##           glucose   insulin      sspg
## glucose  73.19889  713.4218  454.1136
## insulin 713.42183 8223.6427 3408.2040
## sspg    454.11357 3408.2040 6701.2674
## [,,3]
##           glucose   insulin      sspg
## glucose 1050.8228  4222.914  656.1974
## insulin 4222.9140 19090.309 2472.4658
## sspg     656.1974  2472.466  490.7419
summary(boot1, what = "ci")
## ----------------------------------------------------------
## Resampling confidence intervals 
## ----------------------------------------------------------
## Model                      = VVV 
## Num. of mixture components = 3 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## Confidence level           = 0.95 
## 
## Mixing probabilities:
##               1         2         3
## 2.5%  0.4391053 0.1522011 0.1304827
## 97.5% 0.6531662 0.3523459 0.2650163
## 
## Means:
## [,,1]
##        glucose  insulin     sspg
## 2.5%  89.22400 343.2214 150.0103
## 97.5% 93.19995 373.9562 181.4140
## [,,2]
##         glucose  insulin     sspg
## 2.5%   98.82161 446.5122 258.6823
## 97.5% 112.49144 556.9686 372.5100
## [,,3]
##        glucose   insulin      sspg
## 2.5%  196.9730  967.6157  62.84289
## 97.5% 264.2195 1231.6510 101.30441
## 
## Variances:
## [,,1]
##        glucose  insulin     sspg
## 2.5%  38.30658 1261.947 1555.379
## 97.5% 81.45462 3217.946 4146.033
## [,,2]
##         glucose   insulin     sspg
## 2.5%   86.63346  3138.875 12422.98
## 97.5% 385.78229 33757.819 38375.98
## [,,3]
##        glucose   insulin     sspg
## 2.5%  3308.589  47536.77 1348.638
## 97.5% 7375.754 121281.74 3292.333

par(mfrow = c(4,3))
plot(boot1, what = "pro")
plot(boot1, what = "mean")
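
Besides the nonparametric bootstrap (type = "bs"), MclustBootstrap also implements the weighted likelihood bootstrap, selected with type = "wlbs", in which observations are randomly reweighted rather than resampled. A sketch for the clustering model fitted earlier:

```r
# Weighted likelihood bootstrap for the diabetes clustering model
wlboot1 <- MclustBootstrap(mod1, nboot = 999, type = "wlbs")
summary(wlboot1, what = "se")
```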

boot4 <- MclustBootstrap(mod4, nboot = 999, type = "bs")
summary(boot4, what = "se")
## ----------------------------------------------------------
## Resampling standard errors 
## ----------------------------------------------------------
## Model                      = E 
## Num. of mixture components = 2 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## 
## Mixing probabilities:
##          1          2 
## 0.03977832 0.03977832 
## 
## Means:
##          1          2 
## 0.04488707 0.06680960 
## 
## Variances:
##          1          2 
## 0.02458687 0.02458687
summary(boot4, what = "ci")
## ----------------------------------------------------------
## Resampling confidence intervals 
## ----------------------------------------------------------
## Model                      = E 
## Num. of mixture components = 2 
## Replications               = 999 
## Type                       = nonparametric bootstrap 
## Confidence level           = 0.95 
## 
## Mixing probabilities:
##               1         2
## 2.5%  0.5452403 0.2994824
## 97.5% 0.7005176 0.4547597
## 
## Means:
##              1        2
## 2.5%  4.280150 6.191873
## 97.5% 4.455027 6.456946
## 
## Variances:
##               1         2
## 2.5%  0.1436647 0.1436647
## 97.5% 0.2379912 0.2379912

par(mfrow = c(2,2))
plot(boot4, what = "pro")
plot(boot4, what = "mean")

Dimension reduction

Clustering

mod1dr <- MclustDR(mod1)
summary(mod1dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification 
## -----------------------------------------------------------------
## 
## Mixture model type: Mclust (VVV, 3)
##         
## Clusters  n
##        1 81
##        2 36
##        3 28
## 
## Estimated basis vectors:
##              Dir1     Dir2      Dir3
## glucose -0.988671  0.76532 -0.966565
## insulin  0.142656 -0.13395  0.252109
## sspg    -0.046689  0.62955  0.046837
## 
##                Dir1     Dir2      Dir3
## Eigenvalues  1.3506  0.75608   0.53412
## Cum. %      51.1440 79.77436 100.00000
plot(mod1dr, what = "pairs")

plot(mod1dr, what = "boundaries", ngrid = 200)


mod1dr <- MclustDR(mod1, lambda = 1)
summary(mod1dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification 
## -----------------------------------------------------------------
## 
## Mixture model type: Mclust (VVV, 3)
##         
## Clusters  n
##        1 81
##        2 36
##        3 28
## 
## Estimated basis vectors:
##              Dir1     Dir2
## glucose  0.764699  0.86359
## insulin -0.643961 -0.22219
## sspg     0.023438 -0.45260
## 
##                Dir1      Dir2
## Eigenvalues  1.2629   0.35218
## Cum. %      78.1939 100.00000
plot(mod1dr, what = "scatterplot")

plot(mod1dr, what = "boundaries", ngrid = 200)

Classification

mod2dr <- MclustDR(mod2)
summary(mod2dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification 
## -----------------------------------------------------------------
## 
## Mixture model type: EDDA 
##             
## Classes       n Model G
##   setosa     50   VEV 1
##   versicolor 50   VEV 1
##   virginica  50   VEV 1
## 
## Estimated basis vectors:
##                  Dir1      Dir2     Dir3     Dir4
## Sepal.Length  0.17425 -0.193663  0.64081 -0.46231
## Sepal.Width   0.45292  0.066561  0.34852  0.57110
## Petal.Length -0.61629 -0.311030 -0.42366  0.46256
## Petal.Width  -0.62024  0.928076  0.53703 -0.49613
## 
##                 Dir1     Dir2      Dir3       Dir4
## Eigenvalues  0.94747  0.68835  0.076141   0.052607
## Cum. %      53.69408 92.70374 97.018700 100.000000
plot(mod2dr, what = "scatterplot")

plot(mod2dr, what = "boundaries", ngrid = 200)


mod3dr <- MclustDR(mod3)
summary(mod3dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification 
## -----------------------------------------------------------------
## 
## Mixture model type: MclustDA 
##              
## Classes         n Model G
##   counterfeit 100   EVE 2
##   genuine     100   XXX 1
## 
## Estimated basis vectors:
##              Dir1      Dir2      Dir3      Dir4       Dir5      Dir6
## Length   -0.10139 -0.328225  0.797068 -0.033629 -0.3174275  0.085062
## Left     -0.21718 -0.305014 -0.303111 -0.893349  0.3700659 -0.565410
## Right     0.29222 -0.018401 -0.495891  0.407413 -0.8612986  0.480799
## Bottom    0.57591  0.445352  0.120173 -0.034595  0.0043174 -0.078640
## Top       0.57542  0.385535  0.100865 -0.103623  0.1359128  0.625902
## Diagonal -0.44089  0.672250 -0.047784 -0.151252 -0.0443255  0.209691
## 
##                 Dir1     Dir2     Dir3     Dir4      Dir5       Dir6
## Eigenvalues  0.87242  0.55373  0.48546  0.13291  0.053075   0.027273
## Cum. %      41.05755 67.11689 89.96377 96.21866 98.716489 100.000000
plot(mod3dr, what = "scatterplot")

plot(mod3dr, what = "boundaries", ngrid = 200)

Using colorblind-friendly palettes

Most of the graphs produced by mclust use colors that by default are defined in the following options:

mclust.options("bicPlotColors")
##       EII       VII       EEI       EVI       VEI       VVI       EEE 
##    "gray"   "black" "#218B21" "#41884F" "#508476" "#58819C" "#597DC3" 
##       EVE       VEE       VVE       EEV       VEV       EVV       VVV 
## "#5178EA" "#716EE7" "#9B60B8" "#B2508B" "#C03F60" "#C82A36" "#CC0000" 
##         E         V 
##    "gray"   "black"
mclust.options("classPlotColors")
##  [1] "dodgerblue2"    "red3"           "green3"         "slateblue"     
##  [5] "darkorange"     "skyblue1"       "violetred4"     "forestgreen"   
##  [9] "steelblue4"     "slategrey"      "brown"          "black"         
## [13] "darkseagreen"   "darkgoldenrod3" "olivedrab"      "royalblue"     
## [17] "tomato4"        "cyan2"          "springgreen2"

The first option controls the colors used in plots of BIC, ICL, and similar model-selection criteria, whereas the second controls the colors used to indicate clusters or classes when plotting data.

Colorblind-friendly palettes can be defined and assigned to the above options as follows:

cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#999999", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
bicPlotColors <- mclust.options("bicPlotColors")
bicPlotColors[1:14] <- c(cbPalette, cbPalette[1:6])
mclust.options("bicPlotColors" = bicPlotColors)
mclust.options("classPlotColors" = cbPalette)

clPairs(iris[,-5], iris$Species)

mod <- Mclust(iris[,-5])
plot(mod, what = "BIC")

plot(mod, what = "classification")

The above color definitions are adapted from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/, but users can easily define their own palettes if needed.
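
Because mclust.options() changes global state, a common pattern is to save the current settings before modifying them and restore them afterwards. This sketch assumes that, like base options(), mclust.options() accepts a previously saved list of settings:

```r
# Snapshot the current settings before overriding them (assumes mclust.options()
# follows the get/set semantics of base options())
opar <- mclust.options()
mclust.options("classPlotColors" = cbPalette)
# ... produce plots with the colorblind-friendly palette ...
mclust.options(opar)  # restore the saved settings
```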

References

Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233. https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf

Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97/458, pp. 611-631.

Fraley C., Raftery A. E., Murphy T. B. and Scrucca L. (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.