mclust is a contributed R package for model-based clustering, classification, and density estimation based on finite normal mixture modelling. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation, and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis. Additional functionality is available for displaying and visualizing the fitted models together with clustering, classification, and density estimation results.
This document gives a quick tour of mclust (version 5.4) functionalities. It was written in R Markdown, using the knitr package for production. See help(package = "mclust") for further details, and citation("mclust") for the references to cite when using the package.
library(mclust)
## __ ___________ __ _____________
## / |/ / ____/ / / / / / ___/_ __/
## / /|_/ / / / / / / / /\__ \ / /
## / / / / /___/ /___/ /_/ /___/ // /
## /_/ /_/\____/_____/\____//____//_/ version 5.4
## Type 'citation("mclust")' for citing this R package in publications.
data(diabetes)
class <- diabetes$class
table(class)
## class
## Chemical Normal Overt
## 36 76 33
X <- diabetes[,-1]
head(X)
## glucose insulin sspg
## 1 80 356 124
## 2 97 289 117
## 3 105 319 143
## 4 90 356 199
## 5 90 323 240
## 6 86 381 157
clPairs(X, class)
BIC <- mclustBIC(X)
plot(BIC)
summary(BIC)
## Best BIC values:
## VVV,3 VVV,4 EVE,6
## BIC -4751.316 -4784.32213 -4785.24591
## BIC diff 0.000 -33.00573 -33.92951
mod1 <- Mclust(X, x = BIC)
summary(mod1, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
##
## log.likelihood n df BIC ICL
## -2303.496 145 29 -4751.316 -4770.169
##
## Clustering table:
## 1 2 3
## 81 36 28
##
## Mixing probabilities:
## 1 2 3
## 0.5368974 0.2650129 0.1980897
##
## Means:
## [,1] [,2] [,3]
## glucose 90.96239 104.5335 229.42136
## insulin 357.79083 494.8259 1098.25990
## sspg 163.74858 309.5583 81.60001
##
## Variances:
## [,,1]
## glucose insulin sspg
## glucose 57.18044 75.83206 14.73199
## insulin 75.83206 2101.76553 322.82294
## sspg 14.73199 322.82294 2416.99074
## [,,2]
## glucose insulin sspg
## glucose 185.0290 1282.340 -509.7313
## insulin 1282.3398 14039.283 -2559.0251
## sspg -509.7313 -2559.025 23835.7278
## [,,3]
## glucose insulin sspg
## glucose 5529.250 20389.09 -2486.208
## insulin 20389.088 83132.48 -10393.004
## sspg -2486.208 -10393.00 2217.533
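Note that the call Mclust(X, x = BIC) reuses the BIC values computed above. A minimal, equivalent alternative is to call Mclust() directly, in which case the BIC-based search over the covariance structures and the default range of components (G = 1:9) is carried out internally (object name arbitrary):
# equivalent fit without a precomputed BIC object; the same VVV, 3-component model is selected
mod1b <- Mclust(X)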
plot(mod1, what = "classification")
table(class, mod1$classification)
##
## class 1 2 3
## Chemical 9 26 1
## Normal 72 4 0
## Overt 0 6 27
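Agreement between the known classes and the estimated clustering can also be summarized with a single measure, for instance the adjusted Rand index available in mclust (value not shown here):
# adjusted Rand index: 1 corresponds to perfect agreement, values near 0 to chance agreement
adjustedRandIndex(class, mod1$classification)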
par(mfrow = c(2,2))
plot(mod1, what = "uncertainty", dimens = c(2,1), main = "")
plot(mod1, what = "uncertainty", dimens = c(3,1), main = "")
plot(mod1, what = "uncertainty", dimens = c(2,3), main = "")
par(mfrow = c(1,1))
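The uncertainty associated with each observation is stored in the fitted object and can be inspected directly; for example, to list the observations with the largest classification uncertainty (values omitted here):
# classification uncertainty = 1 - maximum posterior membership probability
head(sort(mod1$uncertainty, decreasing = TRUE))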
ICL <- mclustICL(X)
summary(ICL)
## Best ICL values:
## VVV,3 EVE,6 EVE,7
## ICL -4770.169 -4797.38232 -4797.50566
## ICL diff 0.000 -27.21342 -27.33677
plot(ICL)
LRT <- mclustBootstrapLRT(X, modelName = "VVV")
LRT
## Bootstrap sequential LRT for the number of mixture components
## -------------------------------------------------------------
## Model = VVV
## Replications = 999
## LRTS bootstrap p-value
## 1 vs 2 361.16739 0.001
## 2 vs 3 123.49685 0.001
## 3 vs 4 16.76161 0.482
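The bootstrap distribution of the LRTS for each comparison can also be displayed, assuming the plot method for mclustBootstrapLRT objects with the G argument indicating the number of components under the null hypothesis (a sketch):
# bootstrap distribution of the LRTS for testing 2 vs 3 components
plot(LRT, G = 2)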
data(iris)
class <- iris$Species
table(class)
## class
## setosa versicolor virginica
## 50 50 50
X <- iris[,1:4]
head(X)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
mod2 <- MclustDA(X, class, modelType = "EDDA")
summary(mod2)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
## log.likelihood n df BIC
## -187.7097 150 36 -555.8024
##
## Classes n Model G
## setosa 50 VEV 1
## versicolor 50 VEV 1
## virginica 50 VEV 1
##
## Training classification summary:
##
## Predicted
## Class setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 0 50
##
## Training error = 0.02
plot(mod2, what = "scatterplot")
plot(mod2, what = "classification")
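A fitted MclustDA model can also be used to classify new observations through its predict method. The following sketch simply reclassifies the training data; in practice newdata would be a separate test set:
# predicted classes and class membership probabilities for the supplied data
pred2 <- predict(mod2, newdata = X)
table(class, pred2$classification)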
data(banknote)
class <- banknote$Status
table(class)
## class
## counterfeit genuine
## 100 100
X <- banknote[,-1]
head(X)
## Length Left Right Bottom Top Diagonal
## 1 214.8 131.0 131.1 9.0 9.7 141.0
## 2 214.6 129.7 129.7 8.1 9.5 141.7
## 3 214.8 129.7 129.7 8.7 9.6 142.2
## 4 214.8 129.7 129.6 7.5 10.4 142.0
## 5 215.0 129.6 129.7 10.4 7.7 141.8
## 6 215.7 130.8 130.5 9.0 10.1 141.4
mod3 <- MclustDA(X, class)
summary(mod3)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log.likelihood n df BIC
## -646.0801 200 66 -1641.849
##
## Classes n Model G
## counterfeit 100 EVE 2
## genuine 100 XXX 1
##
## Training classification summary:
##
## Predicted
## Class counterfeit genuine
## counterfeit 100 0
## genuine 0 100
##
## Training error = 0
plot(mod3, what = "scatterplot")
plot(mod3, what = "classification")
unlist(cvMclustDA(mod2, nfold = 10)[2:3])
## error se
## 0.0200000 0.0101835
unlist(cvMclustDA(mod3, nfold = 10)[2:3])
## error se
## 0 0
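Since cross-validation assigns observations to folds at random, the error estimates above vary slightly between runs; setting the random seed beforehand makes them reproducible (seed value arbitrary):
set.seed(1)                                # arbitrary seed for reproducible folds
unlist(cvMclustDA(mod2, nfold = 10)[2:3])  # cross-validated error and its standard error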
data(acidity)
mod4 <- densityMclust(acidity)
summary(mod4)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust E (univariate, equal variance) model with 2 components:
##
## log.likelihood n df BIC ICL
## -185.9493 155 4 -392.0723 -398.5554
##
## Clustering table:
## 1 2
## 98 57
plot(mod4, what = "BIC")
plot(mod4, what = "density", data = acidity, breaks = 15)
plot(mod4, what = "diagnostic", type = "cdf")
plot(mod4, what = "diagnostic", type = "qq")
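The estimated density can also be evaluated at arbitrary points, assuming the predict method for densityMclust objects, which returns the density estimate at the supplied values (evaluation points arbitrary, output omitted):
# density estimates at a few points on the acidity scale
predict(mod4, newdata = c(4, 5, 6, 7))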
data(faithful)
mod5 <- densityMclust(faithful)
summary(mod5)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3 components:
##
## log.likelihood n df BIC ICL
## -1126.326 272 11 -2314.316 -2357.824
##
## Clustering table:
## 1 2 3
## 40 97 135
plot(mod5, what = "BIC")
plot(mod5, what = "density")
plot(mod5, what = "density", type = "image", col = "dodgerblue3", grid = 100)
plot(mod5, what = "density", type = "persp")
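As mentioned in the introduction, data can also be simulated from a fitted mixture model using sim(); the sketch below draws a sample of the same size as the original data from the density estimate above (seed and object name arbitrary):
# simulate from the fitted EEE, 3-component mixture; the first column of the result
# is the component membership, the remaining columns are the simulated variables
sim5 <- sim(modelName = mod5$modelName, parameters = mod5$parameters,
            n = nrow(faithful), seed = 1)
head(sim5)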
boot1 <- MclustBootstrap(mod1, nboot = 999, type = "bs")
summary(boot1, what = "se")
## ----------------------------------------------------------
## Resampling standard errors
## ----------------------------------------------------------
## Model = VVV
## Num. of mixture components = 3
## Replications = 999
## Type = nonparametric bootstrap
##
## Mixing probabilities:
## 1 2 3
## 0.05419824 0.05076478 0.03439980
##
## Means:
## 1 2 3
## glucose 0.9971095 3.495092 16.968865
## insulin 7.5755190 28.509588 66.873839
## sspg 7.6801499 31.104490 9.850315
##
## Variances:
## [,,1]
## glucose insulin sspg
## glucose 10.98617 52.3023 55.67768
## insulin 52.30230 514.5611 423.81468
## sspg 55.67768 423.8147 638.96053
## [,,2]
## glucose insulin sspg
## glucose 73.19889 713.4218 454.1136
## insulin 713.42183 8223.6427 3408.2040
## sspg 454.11357 3408.2040 6701.2674
## [,,3]
## glucose insulin sspg
## glucose 1050.8228 4222.914 656.1974
## insulin 4222.9140 19090.309 2472.4658
## sspg 656.1974 2472.466 490.7419
summary(boot1, what = "ci")
## ----------------------------------------------------------
## Resampling confidence intervals
## ----------------------------------------------------------
## Model = VVV
## Num. of mixture components = 3
## Replications = 999
## Type = nonparametric bootstrap
## Confidence level = 0.95
##
## Mixing probabilities:
## 1 2 3
## 2.5% 0.4391053 0.1522011 0.1304827
## 97.5% 0.6531662 0.3523459 0.2650163
##
## Means:
## [,,1]
## glucose insulin sspg
## 2.5% 89.22400 343.2214 150.0103
## 97.5% 93.19995 373.9562 181.4140
## [,,2]
## glucose insulin sspg
## 2.5% 98.82161 446.5122 258.6823
## 97.5% 112.49144 556.9686 372.5100
## [,,3]
## glucose insulin sspg
## 2.5% 196.9730 967.6157 62.84289
## 97.5% 264.2195 1231.6510 101.30441
##
## Variances:
## [,,1]
## glucose insulin sspg
## 2.5% 38.30658 1261.947 1555.379
## 97.5% 81.45462 3217.946 4146.033
## [,,2]
## glucose insulin sspg
## 2.5% 86.63346 3138.875 12422.98
## 97.5% 385.78229 33757.819 38375.98
## [,,3]
## glucose insulin sspg
## 2.5% 3308.589 47536.77 1348.638
## 97.5% 7375.754 121281.74 3292.333
par(mfrow = c(4,3))
plot(boot1, what = "pro")
plot(boot1, what = "mean")
par(mfrow = c(1,1))
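Besides the nonparametric bootstrap ("bs") used above, MclustBootstrap also accepts the weighted likelihood bootstrap via the type argument; the resulting summaries have the same form as those shown above and are omitted (object name arbitrary):
# weighted likelihood bootstrap resampling for the same fitted model
boot1wl <- MclustBootstrap(mod1, nboot = 999, type = "wlbs")
summary(boot1wl, what = "se")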
boot4 <- MclustBootstrap(mod4, nboot = 999, type = "bs")
summary(boot4, what = "se")
## ----------------------------------------------------------
## Resampling standard errors
## ----------------------------------------------------------
## Model = E
## Num. of mixture components = 2
## Replications = 999
## Type = nonparametric bootstrap
##
## Mixing probabilities:
## 1 2
## 0.03977832 0.03977832
##
## Means:
## 1 2
## 0.04488707 0.06680960
##
## Variances:
## 1 2
## 0.02458687 0.02458687
summary(boot4, what = "ci")
## ----------------------------------------------------------
## Resampling confidence intervals
## ----------------------------------------------------------
## Model = E
## Num. of mixture components = 2
## Replications = 999
## Type = nonparametric bootstrap
## Confidence level = 0.95
##
## Mixing probabilities:
## 1 2
## 2.5% 0.5452403 0.2994824
## 97.5% 0.7005176 0.4547597
##
## Means:
## 1 2
## 2.5% 4.280150 6.191873
## 97.5% 4.455027 6.456946
##
## Variances:
## 1 2
## 2.5% 0.1436647 0.1436647
## 97.5% 0.2379912 0.2379912
par(mfrow = c(2,2))
plot(boot4, what = "pro")
plot(boot4, what = "mean")
par(mfrow = c(1,1))
mod1dr <- MclustDR(mod1)
summary(mod1dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: Mclust (VVV, 3)
##
## Clusters n
## 1 81
## 2 36
## 3 28
##
## Estimated basis vectors:
## Dir1 Dir2 Dir3
## glucose -0.988671 0.76532 -0.966565
## insulin 0.142656 -0.13395 0.252109
## sspg -0.046689 0.62955 0.046837
##
## Dir1 Dir2 Dir3
## Eigenvalues 1.3506 0.75608 0.53412
## Cum. % 51.1440 79.77436 100.00000
plot(mod1dr, what = "pairs")
plot(mod1dr, what = "boundaries", ngrid = 200)
mod1dr <- MclustDR(mod1, lambda = 1)
summary(mod1dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: Mclust (VVV, 3)
##
## Clusters n
## 1 81
## 2 36
## 3 28
##
## Estimated basis vectors:
## Dir1 Dir2
## glucose 0.764699 0.86359
## insulin -0.643961 -0.22219
## sspg 0.023438 -0.45260
##
## Dir1 Dir2
## Eigenvalues 1.2629 0.35218
## Cum. % 78.1939 100.00000
plot(mod1dr, what = "scatterplot")
plot(mod1dr, what = "boundaries", ngrid = 200)
mod2dr <- MclustDR(mod2)
summary(mod2dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: EDDA
##
## Classes n Model G
## setosa 50 VEV 1
## versicolor 50 VEV 1
## virginica 50 VEV 1
##
## Estimated basis vectors:
## Dir1 Dir2 Dir3 Dir4
## Sepal.Length 0.17425 -0.193663 0.64081 -0.46231
## Sepal.Width 0.45292 0.066561 0.34852 0.57110
## Petal.Length -0.61629 -0.311030 -0.42366 0.46256
## Petal.Width -0.62024 0.928076 0.53703 -0.49613
##
## Dir1 Dir2 Dir3 Dir4
## Eigenvalues 0.94747 0.68835 0.076141 0.052607
## Cum. % 53.69408 92.70374 97.018700 100.000000
plot(mod2dr, what = "scatterplot")
plot(mod2dr, what = "boundaries", ngrid = 200)
mod3dr <- MclustDR(mod3)
summary(mod3dr)
## -----------------------------------------------------------------
## Dimension reduction for model-based clustering and classification
## -----------------------------------------------------------------
##
## Mixture model type: MclustDA
##
## Classes n Model G
## counterfeit 100 EVE 2
## genuine 100 XXX 1
##
## Estimated basis vectors:
## Dir1 Dir2 Dir3 Dir4 Dir5 Dir6
## Length -0.10139 -0.328225 0.797068 -0.033629 -0.3174275 0.085062
## Left -0.21718 -0.305014 -0.303111 -0.893349 0.3700659 -0.565410
## Right 0.29222 -0.018401 -0.495891 0.407413 -0.8612986 0.480799
## Bottom 0.57591 0.445352 0.120173 -0.034595 0.0043174 -0.078640
## Top 0.57542 0.385535 0.100865 -0.103623 0.1359128 0.625902
## Diagonal -0.44089 0.672250 -0.047784 -0.151252 -0.0443255 0.209691
##
## Dir1 Dir2 Dir3 Dir4 Dir5 Dir6
## Eigenvalues 0.87242 0.55373 0.48546 0.13291 0.053075 0.027273
## Cum. % 41.05755 67.11689 89.96377 96.21866 98.716489 100.000000
plot(mod3dr, what = "scatterplot")
plot(mod3dr, what = "boundaries", ngrid = 200)
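The relative importance of the estimated directions can also be displayed graphically, assuming the "evalues" option of the plot method for MclustDR objects, which plots the generalized eigenvalues reported in the summaries above:
# eigenvalues associated with the estimated directions
plot(mod3dr, what = "evalues")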
Most of the graphs produced by mclust use colors that by default are defined in the following options:
mclust.options("bicPlotColors")
## EII VII EEI EVI VEI VVI EEE
## "gray" "black" "#218B21" "#41884F" "#508476" "#58819C" "#597DC3"
## EVE VEE VVE EEV VEV EVV VVV
## "#5178EA" "#716EE7" "#9B60B8" "#B2508B" "#C03F60" "#C82A36" "#CC0000"
## E V
## "gray" "black"
mclust.options("classPlotColors")
## [1] "dodgerblue2" "red3" "green3" "slateblue"
## [5] "darkorange" "skyblue1" "violetred4" "forestgreen"
## [9] "steelblue4" "slategrey" "brown" "black"
## [13] "darkseagreen" "darkgoldenrod3" "olivedrab" "royalblue"
## [17] "tomato4" "cyan2" "springgreen2"
The first option controls the colors used for plotting BIC, ICL, etc. curves, whereas the second assigns the colors used to indicate clusters or classes when plotting data.
Color-blind-friendly palettes can be defined and assigned to the above options as follows:
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#999999", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
bicPlotColors <- mclust.options("bicPlotColors")
bicPlotColors[1:14] <- c(cbPalette, cbPalette[1:6])
mclust.options("bicPlotColors" = bicPlotColors)
mclust.options("classPlotColors" = cbPalette)
clPairs(iris[,-5], iris$Species)
mod <- Mclust(iris[,-5])
plot(mod, what = "BIC")
plot(mod, what = "classification")
The above color definitions are adapted from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/, but users can easily define their own palettes if needed.
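Since these option settings persist for the rest of the R session, the default palettes can be saved before being modified and restored afterwards with the same kind of call (a minimal sketch; variable names are arbitrary):
# save the default palettes before modifying them ...
defBicColors   <- mclust.options("bicPlotColors")
defClassColors <- mclust.options("classPlotColors")
# ... and restore them when done
mclust.options("bicPlotColors" = defBicColors)
mclust.options("classPlotColors" = defClassColors)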
Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233. https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf
Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97/458, pp. 611-631.
Fraley C., Raftery A. E., Murphy T. B. and Scrucca L. (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.