Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data

Bibliographic Details
Main Author: Burhanuddin, Nurul Afiqah
Format: Thesis
Language: English
Published: 2024
Subjects: Mixture (Mathematics); Clustering (Statistics); Bayesian statistical decision theory
Online Access: http://psasir.upm.edu.my/id/eprint/118420/
http://psasir.upm.edu.my/id/eprint/118420/1/118420%20%28IR%29.pdf
author Burhanuddin, Nurul Afiqah
building UPM Institutional Repository
collection Online Access
description Mixture models are widely used for clustering and density estimation. In particular, the Bayesian nonparametric mixture model with a Dirichlet process prior has recently become popular for clustering because of its flexibility: the number of mixture components is not fixed in advance and can grow without bound. In this thesis, we present modifications of Bayesian nonparametric methods for clustering mixed-type data, that is, data comprising continuous, ordinal, and nominal variables.
Many studies have shown successful applications of the Dirichlet process mixture (DPM) model to clustering continuous data. However, the recent DPM model for clustering mixed-type data assumes a common covariance matrix across clusters, which is too restrictive in practice. Accordingly, we develop a DPM model for mixed-type data that allows cluster-specific covariance matrices. To demonstrate the flexibility of our model, we compare it with the common-covariance model. Our model achieves higher Normalized Mutual Information (NMI) on simulated datasets with different cluster shapes and in two real data applications. It also recovers the true number of clusters in all cases, whereas the common-covariance model tends to overcluster the data.
In multivariate data, not all variables contribute to cluster discrimination. To distinguish relevant from irrelevant clustering variables, we further extend the DPM model for mixed-type data by placing a hierarchical shrinkage prior on the component means, which can be thought of as implicit variable selection in clustering. The hierarchical shrinkage prior uses the normal-gamma prior for continuous and ordinal variables and the grouped normal-gamma prior for nominal variables. We then compare the proposed model with and without the shrinkage prior. The model with the shrinkage prior achieves better clustering performance, with higher NMI values, especially on simulated datasets with highly overlapping clusters and on real datasets. Throughout the comparison, it also produces tighter clusters, as measured by silhouette width. Furthermore, the proposed model successfully distinguishes relevant variables from noisy ones, as reflected by the higher NMI values observed when the model is fitted with only the relevant variables.
The standard DPM model addresses unsupervised learning problems, in which the data are analyzed without any background knowledge. To bring such knowledge into the clustering process when it is available, we develop a constrained DPM model that incorporates labels as side information. The labels enter our formulation through a product partition prior that gives higher prior preference to clusters whose observations share similar labels. The formulation is further extended to handle multiple sources of side information. Empirical results on several simulated and real datasets show that the clustering performance of our model, in terms of NMI, improves consistently as more labeled data become available. Even in the presence of noisy labels, the proposed model rarely performs worse than the standard unsupervised model, especially on continuous datasets. In the experiments with multiple sources of side information, NMI also increases consistently as more side information becomes available.
first_indexed 2025-11-15T14:41:35Z
format Thesis
id upm-118420
institution Universiti Putra Malaysia
institution_category Local University
language English
last_indexed 2025-11-15T14:41:35Z
publishDate 2024
recordtype eprints
repository_type Digital Repository
spelling upm-118420 2025-08-04T07:34:29Z http://psasir.upm.edu.my/id/eprint/118420/ 2024-01 Thesis NonPeerReviewed text en http://psasir.upm.edu.my/id/eprint/118420/1/118420%20%28IR%29.pdf
Burhanuddin, Nurul Afiqah (2024) Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data. Doctoral thesis, Universiti Putra Malaysia. http://ethesis.upm.edu.my/id/eprint/18377
title Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data
topic Mixture (Mathematics)
Clustering (Statistics)
Bayesian statistical decision theory
url http://psasir.upm.edu.my/id/eprint/118420/
http://psasir.upm.edu.my/id/eprint/118420/1/118420%20%28IR%29.pdf