Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data

Mixture models are widely used for clustering and density estimation. In particular, the Bayesian nonparametric mixture model with a Dirichlet process prior has recently become popular for clustering because of its flexibility, allowing the number of mixture...

Bibliographic Details
Main Author:	Burhanuddin, Nurul Afiqah
Format:	Thesis
Language:	English
Published:	2024
Subjects:	Mixture (Mathematics) Clustering (Statistics) Bayesian statistical decision theory
Online Access:	http://psasir.upm.edu.my/id/eprint/118420/ http://psasir.upm.edu.my/id/eprint/118420/1/118420%20%28IR%29.pdf

_version_	1848867757933723648
author	Burhanuddin, Nurul Afiqah
building	UPM Institutional Repository
collection	Online Access
description	Mixture models are widely used for clustering and density estimation. In particular, the Bayesian nonparametric mixture model with a Dirichlet process prior has recently become popular for clustering because of its flexibility: the number of mixture components is allowed to grow without bound. In this thesis, we present modifications of Bayesian nonparametric methods for clustering mixed-type data, where the data comprise continuous, ordinal, and nominal variables. Many studies have shown successful applications of the Dirichlet process mixture (DPM) model for clustering continuous data. However, the recent DPM model for clustering mixed-type data assumes a common covariance matrix across clusters, which is too restrictive in practice. Accordingly, we develop a DPM model for clustering mixed-type data that allows cluster-specific covariance matrices. To demonstrate the flexibility of our model, we compare it with the common-covariance model. In this comparison, our model achieves higher Normalized Mutual Information (NMI) on simulated datasets with different cluster shapes and in two real data applications. Our model also recovers the true number of clusters in all cases, whereas the common-covariance model tends to overcluster the data. In multivariate data, not all variables contribute to cluster discrimination. To distinguish relevant from irrelevant clustering variables, the DPM model for mixed-type data is further extended by placing a hierarchical shrinkage prior on the component means, which can be thought of as implicit variable selection in clustering. The hierarchical shrinkage prior uses the normal-gamma prior for continuous and ordinal data and the grouped normal-gamma prior for nominal data. The proposed model is then compared with and without the shrinkage prior. The comparison shows that the model with the shrinkage prior achieves better clustering performance, with higher NMI values, especially on simulated datasets with highly overlapping clusters and on real datasets. Throughout the comparison, the model with the shrinkage prior also produces tighter clusters as measured by silhouette width. Furthermore, the proposed model successfully distinguishes relevant variables from noisy ones, as reflected by the higher NMI values observed when the model is fitted with only the relevant variables. The standard DPM model addresses unsupervised learning problems, in which the data are analyzed without any background knowledge. To make use of such knowledge when it is available, we develop a constrained DPM model that can incorporate labels as side information. These labels enter our formulation through a product partition prior that gives higher prior preference to clusters of observations with similar labels. The formulation is further extended to handle multiple sources of side information. Empirical results on several simulated and real datasets show that the model's clustering performance, measured by NMI, improves consistently as more labeled data become available. Even in the presence of noisy labels, the proposed model rarely performs worse than the standard unsupervised model, especially on continuous datasets. In the experiments with multiple side information, NMI also increases consistently as more side information becomes available.
first_indexed	2025-11-15T14:41:35Z
format	Thesis
id	upm-118420
institution	Universiti Putra Malaysia
institution_category	Local University
language	English
last_indexed	2025-11-15T14:41:35Z
publishDate	2024
recordtype	eprints
repository_type	Digital Repository
spelling	upm-118420 2025-08-04T07:34:29Z http://psasir.upm.edu.my/id/eprint/118420/ Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data Burhanuddin, Nurul Afiqah 2024-01 Thesis NonPeerReviewed text en http://psasir.upm.edu.my/id/eprint/118420/1/118420%20%28IR%29.pdf Burhanuddin, Nurul Afiqah (2024) Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data. Doctoral thesis, Universiti Putra Malaysia. http://ethesis.upm.edu.my/id/eprint/18377 Mixture (Mathematics) Clustering (Statistics) Bayesian statistical decision theory
spellingShingle	Mixture (Mathematics) Clustering (Statistics) Bayesian statistical decision theory Burhanuddin, Nurul Afiqah Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data
title	Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data
title_sort	bayesian nonparametric clustering with dirichlet process mixture model for mixed-type data
topic	Mixture (Mathematics) Clustering (Statistics) Bayesian statistical decision theory
url	http://psasir.upm.edu.my/id/eprint/118420/ http://psasir.upm.edu.my/id/eprint/118420/1/118420%20%28IR%29.pdf
