An exploration of improvements to semi-supervised fuzzy c-means clustering for real-world biomedical data

Bibliographic Details
Main Author: Lai, Daphne Teck Ching
Format: Thesis (University of Nottingham only)
Language: English
Published: 2014
Subjects:
Online Access: https://eprints.nottingham.ac.uk/14232/
_version_ 1848791909243289600
building Nottingham Research Data Repository
collection Online Access
description This thesis explores various detailed improvements to semi-supervised learning (using labelled data to guide clustering or classification of unlabelled data) with fuzzy c-means clustering (a ‘soft’ clustering technique which allows data patterns to be assigned to multiple clusters using membership values), with the primary aim of creating a semi-supervised fuzzy clustering algorithm that performs well on real-world data. There are two main objectives in this work. The first objective is to explore novel technical improvements to semi-supervised fuzzy c-means (ssFCM) that can address the problem of initialisation sensitivity and improve results. The second objective is to apply the developed algorithm to real biomedical data, such as the Nottingham Tenovus Breast Cancer (NTBC) dataset, to create an automatic methodology for identifying stable subgroups which have previously been elicited semi-manually. Investigations were conducted into detailed improvements to the ssFCM algorithm framework, including a range of distance metrics, initialisation and feature selection techniques, and scaling parameter values. These methodologies were tested on different data sources to demonstrate their generalisation properties. Evaluation results were compared between methodologies on various University of California, Irvine (UCI) benchmark datasets to determine suitable techniques. Results were promising, suggesting that initialisation techniques, feature selection and scaling parameter adjustment can increase ssFCM performance. Based on these investigations, a novel ssFCM framework was developed and applied to the NTBC dataset, and various statistical and biological evaluations were conducted. These demonstrated a highly significant improvement in agreement with previous classifications, with solutions that are biologically useful and clinically relevant in comparison with Soria's study [141]. In comparison with the latest NTBC study by Green et al. [63], similar clinical results were observed, confirming the stability of the subgroups. Two main contributions to knowledge have been made in this work. Firstly, the ssFCM framework has been improved through various technical refinements, which may be used together or separately. Secondly, the NTBC dataset has been successfully clustered automatically (in a single algorithm) into clinical subgroups which had previously been elucidated semi-manually. While the results are very promising, it is important to note that full, detailed validation of the framework has only been carried out on the NTBC dataset, and so there is a limit on the general conclusions that may be drawn. Future studies include applying the framework to other biomedical datasets and incorporating distance metric learning into ssFCM. In conclusion, an enhanced ssFCM framework has been proposed and has been demonstrated to achieve a highly significant improvement in accuracy on the NTBC dataset.
first_indexed 2025-11-14T18:36:00Z
format Thesis (University of Nottingham only)
id nottingham-14232
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T18:36:00Z
publishDate 2014
recordtype eprints
repository_type Digital Repository
spelling nottingham-14232 2025-02-28T11:29:32Z https://eprints.nottingham.ac.uk/14232/ 2014-07-15 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/14232/1/correction_noblue.pdf Lai, Daphne Teck Ching (2014) An exploration of improvements to semi-supervised fuzzy c-means clustering for real-world biomedical data. PhD thesis, University of Nottingham. clustering semi-supervised learning breast cancer biomedical data
topic clustering
semi-supervised learning
breast cancer
biomedical data
url https://eprints.nottingham.ac.uk/14232/