SMOTE-PCADBSCAN: a novel approach for addressing class imbalance in water quality prediction

Bibliographic Details
Main Authors: Norashikin Nasaruddin, Nurulkamal Masseran, Wan Mohd Razi Idris, Ahmad Zia Ul-Saufie
Format: Article
Language: English
Published: Penerbit Universiti Kebangsaan Malaysia 2025
Online Access: http://journalarticle.ukm.my/25821/
http://journalarticle.ukm.my/25821/1/SME%2017.pdf
_version_ 1848816459531157504
author Norashikin Nasaruddin,
Nurulkamal Masseran,
Wan Mohd Razi Idris,
Ahmad Zia Ul-Saufie,
building UKM Institutional Repository
collection Online Access
description An accurate and trustworthy prediction model is essential for supporting policy decisions in environmental management that depend on water quality prediction. However, imbalanced datasets are prevalent in this field and hinder the accurate identification of crucial ecological factors. This study proposed a novel SMOTE-PCADBSCAN model to improve the classification of water quality data by combining three components: (i) synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was first augmented using SMOTE, after which PCA reduced the dimensionality. DBSCAN was then used to improve the quality of the synthetic data by detecting and eliminating extraneous data points. A Malaysia-based multi-class water quality dataset was used to evaluate the model. Four versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN) were analysed with five classifier types: (i) decision tree, (ii) random forest (RF), (iii) gradient boosting method, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original dataset yielded high accuracy, class imbalance impaired the detection of minority classes. Among the datasets, the SMOTE-DBSCAN and SMOTE-PCADBSCAN synthetic datasets yielded superior performance metrics. RF with the SMOTE-PCADBSCAN approach achieved the highest accuracy and the best F1 scores, indicating excellent water quality classification and handling of imbalanced data. Consequently, the classification accuracy of imbalanced environmental datasets could be enhanced by employing advanced oversampling techniques and ensemble approaches.
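The description names the three stages (SMOTE, PCA, DBSCAN) but not their parameters or exact wiring. The following Python block is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that DBSCAN is run on the PCA scores of the oversampled minority class and that only synthetic points labelled as noise are discarded. The function names `smote_like` and `smote_pca_dbscan`, the hand-rolled SMOTE-style interpolation, and the values of `n_components`, `eps`, and `min_samples` are all illustrative assumptions; the toy data is randomly generated and is not the water quality dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_like(X_min, n_new, k=5, rng=None):
    """SMOTE-style oversampling (hypothetical helper): interpolate between a
    minority point and one of its k nearest minority neighbours."""
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 is normally the query point itself, so it is dropped.
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neigh[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])


def smote_pca_dbscan(X, y, minority, n_components=2, eps=0.5,
                     min_samples=5, rng=0):
    """Oversample one minority class, project it with PCA, and drop the
    synthetic points that DBSCAN labels as noise (assumed wiring)."""
    X_min = X[y == minority]
    _, counts = np.unique(y, return_counts=True)
    n_new = int(counts.max() - len(X_min))  # match the largest class
    X_syn = smote_like(X_min, n_new, rng=rng)

    # Standardise, then reduce real + synthetic minority points together.
    X_all = np.vstack([X_min, X_syn])
    Z = PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(X_all))

    # DBSCAN assigns the label -1 to points it treats as noise.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)
    keep_syn = X_syn[labels[len(X_min):] != -1]

    X_out = np.vstack([X, keep_syn])
    y_out = np.concatenate(
        [y, np.full(len(keep_syn), minority, dtype=y.dtype)])
    return X_out, y_out


# Toy two-class data standing in for an imbalanced dataset.
toy_rng = np.random.default_rng(0)
X = np.vstack([toy_rng.normal(0, 1, (200, 4)),
               toy_rng.normal(3, 1, (20, 4))])
y = np.array([0] * 200 + [1] * 20)
X_bal, y_bal = smote_pca_dbscan(X, y, minority=1)
print(np.unique(y_bal, return_counts=True))
```

Original observations are always retained in this sketch; only synthetic points can be filtered out, so the minority class never shrinks below its original size. For a multi-class dataset the function would be applied once per minority class, and `eps` and `min_samples` would need tuning to the data.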
first_indexed 2025-11-15T01:06:13Z
format Article
id oai:generic.eprints.org:25821
institution Universiti Kebangsaan Malaysia
institution_category Local University
language English
last_indexed 2025-11-15T01:06:13Z
publishDate 2025
publisher Penerbit Universiti Kebangsaan Malaysia
recordtype eprints
repository_type Digital Repository
spelling oai:generic.eprints.org:25821 2025-09-04T08:06:48Z http://journalarticle.ukm.my/25821/ Article PeerReviewed application/pdf en http://journalarticle.ukm.my/25821/1/SME%2017.pdf Norashikin Nasaruddin, Nurulkamal Masseran, Wan Mohd Razi Idris and Ahmad Zia Ul-Saufie (2025) SMOTE-PCADBSCAN: a novel approach for addressing class imbalance in water quality prediction. Sains Malaysiana, 54 (6). pp. 1629-1639. ISSN 0126-6039 https://www.ukm.my/jsm/english_journals/vol54num6_2025/contentsVol54num6_2025.html
title SMOTE-PCADBSCAN: a novel approach for addressing class imbalance in water quality prediction
url http://journalarticle.ukm.my/25821/
http://journalarticle.ukm.my/25821/1/SME%2017.pdf