SMOTE-PCADBSCAN: a novel approach for addressing class imbalance in water quality prediction
An accurate and trustworthy prediction model is essential for supporting policy decisions in environmental management concerning water quality prediction. Nonetheless, imbalanced datasets are prevalent in this discipline and hinder identifying crucial ecological factors accurately. This study propos...
| Main Authors: | Norashikin Nasaruddin, Nurulkamal Masseran, Wan Mohd Razi Idris, Ahmad Zia Ul-Saufie |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Penerbit Universiti Kebangsaan Malaysia 2025 |
| Online Access: | http://journalarticle.ukm.my/25821/ http://journalarticle.ukm.my/25821/1/SME%2017.pdf |
| _version_ | 1848816459531157504 |
|---|---|
| author | Norashikin Nasaruddin, Nurulkamal Masseran, Wan Mohd Razi Idris, Ahmad Zia Ul-Saufie |
| author_facet | Norashikin Nasaruddin, Nurulkamal Masseran, Wan Mohd Razi Idris, Ahmad Zia Ul-Saufie |
| author_sort | Norashikin Nasaruddin |
| building | UKM Institutional Repository |
| collection | Online Access |
| description | An accurate and trustworthy prediction model is essential for supporting water quality policy decisions in environmental management. Nonetheless, imbalanced datasets are prevalent in this discipline and hinder the accurate identification of crucial ecological factors. This study proposed a novel SMOTE-PCADBSCAN model to enhance the categorisation of water quality data by combining three key components: (i) synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was first augmented using SMOTE, after which PCA reduced the dimensionality of the data. DBSCAN was then used to improve the quality of the synthetic data by detecting and eliminating extraneous data points. A Malaysia-based multi-class water quality dataset was employed to assess the performance of this model. Five classifier types were applied to four versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN): (i) decision tree, (ii) random forest, (iii) gradient boosting method, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original dataset yielded high accuracy, class imbalance hampered the detection of minority classes. Among the datasets, the SMOTE-DBSCAN and SMOTE-PCADBSCAN synthetic datasets achieved superior performance metrics. Random forest (RF) with the SMOTE-PCADBSCAN approach achieved the highest accuracy and best F1 scores, demonstrating excellent water quality classification and handling of imbalanced data. These results suggest that advanced oversampling techniques combined with ensemble approaches can enhance classification accuracy on imbalanced environmental datasets. |
| first_indexed | 2025-11-15T01:06:13Z |
| format | Article |
| id | oai:generic.eprints.org:25821 |
| institution | Universiti Kebangsaan Malaysia |
| institution_category | Local University |
| language | English |
| last_indexed | 2025-11-15T01:06:13Z |
| publishDate | 2025 |
| publisher | Penerbit Universiti Kebangsaan Malaysia |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | oai:generic.eprints.org:25821 2025-09-04T08:06:48Z http://journalarticle.ukm.my/25821/ Penerbit Universiti Kebangsaan Malaysia 2025 Article PeerReviewed application/pdf en http://journalarticle.ukm.my/25821/1/SME%2017.pdf Norashikin Nasaruddin, Nurulkamal Masseran, Wan Mohd Razi Idris and Ahmad Zia Ul-Saufie (2025) SMOTE-PCADBSCAN: a novel approach for addressing class imbalance in water quality prediction. Sains Malaysiana, 54 (6). pp. 1629-1639. ISSN 0126-6039 https://www.ukm.my/jsm/english_journals/vol54num6_2025/contentsVol54num6_2025.html |
| title | SMOTE-PCADBSCAN: a novel approach for addressing class imbalance in water quality prediction |
| url | http://journalarticle.ukm.my/25821/ http://journalarticle.ukm.my/25821/1/SME%2017.pdf |
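
The three-stage pipeline named in the description (SMOTE oversampling, PCA dimensionality reduction, DBSCAN noise removal) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the function names, the parameters (`k`, `eps`, `min_samples`, `n_components`), the power-iteration PCA, and the choice to filter only synthetic points of a single minority class are all assumptions made for illustration, since the record does not specify them.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def pca_project(points, n_components=2, iters=200):
    """Project points onto their leading principal components, found by
    power iteration with deflation (assumed stand-in for a library PCA)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centred = [[p[j] - mean[j] for j in range(dim)] for p in points]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    components = []
    for _comp in range(n_components):
        v = [1.0] * dim
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[a] * cov[a][b] * v[b]
                  for a in range(dim) for b in range(dim))
        # Deflate: remove the found component before searching for the next.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(dim)]
               for a in range(dim)]
        components.append(v)
    return [[sum(r[j] * c[j] for j in range(dim)) for c in components]
            for r in centred]


def dbscan_noise_mask(points, eps, min_samples):
    """Return True for points DBSCAN would label as noise: neither a core
    point nor within eps of a core point. Neighbourhoods include the point."""
    n = len(points)
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    core = [len(nb) >= min_samples for nb in neigh]
    return [not core[i] and not any(core[j] for j in neigh[i])
            for i in range(n)]


def smote_pca_dbscan(minority, n_new, eps, min_samples, n_components=2, seed=0):
    """Oversample, project with PCA, then keep only the synthetic points
    that are not flagged as noise in the projected space (assumed design)."""
    synthetic = smote(minority, n_new, rng=random.Random(seed))
    projected = pca_project(minority + synthetic, n_components)
    noise = dbscan_noise_mask(projected, eps, min_samples)
    return [s for s, is_noise in zip(synthetic, noise[len(minority):])
            if not is_noise]


if __name__ == "__main__":
    # Hypothetical minority-class samples with three features each.
    samples = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0.5, 0.5, 0.1]]
    kept = smote_pca_dbscan(samples, n_new=6, eps=1.0, min_samples=3)
    print(len(kept), "synthetic points kept")
```

In practice, library implementations such as imbalanced-learn's `SMOTE` and scikit-learn's `PCA` and `DBSCAN` would normally replace these hand-written helpers, and `eps` and `min_samples` would need tuning for the dataset at hand.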