A review of feature selection methods on diabetes mellitus classification


Bibliographic Details
Main Authors: Nur Farahaina Idris; Mohd Arfian Ismail; Shahreen Kasim; Rohayanti Hassan; Deshinta Arrova Dewi; Abdullah Munzir Mohd Fauzi; Rahmat Hidayat
Format: Article
Language: English
Published: Indonesian Society for Knowledge and Human Development 2025
Online Access: https://umpir.ump.edu.my/id/eprint/45191/
Description
Summary: Diabetes is a leading cause of death in the United States and leads to serious health complications. In recent decades, artificial intelligence and its subfield, machine learning, have been increasingly used to aid disease diagnosis. Machine learning methods must be robust enough to handle the variability in diabetes datasets, which often encompass diverse patient demographics, clinical characteristics, and environmental factors. This motivates researchers to develop suitable feature selection methods that complement machine learning methods, thereby reducing time and complexity. However, feature selection may lower classification accuracy by inadvertently removing essential features, or it may increase the time required because of repetitive processes during evaluation. Hence, thorough reviews of feature selection methods for diabetes classification are being conducted to evaluate their effectiveness. There are three primary categories of feature selection methods: embedded, wrapper, and filter. Each category has distinct mechanisms and effects on the classification process. This study reviewed feature selection methods in each category, such as Random Forest (embedded), the Chi-Square test (filter), and Recursive Feature Elimination (wrapper). The Chi-Square test is efficient only with categorical features, Random Forest is effective but incurs high complexity and increased time due to its ensemble nature, and Recursive Feature Elimination produces the best performance but is not well suited to high-dimensional data. The findings indicate that Recursive Feature Elimination is more suitable for diabetes classification, as it is fast and yields good performance.
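As an illustration of the filter category mentioned in the summary, the following is a minimal sketch of Chi-Square feature ranking on categorical features, written in plain Python. It is not taken from the article: the function names `chi_square_score` and `rank_features`, the feature names `glucose_band` and `smoker`, and the eight toy records are all hypothetical and were made up for this example. The sketch computes only the Pearson chi-square statistic (no continuity correction, no p-value) and ranks features by it.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the class label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # joint counts per (value, class)
    feature_totals = Counter(feature_values)         # marginal counts per feature value
    label_totals = Counter(labels)                   # marginal counts per class
    score = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Count expected in this cell if feature and class were independent
            expected = value_count * label_count / n
            score += (observed[(value, label)] - expected) ** 2 / expected
    return score


def rank_features(rows, labels):
    """Return (feature name, score) pairs sorted from most to least class-dependent."""
    names = list(rows[0].keys())
    scores = {
        name: chi_square_score([row[name] for row in rows], labels)
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records, invented purely for illustration (not real patient data)
rows = [
    {"glucose_band": "high", "smoker": "yes"},
    {"glucose_band": "high", "smoker": "no"},
    {"glucose_band": "high", "smoker": "yes"},
    {"glucose_band": "high", "smoker": "no"},
    {"glucose_band": "normal", "smoker": "yes"},
    {"glucose_band": "normal", "smoker": "no"},
    {"glucose_band": "normal", "smoker": "yes"},
    {"glucose_band": "normal", "smoker": "no"},
]
labels = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = diabetic, 0 = non-diabetic (toy labels)

ranking = rank_features(rows, labels)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

On this toy data `glucose_band` scores 2.00 and `smoker` scores 0.00, so a filter keeping the top-ranked feature would retain `glucose_band`. Continuous measurements would first need to be binned into categories, which is consistent with the summary's remark that the Chi-Square test is efficient only with categorical features.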