A model for enhancing pattern recognition in clinical narrative datasets through text-based feature selection and SHAP technique


Bibliographic Details
Main Authors: Dalhatu, Sirajo Muhammad, Azmi Murad, Masrah Azrifah
Format: Article
Language: English
Published: Politeknik Negeri Padang 2024
Online Access: http://psasir.upm.edu.my/id/eprint/118176/
http://psasir.upm.edu.my/id/eprint/118176/1/118176.pdf
building UPM Institutional Repository
collection Online Access
description Clinical narratives contain crucial patient information for predicting cardiac failure. Accurate and timely cardiac failure recognition (CFR) significantly affects patient outcomes, but it is hampered by limited dataset sizes, sparse feature spaces, and underuse of vital sign data. This study addresses these issues by developing a methodology to improve CFR accuracy and interpretability within clinical narratives. Four datasets (the Framingham Heart Study, Heart Disease from Kaggle, Cleveland Heart Disease, and Heart Failure Clinical Records) undergo preprocessing: handling missing values, removing duplicates, scaling, encoding categorical variables, and transforming unstructured data using natural language processing (NLP). Feature selection methods (Chi-Squared, Forward Selection, L1 Regularization) are used to identify influential features for CFR, and the SHapley Additive exPlanations (SHAP) technique is integrated to improve interpretability. Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF) models are trained and evaluated using accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). Results indicate that L1 Regularization with LR and Chi-Squared with RF perform best for specific datasets. The final model, which combines all datasets with Forward Selection and RF, achieves an accuracy of 91%, precision of 87%, recall of 97%, F1-score of 91%, and AUC-ROC of 94%. The study concludes that advanced text-based feature selection and SHAP interpretability significantly enhance CFR model accuracy and transparency, aiding clinical decision-making. Future research should incorporate more diverse datasets, explore advanced NLP techniques, and validate models in various clinical settings to enhance robustness and applicability.
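The description names Chi-Squared feature selection and the precision, recall, and F1-score metrics without showing how they are computed. The sketch below is an illustration only, not the authors' implementation: it uses the plain Pearson chi-squared test of independence on a 2x2 table, and the feature names (`mentions_edema`, `mentions_cough`), labels, and one-rule predictor are hypothetical toy data invented for the example.

```python
def chi2_binary(feature, label):
    """Pearson chi-squared statistic for a binary feature against a binary label."""
    n = len(feature)
    # observed[f][y] = number of rows with feature value f and label y
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row_totals = [sum(observed[0]), sum(observed[1])]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            expected = row_totals[f] * col_totals[y] / n
            if expected > 0:
                stat += (observed[f][y] - expected) ** 2 / expected
    return stat


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy data: 1 = cardiac failure, 0 = no cardiac failure.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical binary text features (1 = term appears in the narrative).
features = {
    "mentions_edema": [1, 1, 1, 0, 0, 0, 0, 1],
    "mentions_cough": [1, 0, 1, 0, 1, 0, 1, 0],
}

# Rank features by their chi-squared score, highest first.
scores = {name: chi2_binary(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Use the top-ranked feature as a trivial one-rule predictor and score it.
metrics = classification_metrics(labels, features[ranking[0]])
print(ranking, scores, metrics)
```

A higher chi-squared score means the feature's presence is less independent of the label, which is why it ranks first. Library implementations may define the score differently (for example, by treating feature values as counts), so scores from this sketch are not directly comparable to theirs.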
id upm-118176
institution Universiti Putra Malaysia
institution_category Local University
datestamp 2025-06-26T04:58:07Z
citation Dalhatu, Sirajo Muhammad and Azmi Murad, Masrah Azrifah (2024) A model for enhancing pattern recognition in clinical narrative datasets through text-based feature selection and SHAP technique. International Journal on Informatics Visualization, 8 (4). pp. 2287-2296. ISSN 2549-9610; eISSN: 2549-9904
publisher_url https://www.joiv.org/index.php/joiv/article/view/3664/1158
doi 10.62527/joiv.8.4.3664
status PeerReviewed
license cc_by_sa_4