Building classification models from imbalanced fraud detection data / Terence Yong Koon Beh, Swee Chuan Tan and Hwee Theng Yeo

Bibliographic Details
Main Authors: Terence Yong Koon Beh, Swee Chuan Tan, Hwee Theng Yeo
Format: Article
Language: English
Published: Penerbit UiTM 2014
Online Access: https://ir.uitm.edu.my/id/eprint/13930/
Description: Many real-world data sets exhibit imbalanced class distributions in which almost all instances are assigned to one class and far fewer instances to a smaller, yet usually interesting class. Building classification models from such imbalanced data sets is a relatively new challenge in the machine learning and data mining community because many traditional classification algorithms assume similar proportions of majority and minority classes. When the data is imbalanced, these algorithms generate models that achieve good classification accuracy for the majority class, but poor accuracy for the minority class. This paper reports our experience in applying data balancing techniques to develop a classifier for an imbalanced real-world fraud detection data set. We evaluated the models generated from seven classification algorithms with two simple data balancing techniques. Despite the many ideas proposed in the literature to tackle the class imbalance issue, our study shows that the simplest data balancing technique is all that is required to significantly improve the accuracy in identifying the primary class of interest (i.e., the minority class) in all seven algorithms tested. Our results also show that precision and recall are useful and effective measures for evaluating models created from artificially balanced data. Hence, we advise data mining practitioners to try simple data balancing first before exploring more sophisticated techniques to tackle the class imbalance problem.
Institution: Universiti Teknologi MARA
Citation: Terence Yong Koon Beh, Swee Chuan Tan and Hwee Theng Yeo (2014) Building classification models from imbalanced fraud detection data. Malaysian Journal of Computing (MJoC) <https://ir.uitm.edu.my/view/publication/Malaysian_Journal_of_Computing_=28MJoC=29.html>, 2 (2). pp. 13-33. ISSN 2231-7473 (peer reviewed)
Full text: https://ir.uitm.edu.my/id/eprint/13930/1/13930.pdf
Journal: https://mjoc.uitm.edu.my/
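
The abstract's two main points, that simple data balancing helps and that precision and recall are more informative than plain accuracy, can be illustrated with a minimal sketch. This record does not name the two balancing techniques the authors used, so random undersampling of the majority class is assumed here purely as an example of a simple technique. The function names (`undersample`, `precision_recall`) and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def undersample(rows, labels, minority_label, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Assumption: this is one example of a "simple" balancing technique;
    the record does not say which techniques the paper actually used.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    keep = minority + rng.sample(majority, min(len(minority), len(majority)))
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class of interest (here, the minority class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up for illustration): 95 legitimate (0) and 5 fraudulent (1) records.
labels = [0] * 95 + [1] * 5
rows = list(range(100))

# Balancing leaves equal numbers of each class for training.
balanced_rows, balanced_labels = undersample(rows, labels, minority_label=1)
class_counts = Counter(balanced_labels)

# A model that always predicts "legitimate" looks accurate on the imbalanced
# data, but precision and recall on the fraud class expose that it finds nothing.
always_legit = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_legit)) / len(labels)
precision, recall = precision_recall(labels, always_legit, positive=1)
```

In this toy setting the always-legitimate predictor is right on every majority record while detecting no fraud at all, which is why the minority-class precision and recall are the numbers worth watching when comparing models trained on balanced data.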