Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark

The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to tackle this issue. Such a challenging task may become even more difficult when the number of the majority...


Bibliographic Details
Main Authors: Triguero, Isaac; Galar, M.; Merino, D.; Maillo, Jesus; Bustince, H.; Herrera, Francisco
Format: Conference or Workshop Item
Published: 2016
Subjects:
Online Access: https://eprints.nottingham.ac.uk/38876/
_version_ 1848795710220140544
building Nottingham Research Data Repository
collection Online Access
description The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to this issue. The task becomes even more difficult when the number of majority class examples is very large. In this scenario, the evolutionary model becomes impractical because of memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing the data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. To address this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, allowing us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model on several datasets with up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
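The lack-of-density problem mentioned in the description, and the idea of handling the two classes independently, can be illustrated with a minimal sketch. This is plain Python rather than Spark, and it is not the authors' implementation: the function names are hypothetical, the evolutionary undersampling step is omitted, and copying every minority example into every subset is an assumption made for illustration (the description only states that a higher number of minority examples is kept per subset).

```python
import random


def naive_split(majority, minority, n_parts, seed=0):
    """Shuffle both classes together and cut the result into n_parts subsets."""
    rng = random.Random(seed)
    data = [(x, 0) for x in majority] + [(x, 1) for x in minority]
    rng.shuffle(data)
    return [data[i::n_parts] for i in range(n_parts)]


def class_aware_split(majority, minority, n_parts, seed=0):
    """Split only the majority class; attach all minority examples to each subset.

    Assumption for illustration: the full minority class is replicated.
    """
    rng = random.Random(seed)
    maj = list(majority)
    rng.shuffle(maj)
    minority_part = [(x, 1) for x in minority]
    return [[(x, 0) for x in maj[i::n_parts]] + minority_part
            for i in range(n_parts)]


def minority_counts(parts):
    """Number of minority (label 1) examples in each subset."""
    return [sum(label for _, label in part) for part in parts]


# Toy data: 10,000 majority examples and 20 minority examples, 50 subsets.
majority = range(10_000)
minority = range(10_000, 10_020)

naive = naive_split(majority, minority, n_parts=50)
aware = class_aware_split(majority, minority, n_parts=50)

# With 20 minority examples spread over 50 subsets, the naive split must
# leave at least 30 subsets with no minority example at all.
print("naive subsets without minority:", minority_counts(naive).count(0))
print("class-aware minority per subset:", set(minority_counts(aware)))
```

In a Spark setting the minority examples would more plausibly be shared with the workers (for instance as a broadcast variable) than copied into Python lists, but that detail is likewise an assumption and not something the record states.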
first_indexed 2025-11-14T19:36:25Z
format Conference or Workshop Item
id nottingham-38876
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:36:25Z
publishDate 2016
recordtype eprints
repository_type Digital Repository
spelling nottingham-38876 2020-05-04T17:59:57Z https://eprints.nottingham.ac.uk/38876/ 2016-07-24 Conference or Workshop Item PeerReviewed
Triguero, Isaac, Galar, M., Merino, D., Maillo, Jesus, Bustince, H. and Herrera, Francisco (2016) Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark. In: 2016 IEEE Congress on Evolutionary Computation (CEC), 24-29 July, 2016, Vancouver, Canada.
Big data; Sparks; Data mining; Data models; Biological cells; Proposals; Standards
http://ieeexplore.ieee.org/document/7743853/
title Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark
topic Big data
Sparks
Data mining
Data models
Biological cells
Proposals
Standards
url https://eprints.nottingham.ac.uk/38876/