ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem

The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc...

Bibliographic Details
Main Authors: Triguero, Isaac; del Río, Sara; López, Victoria; Bacardit, Jaume; Benítez, José M.; Herrera, Francisco
Format: Article
Published: Elsevier 2015
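Subjects: Bioinformatics; Big data; Hadoop; MapReduce; Imbalance classification; Evolutionary feature selection
Online Access: https://eprints.nottingham.ac.uk/45418/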
_version_ 1848797127278329856
author Triguero, Isaac
del Río, Sara
López, Victoria
Bacardit, Jaume
Benítez, José M.
Herrera, Francisco
author_sort Triguero, Isaac
building Nottingham Research Data Repository
collection Online Access
description The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these problems, such as contact map prediction (the problem tackled in this paper), it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL’14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition and present an extensive experimental study that characterizes the behavior of our methodology. From this analysis we can conclude that this approach is well suited to tackling large-scale bioinformatics classification problems.
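The first two steps named in the description (random oversampling, then keeping only features whose weight passes a threshold) can be illustrated with a small single-machine sketch. This is not the authors' implementation: it does not use MapReduce, the function names `random_oversample` and `select_features` are hypothetical, the feature weights are made-up numbers standing in for the output of the evolutionary weighting process, and the Random Forest training and test classification steps are omitted.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Replicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def select_features(weights, threshold):
    """Return the indices of features whose weight is at least `threshold`."""
    return [i for i, w in enumerate(weights) if w >= threshold]


def project(rows, keep):
    """Keep only the selected feature columns of each row."""
    return [[row[i] for i in keep] for row in rows]


# Toy data: six negative rows and two positive rows, three features each.
rows = [[i, i + 1, i + 2] for i in range(8)]
labels = [0] * 6 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
# Hypothetical weights; in the paper they come from an evolutionary search.
keep = select_features([0.9, 0.1, 0.6], threshold=0.5)
reduced_rows = project(balanced_rows, keep)
print(Counter(balanced_labels), keep, len(reduced_rows))
```

The reduced, balanced data would then be passed to whatever Random Forest implementation is available; the choice of threshold here is arbitrary and only serves the example.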
first_indexed 2025-11-14T19:58:56Z
format Article
id nottingham-45418
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:58:56Z
publishDate 2015
publisher Elsevier
recordtype eprints
repository_type Digital Repository
spelling nottingham-45418 2020-05-04T20:07:05Z https://eprints.nottingham.ac.uk/45418/
Triguero, Isaac, del Río, Sara, López, Victoria, Bacardit, Jaume, Benítez, José M. and Herrera, Francisco (2015) ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, 87. pp. 69-79. ISSN 1872-7409
Elsevier 2015-10 Article PeerReviewed
Bioinformatics; Big data; Hadoop; MapReduce; Imbalance classification; Evolutionary feature selection
http://www.sciencedirect.com/science/article/pii/S0950705115002130
doi:10.1016/j.knosys.2015.05.027
title ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem
topic Bioinformatics; Big data; Hadoop; MapReduce; Imbalance classification; Evolutionary feature selection
url https://eprints.nottingham.ac.uk/45418/