Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data

Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analys...

Bibliographic Details
Main Authors: Glaab, Enrico, Bacardit, Jaume, Garibaldi, Jonathan M., Krasnogor, Natalio
Format: Article
Published: Public Library of Science 2012
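Subjects: gene; protein; expression; microarray analysis; literature mining; classification; machine learning; prediction; cancer; cross-validation; sample classification; feature selection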
Online Access: https://eprints.nottingham.ac.uk/1651/
author Glaab, Enrico
Bacardit, Jaume
Garibaldi, Jonathan M.
Krasnogor, Natalio
building Nottingham Research Data Repository
collection Online Access
description Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
first_indexed 2025-11-14T18:15:56Z
format Article
id nottingham-1651
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T18:15:56Z
publishDate 2012
publisher Public Library of Science
recordtype eprints
repository_type Digital Repository
spelling nottingham-1651 2020-05-04T20:21:32Z https://eprints.nottingham.ac.uk/1651/ Public Library of Science 2012-07 Article PeerReviewed Glaab, Enrico, Bacardit, Jaume, Garibaldi, Jonathan M. and Krasnogor, Natalio (2012) Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS ONE, 7 (7). e39932. ISSN 1932-6203 http://dx.doi.org/10.1371/journal.pone.0039932 doi:10.1371/journal.pone.0039932
title Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data
topic gene
protein
expression
microarray analysis
literature mining
classification
machine learning
prediction
cancer
cross-validation
sample classification
feature selection
url https://eprints.nottingham.ac.uk/1651/
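The description in this record compares gene rankings by the pointwise mutual information (PMI) between disease terms and gene names. As a rough illustration of that measure only, the sketch below computes PMI from document counts. The function name `pmi_bits`, its parameters, and the example counts are hypothetical and are not taken from the article, which may estimate the probabilities differently.

```python
import math


def pmi_bits(n_both, n_gene, n_disease, n_total):
    """Pointwise mutual information (in bits) from document counts.

    PMI = log2( p(gene, disease) / (p(gene) * p(disease)) ), with each
    probability estimated as a count divided by n_total.
    """
    if n_total <= 0 or n_gene <= 0 or n_disease <= 0:
        raise ValueError("marginal counts and corpus size must be positive")
    if n_both == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log2((n_both * n_total) / (n_gene * n_disease))


# Hypothetical counts, for illustration only (not taken from the article).
score = pmi_bits(n_both=20, n_gene=100, n_disease=100, n_total=1000)
print(score)
```

A positive score means the gene name and the disease term co-occur more often than they would if they were independent; a score of zero means they co-occur exactly as often as independence predicts.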