Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

Motivation: A rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput expe...

Full description

Bibliographic Details
Main Author: Glaab, Enrico
Format: Thesis (University of Nottingham only)
Language:English
Published: 2011
Subjects:
Online Access:https://eprints.nottingham.ac.uk/12727/
_version_ 1848791568707747840
author Glaab, Enrico
author_facet Glaab, Enrico
author_sort Glaab, Enrico
building Nottingham Research Data Repository
collection Online Access
description Motivation: A rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large- scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA , EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models is optimized in addition to the predictive accuracy and robustness. The framework was applied to real-word biomedical problems, with a focus on cancer biology, providing the following main results: (1) The identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining) (2) The prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre) (3) The prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs) (4) The discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre) In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use software and has provided new biological insights in a wide variety of practical settings.
first_indexed 2025-11-14T18:30:35Z
format Thesis (University of Nottingham only)
id nottingham-12727
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T18:30:35Z
publishDate 2011
recordtype eprints
repository_type Digital Repository
spelling nottingham-127272025-02-28T11:21:02Z https://eprints.nottingham.ac.uk/12727/ Analysing functional genomics data using novel ensemble, consensus and data fusion techniques Glaab, Enrico Motivation: A rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large- scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA , EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models is optimized in addition to the predictive accuracy and robustness. The framework was applied to real-word biomedical problems, with a focus on cancer biology, providing the following main results: (1) The identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining) (2) The prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre) (3) The prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs) (4) The discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre) In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use software and has provided new biological insights in a wide variety of practical settings. 2011-10-15 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/12727/1/thesis_hardbound_final.pdf Glaab, Enrico (2011) Analysing functional genomics data using novel ensemble, consensus and data fusion techniques. PhD thesis, University of Nottingham. Gene protein expression microarray analysis pathway analysis omics machine learning network analysis topology clustering prediction feature selection
spellingShingle Gene
protein
expression
microarray analysis
pathway analysis
omics
machine learning
network analysis
topology
clustering
prediction
feature selection
Glaab, Enrico
Analysing functional genomics data using novel ensemble, consensus and data fusion techniques
title Analysing functional genomics data using novel ensemble, consensus and data fusion techniques
title_full Analysing functional genomics data using novel ensemble, consensus and data fusion techniques
title_fullStr Analysing functional genomics data using novel ensemble, consensus and data fusion techniques
title_full_unstemmed Analysing functional genomics data using novel ensemble, consensus and data fusion techniques
title_short Analysing functional genomics data using novel ensemble, consensus and data fusion techniques
title_sort analysing functional genomics data using novel ensemble, consensus and data fusion techniques
topic Gene
protein
expression
microarray analysis
pathway analysis
omics
machine learning
network analysis
topology
clustering
prediction
feature selection
url https://eprints.nottingham.ac.uk/12727/