Investigating ensemble methods for essential gene predictions in bacteria

Bibliographic Details
Main Author: Patel, Vanisha
Format: Thesis (University of Nottingham only)
Language: English
Published: 2022
Online Access: https://eprints.nottingham.ac.uk/69028/
Description
Summary: Essential genes are the genes required for an organism to survive in stable conditions with an abundance of nutrients. The identification of essential genes is important to both our understanding of bacterial organisms and our ability to manipulate them. Many machine learning methods have been proposed for the prediction of essential genes. However, the majority of these studies have a limited focus, i.e. a single optimised classifier and feature set combination used to predict genes within the same organism. Because these models have a narrow scope, they cannot be reliably applied to newly sequenced organisms. A model's ability to generalise to new data can be improved by enlarging the dataset and combining results from different classifiers. The aim of this thesis was to develop an ensemble method to predict essential genes in bacteria. In total, 62 commonly used sequence-based features and 7 supervised learning classifiers were identified from the literature. Using online databases, 73 studies with high-quality laboratory essentiality data were collated for 45 bacterial strains. To build the ensemble base learners, feature selection algorithms were used to generate feature subsets. Analysis of the subsets showed that while particular features were selected more frequently by the algorithms, no features were completely excluded. The performance of each subset with the classifiers was investigated to identify feature sets for the ensemble base learners. By studying the performance of the feature sets as part of a majority voting ensemble algorithm, we were able to show that, in cross-validation, the ensemble approach outperformed the individual classifiers. This was confirmed through validation testing on organisms with no matching genus in the training data. The results show that it is possible to improve the ability of a classifier to generalise to new organisms through the application of feature selection and ensemble learning.
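The summary describes base learners built on different feature subsets, combined by majority voting and checked on organisms whose genus is absent from the training data. The following is a minimal Python sketch of those two ideas only, not the thesis implementation. The helper names, the toy threshold rules standing in for trained classifiers, the feature values, and the example records are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base-learner predictions.

    With an odd number of binary voters there are no ties; otherwise the
    label encountered first among the tied labels wins.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(base_learners, feature_vector):
    """Predict one gene's label with a hard-voting ensemble.

    Each base learner is a (feature_indices, classifier) pair, and each
    classifier only sees the features in its own subset.
    """
    votes = []
    for feature_indices, classifier in base_learners:
        subset = [feature_vector[i] for i in feature_indices]
        votes.append(classifier(subset))
    return majority_vote(votes)


def split_by_genus(records, held_out_genera):
    """Keep every record from a held-out genus out of the training set."""
    train = [r for r in records if r["genus"] not in held_out_genera]
    test = [r for r in records if r["genus"] in held_out_genera]
    return train, test


# Hypothetical base learners: simple threshold rules stand in for trained
# classifiers (1 = essential, 0 = non-essential).
base_learners = [
    ((0, 1), lambda x: int(x[0] + x[1] > 1.0)),
    ((1, 2), lambda x: int(x[0] > 0.5)),
    ((0, 2), lambda x: int(x[1] > 0.7)),
]

# Hypothetical, already-scaled feature values for a single gene.
gene = [0.6, 0.8, 0.2]
prediction = ensemble_predict(base_learners, gene)

# Hypothetical records; the genus labels are only illustrative.
records = [
    {"gene": "g1", "genus": "Escherichia"},
    {"gene": "g2", "genus": "Bacillus"},
    {"gene": "g3", "genus": "Escherichia"},
]
train, test = split_by_genus(records, {"Bacillus"})
```

In this toy case two of the three rules vote for class 1, so the ensemble labels the gene as essential, and the single record from the held-out genus ends up in the test set. Splitting by genus rather than by individual gene is what keeps closely related organisms from appearing on both sides of the evaluation.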