Investigating ensemble methods for essential gene predictions in bacteria

Patel, Vanisha (2022) Investigating ensemble methods for essential gene predictions in bacteria. MPhil thesis, University of Nottingham.

PDF (Corrected MPhil Thesis) (Thesis - as examined), 11MB
PDF (Corrections Summary) (Thesis - as examined), 39kB
Available under Licence Creative Commons Attribution.

Abstract

Essential genes are the genes required for an organism to survive in stable conditions with an abundance of nutrients. Identifying them is important both for understanding bacterial organisms and for manipulating them. Many machine learning methods have been proposed for predicting essential genes. However, most of these studies have a limited focus: a single optimised classifier and feature set combination used to predict genes within the same organism. Because of this narrow scope, the models cannot be reliably applied to newly sequenced organisms. A model's ability to generalise to new data can be improved by enlarging the dataset and by combining results from different classifiers.

The aim of this thesis was to develop an ensemble method to predict essential genes in bacteria. In total, 62 commonly used sequence-based features and 7 supervised learning classifiers were identified from the literature. Using online databases, 73 studies with high-quality laboratory essentiality data were collated for 45 bacterial strains. To build the ensemble base learners, feature selection algorithms were used to generate feature subsets. Analysis of the subsets showed that, while particular features were selected more frequently by the algorithms, no features were completely excluded. The performance of each subset with the classifiers was then investigated to identify feature sets for the ensemble base learners.
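As an illustration of the subset analysis described above, the following minimal sketch counts how often each feature appears across the subsets produced by different selection algorithms and lists any feature that no algorithm selected. The feature names and subsets are hypothetical placeholders, not the thesis data, and the thesis may have carried out this analysis differently.

```python
from collections import Counter

def selection_frequency(subsets):
    """Count the number of subsets in which each feature appears."""
    counts = Counter()
    for subset in subsets:
        # set() guards against a feature being listed twice in one subset
        counts.update(set(subset))
    return counts

def never_selected(all_features, subsets):
    """Return the features that appear in none of the subsets."""
    counts = selection_frequency(subsets)
    return [f for f in all_features if counts[f] == 0]

# Hypothetical feature names and subsets, for illustration only
all_features = ["gc_content", "gene_length", "codon_adaptation_index", "strand"]
subsets = [
    ["gc_content", "gene_length"],
    ["gene_length", "codon_adaptation_index"],
    ["gene_length", "strand", "gc_content"],
]

freq = selection_frequency(subsets)
excluded = never_selected(all_features, subsets)
```

In this toy input, some features are chosen more often than others while none is excluded entirely, which is the pattern the abstract reports for the real subsets.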

By studying the performance of the feature sets as part of a majority voting ensemble algorithm, we showed that, under cross-validation, the ensemble approach performed better than the individual classifiers. This was confirmed through validation testing on organisms with no matching genus in the training data.
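The two ideas in this paragraph, majority voting and validation on an unseen genus, can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: labels are assumed to be binary (1 = essential, 0 = non-essential), ties are assumed to resolve to non-essential, and the function names, record fields and example data are hypothetical.

```python
def majority_vote(predictions):
    """Combine base learner outputs by hard majority voting.

    predictions: one list of 0/1 labels per base learner, all the same length.
    Assumption: a tie is resolved to 0 (non-essential).
    """
    n_learners = len(predictions)
    n_genes = len(predictions[0])
    voted = []
    for i in range(n_genes):
        essential_votes = sum(p[i] for p in predictions)
        voted.append(1 if essential_votes * 2 > n_learners else 0)
    return voted

def holdout_by_genus(records, held_out_genus):
    """Split records so that no training record shares the held-out genus."""
    train = [r for r in records if r["genus"] != held_out_genus]
    test = [r for r in records if r["genus"] == held_out_genus]
    return train, test

# Hypothetical predictions from three base learners for four genes
base_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
ensemble = majority_vote(base_predictions)

# Hypothetical gene records tagged with the genus of their source organism
records = [
    {"gene": "g1", "genus": "Escherichia"},
    {"gene": "g2", "genus": "Bacillus"},
    {"gene": "g3", "genus": "Escherichia"},
]
train, test = holdout_by_genus(records, "Bacillus")
```

With an odd number of base learners, such as the 7 classifiers mentioned in the abstract, ties cannot occur for binary labels, so the tie-breaking assumption only matters for even-sized ensembles.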

The results show that it is possible to improve the ability of a classifier to generalise to new organisms through the application of feature selection and ensemble learning.

Item Type: Thesis (University of Nottingham only) (MPhil)
Supervisors: Twycross, Jamie
Keywords: synthetic biology, computational biology, essential genes, ensemble method
Subjects: Q Science > QH Natural history. Biology > QH426 Genetics
Faculties/Schools: UK Campuses > Faculty of Science > School of Computer Science
Item ID: 69028
Depositing User: Patel, Vanisha
Date Deposited: 02 Aug 2022 04:40
Last Modified: 02 Aug 2022 04:40
URI: https://eprints.nottingham.ac.uk/id/eprint/69028
