A Random Forest Approach to Understanding CRISPR-Cas Associations in Bacteria

Millar, Kyle Jamie (2025) A Random Forest Approach to Understanding CRISPR-Cas Associations in Bacteria. MRes thesis, University of Nottingham.

PDF (Thesis - as examined)
Available under Licence Creative Commons Attribution.

Abstract

CRISPR-Cas systems are a crucial and intriguing defence mechanism found in bacteria and archaea. They can adapt to and defend against attacks from mobile genetic elements (MGEs), and they also have many uses outside of genomic defence. For example, certain types of CRISPR-Cas proteins allow for modification of eukaryotic genomes in vivo. Existing applications and algorithms can find CRISPR-Cas types and their associated arrays; however, predicting whether a genome might contain a CRISPR-Cas locus purely from the background genome content offers a faster query time. To test whether the presence of CRISPR-Cas systems could be predicted from the background genome, a Random Forest algorithm was applied to a large data set: a bacterial pangenome containing 9,689 genomes. The pangenome was annotated with 'CRISPR' identifiers using the annotation tool Bakta, and custom scripts were then used to extract the relevant information from the annotated genomes. The algorithm achieved an accuracy of 0.89 and an AUC-ROC score of 0.96. These results indicate a strong ability to classify genomes correctly based on background genome content. The algorithm calculated the 'feature importance' of all genes present in the pangenome; the gene of highest importance was 'pbp4b', followed closely by 'csy3' (a positive control variable). The ten genes with the highest feature importance all had a statistically significant association with CRISPR-Cas systems when evaluated using chi-squared tests. The algorithm was capable of predicting CRISPR-Cas systems in γ-proteobacteria and identifies candidate genes for further research into CRISPR-Cas associations. This approach could be used to predict CRISPR-Cas more broadly across prokaryotic life as data become available.
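
The workflow described in the abstract (train a Random Forest on gene content, report accuracy and AUC-ROC, rank genes by feature importance, then check the top gene with a chi-squared test) can be sketched roughly as follows. This is an illustrative sketch only, not the thesis's code: the data are synthetic, the gene names (`gene_0`, ...) and the label rule are hypothetical, and scikit-learn and SciPy are assumed here because the abstract does not say which libraries were used.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a gene presence/absence matrix:
# rows are genomes, columns are genes (1 = present, 0 = absent).
n_genomes, n_genes = 600, 40
gene_names = [f"gene_{i}" for i in range(n_genes)]
X = rng.integers(0, 2, size=(n_genomes, n_genes))

# Hypothetical label (1 = genome carries CRISPR-Cas). By construction it
# follows gene_0, with roughly 10% of labels flipped to mimic noise.
flip = rng.random(n_genomes) < 0.10
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Hold out a quarter of the genomes for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Accuracy uses hard predictions; AUC-ROC uses the predicted
# probability of the positive class.
accuracy = accuracy_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Rank genes by impurity-based feature importance, highest first.
order = np.argsort(clf.feature_importances_)[::-1]
top_genes = [gene_names[i] for i in order[:10]]

# Chi-squared test of association between the top-ranked gene and the
# label, using a 2x2 table of gene presence/absence against
# CRISPR-Cas presence/absence.
g = X[:, order[0]]
table = np.array([
    [np.sum((g == 1) & (y == 1)), np.sum((g == 1) & (y == 0))],
    [np.sum((g == 0) & (y == 1)), np.sum((g == 0) & (y == 0))],
])
chi2, p_value, dof, _ = chi2_contingency(table)

print(f"accuracy={accuracy:.2f}  AUC-ROC={auc:.2f}")
print("top genes by importance:", top_genes)
print(f"chi-squared={chi2:.1f}  dof={dof}  p={p_value:.3g}")
```

In a sketch like this, a gene that is part of the CRISPR-Cas system itself (as 'csy3' is in the abstract) would play the role of a positive control: it should rank near the top if the model is picking up real signal. Impurity-based importances are only one possible ranking; whether the thesis used them or another measure is not stated in the abstract.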

Item Type: Thesis (University of Nottingham only) (MRes)
Supervisors: McInerney, James
O'Connell, Mary
Ono, Jasmine
Keywords: CRISPR, CRISPR-Cas elements, bacteria, archaea, Cas proteins
Subjects: Q Science > QR Microbiology
Faculties/Schools: UK Campuses > Faculty of Medicine and Health Sciences > School of Life Sciences
Item ID: 81536
Depositing User: Millar, Kyle
Date Deposited: 31 Dec 2025 04:40
Last Modified: 31 Dec 2025 04:40
URI: https://eprints.nottingham.ac.uk/id/eprint/81536
