A Random Forest Approach to Understanding CRISPR-Cas Associations in Bacteria

Millar, Kyle Jamie (2025) A Random Forest Approach to Understanding CRISPR-Cas Associations in Bacteria. MRes thesis, University of Nottingham.
Abstract

CRISPR-Cas systems are an important defence mechanism found in bacteria and archaea. They adapt to and defend against attacks from mobile genetic elements (MGEs), and they also have many uses outside genomic defence. For example, certain types of CRISPR-Cas proteins allow eukaryotic genomes to be modified in vivo. Applications and algorithms already exist that can find CRISPR-Cas types and their associated arrays; however, predicting whether a genome contains a CRISPR-Cas locus purely from its background genome content would offer a faster query time. To test whether the presence of CRISPR-Cas systems could be predicted from the background genome, a Random Forest algorithm was applied to a large data set: a bacterial pangenome containing 9,689 genomes. The pangenome was annotated with 'CRISPR' identifiers using the annotation tool Bakta, and custom scripts were then used to extract the relevant information from the annotated genomes. The algorithm achieved an accuracy of 0.89 and an AUC-ROC score of 0.96. These results indicate a strong ability to classify genomes correctly based on background genome content. The algorithm calculated the 'feature importance' of every gene present in the pangenome; the gene of highest importance was 'pbp4b', followed closely by 'csy3' (a positive control variable). The ten genes with the highest feature importance all had a statistically significant association with CRISPR-Cas systems when evaluated using chi-squared tests. The algorithm was capable of predicting CRISPR-Cas systems in γ-proteobacteria, and the genes it ranks highly are candidates for further research into CRISPR-Cas associations. This approach could be used to predict CRISPR-Cas systems more broadly across prokaryotic life as data become available.
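The workflow summarised in the abstract (train a Random Forest on gene presence/absence, report accuracy and AUC-ROC, rank genes by feature importance, then check the top gene with a chi-squared test) can be illustrated with a minimal Python sketch. This is not the thesis code. It assumes scikit-learn, NumPy, and SciPy are available, and it uses a small synthetic matrix in place of the Bakta-annotated pangenome. The gene names, the label rule, the 10% noise level, and the hyperparameters are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a gene presence/absence matrix
# (rows: genomes, columns: genes). Gene names are hypothetical.
n_genomes, n_genes = 600, 40
gene_names = [f"gene_{i}" for i in range(n_genes)]
X = rng.integers(0, 2, size=(n_genomes, n_genes))

# Assumed label rule: CRISPR-Cas presence follows gene_0,
# with roughly 10% of labels flipped to act as noise.
flip = rng.random(n_genomes) < 0.10
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Hold out a stratified test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Accuracy uses hard predictions; AUC-ROC uses the predicted
# probability of the positive (CRISPR-Cas present) class.
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Rank genes by impurity-based feature importance, highest first.
order = np.argsort(model.feature_importances_)[::-1]
top_genes = [gene_names[i] for i in order[:10]]


def gene_association_p_value(gene_index):
    """Chi-squared test on the 2x2 table of gene presence vs. label."""
    col = X[:, gene_index]
    table = [
        [int(np.sum((col == g) & (y == c))) for c in (0, 1)]
        for g in (0, 1)
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


top_p_value = gene_association_p_value(order[0])

print(f"accuracy={accuracy:.2f}  AUC-ROC={auc:.2f}")
print("top genes by importance:", top_genes)
print(f"chi-squared p-value for {top_genes[0]}: {top_p_value:.3g}")
```

On real data, `X` would be built from the annotated pangenome and `y` from the detected CRISPR-Cas loci. Impurity-based importance is only one possible ranking; the abstract does not say which importance measure was used.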