Statistical shape analysis of large molecular data sets

Hennessey, Anthony (2018) Statistical shape analysis of large molecular data sets. PhD thesis, University of Nottingham.

Abstract

Protein classification databases are widely used in the prediction of protein structure and function, and amongst these databases the manually-curated Structural Classification of Proteins database (SCOP) is considered to be a gold standard. In SCOP, functional relationships are described by hyperfamily and superfamily categories and structural relationships are described by family, species and protein categories. We present a method to calculate a difference measure between pairs of proteins that can be used to reproduce SCOP2 structural relationship classifications, and that can also be used to reproduce a subset of functional relationship classifications at the superfamily level.

Calculating the difference measure requires first finding the best correspondence between atoms in two protein configurations. The problem of finding the best correspondence is known as the unlabelled, partial matching problem. We consider the unlabelled, partial matching problem through a detailed analysis of the approach presented in Green and Mardia (2006). Using this analysis, and applying domain-specific constraints, we develop a new algorithm called GProtA for protein structure alignment. The proposed difference measure is constructed from the root mean squared deviation of the aligned protein structures and a binary similarity measure, where the binary similarity measure takes into account the proportions of atoms matching from each configuration.
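As an illustration only, the two ingredients named above can be sketched in Python. The abstract does not state how the root mean squared deviation and the binary similarity measure are combined, so the functions `match_similarity` and `difference` are hypothetical: the sketch assumes the Ochiai coefficient (the geometric mean of the two matched proportions) as the binary similarity and divides the RMSD by it. It also assumes the alignment step (for example GProtA) has already produced matched, superposed atom coordinates.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between matched, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty lists of matched atoms")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def match_similarity(n_matched, n_atoms_a, n_atoms_b):
    """Assumed binary similarity: the Ochiai coefficient, i.e. the geometric
    mean of the proportions of atoms matched from each configuration."""
    return math.sqrt((n_matched / n_atoms_a) * (n_matched / n_atoms_b))


def difference(coords_a, coords_b, n_atoms_a, n_atoms_b):
    """Hypothetical combination (not taken from the thesis): RMSD divided by
    the similarity, so poor coverage inflates the difference."""
    similarity = match_similarity(len(coords_a), n_atoms_a, n_atoms_b)
    return rmsd(coords_a, coords_b) / similarity


# Two matched atoms out of four in each protein, each pair 1 unit apart.
matched_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
matched_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(difference(matched_a, matched_b, n_atoms_a=4, n_atoms_b=4))
```

Under this assumed combination, two alignments with the same RMSD are ranked by coverage: the one matching a larger proportion of atoms from both configurations receives the smaller difference.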

The GProtA algorithm and difference measure are applied to protein structure data taken from the Protein Data Bank. When clustered using the difference measure, 63 of a set of 72 proteins are assigned to the correct SCOP family categories. Of the remaining 9 proteins, 2 are assigned incorrectly and 7 are considered indeterminate. In addition, a method for deriving characteristic signatures for categories is proposed. The signatures offer a mechanism by which a single comparison can be made to judge similarity to a particular category. Comparison using characteristic signatures is shown to correctly delineate proteins at the family level, including the identification of both families for a subset of proteins described by two family level categories.
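The abstract does not say which clustering method is applied to the difference measure. As a minimal sketch, assuming average-linkage agglomerative clustering on a precomputed matrix of pairwise differences, grouping could look like the following. The function name `average_linkage` and the toy matrix are illustrative and are not taken from the thesis.

```python
def average_linkage(dist, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise differences.
    """
    if not 1 <= n_clusters <= len(dist):
        raise ValueError("n_clusters must be between 1 and the number of items")
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise difference between members of the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy example: proteins 0 and 1 are similar, as are proteins 2 and 3.
toy = [
    [0.0, 0.1, 5.0, 5.0],
    [0.1, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 0.2],
    [5.0, 5.0, 0.2, 0.0],
]
print(average_linkage(toy, 2))
```

The resulting clusters can then be compared with the SCOP family labels to count correct, incorrect and indeterminate assignments.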

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Fallaize, Christopher; Dryden, Ian L.; Le, Huiling
Subjects: Q Science > QA Mathematics > QA 75 Electronic computers. Computer science; Q Science > QP Physiology > QP501 Animal biochemistry
Faculties/Schools: UK Campuses > Faculty of Science > School of Mathematical Sciences
Item ID: 52088
Depositing User: Hennessey, Anthony
Date Deposited: 19 Jul 2018 04:40
Last Modified: 07 May 2020 17:30
URI: https://eprints.nottingham.ac.uk/id/eprint/52088
