Data mining techniques for protein sequence analysis
Hamby, Stephen Edward (2010) Data mining techniques for protein sequence analysis. PhD thesis, University of Nottingham.
This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post-translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles, the addition of structural information is examined, as is the normalisation of the Ψ and Φ dihedral angles. An application of the predicted dihedral angles is then investigated. The relationship between dihedral angles and three-bond J couplings determined from NMR experiments is described by the Karplus equation. Because the Karplus equation can yield more than one dihedral angle for a given coupling, we investigate using predicted Φ dihedral angles to determine the correct solution. Glycosylation is an important post-translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results than state-of-the-art prediction methods. The black-box nature of random forest is addressed by using the Trepan algorithm to generate a decision tree with comprehensible rules that represents the decision-making process of the random forest. Our program, GPP, does not distinguish between the glycans at a given glycosylation site. We therefore use farthest-first clustering, with the idea of classifying each glycosylation site by the sugar linking the glycan to the protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves on the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
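The Karplus step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes the common form J = A cos²θ + B cos θ + C with θ = Φ − 60°, and uses illustrative literature-style coefficients for the HN-Hα coupling; the coefficient values, the phase convention, and the function names are all assumptions introduced here. A measured coupling can correspond to up to four Φ values, and the sketch simply keeps the candidate closest to a predicted Φ.

```python
import math

# Assumed, illustrative Karplus coefficients in Hz for 3J(HN,HA);
# these are not taken from the thesis.
A, B, C = 6.51, -1.76, 1.60
PHASE = -60.0  # assumed convention: theta = phi + PHASE (degrees)


def wrap(angle_deg):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def karplus(phi_deg):
    """Forward Karplus relation: coupling (Hz) for a given phi (degrees)."""
    theta = math.radians(phi_deg + PHASE)
    return A * math.cos(theta) ** 2 + B * math.cos(theta) + C


def candidate_phis(j_hz, tol=1e-9):
    """All phi values (degrees) consistent with a coupling j_hz."""
    # Solve A*x^2 + B*x + (C - J) = 0 for x = cos(theta).
    disc = B * B - 4.0 * A * (C - j_hz)
    if disc < 0.0:
        return []
    roots = {(-B + sign * math.sqrt(disc)) / (2.0 * A) for sign in (1.0, -1.0)}
    phis = set()
    for x in roots:
        if abs(x) > 1.0 + tol:
            continue  # not a valid cosine, so no angle for this root
        x = max(-1.0, min(1.0, x))  # clamp tiny floating-point overshoot
        theta = math.degrees(math.acos(x))
        for t in (theta, -theta):  # cos is even, so both signs are solutions
            phis.add(wrap(t - PHASE))
    return sorted(phis)


def pick_phi(j_hz, predicted_phi_deg):
    """Choose the Karplus solution closest to a predicted phi, or None."""
    candidates = candidate_phis(j_hz)
    if not candidates:
        return None
    return min(candidates, key=lambda p: abs(wrap(p - predicted_phi_deg)))


if __name__ == "__main__":
    j = karplus(-65.0)            # simulate a coupling for a helical phi
    print(candidate_phis(j))      # several phi values share this coupling
    print(pick_phi(j, -70.0))     # a rough prediction selects one of them
```

The selection uses a wrapped angular difference so that candidates near ±180° are compared correctly. In practice the predicted angle carries its own error, so the closest candidate is not guaranteed to be the true one.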