Large scale data mining challenge: Contact map prediction

Bhattacharya, Abhishek (2013) Large scale data mining challenge: Contact map prediction. [Dissertation (University of Nottingham only)]



Proteins are among the most important molecules in living organisms, where they perform a wide variety of functions. A protein is a chain of amino acids that folds into a three-dimensional structure, and that structure determines the function it performs. Protein structure prediction (PSP), the prediction of this 3D structure, is one of the key unsolved problems in bioinformatics. PSP is usually approached with a divide-and-conquer strategy: rather than predicting the full structure directly, methods are developed to predict simpler structural characteristics of the protein. Protein secondary structure prediction is the most successful of these sub-problems. Contact map prediction, which predicts whether or not each pair of residues is in contact, is a crucial stepping stone for PSP because these contacts determine the shape of the protein. PSP is a very challenging task, and contacts between residues are predicted by machine learning methods that combine many sources of information without a clear understanding of how much each source contributes to the prediction.
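To make the notion of a contact map concrete, the sketch below derives one from known 3D coordinates. It is an illustration only, not the Nottingham predictor: the function name `contact_map`, the 8 Å distance threshold, the use of one representative atom per residue, and the toy coordinates are all assumptions introduced here. A predictor has to estimate this matrix from sequence-derived information alone.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Build a symmetric boolean contact map from per-residue coordinates.

    coords: one (x, y, z) tuple per residue, in angstroms (assumed to be
            a single representative atom per residue).
    threshold: distance below which two residues count as in contact
               (8.0 is an assumed, commonly used value).
    min_separation: minimum distance along the sequence for a pair to
                    be considered at all.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                # Contacts are symmetric: (i, j) implies (j, i).
                cmap[i][j] = cmap[j][i] = True
    return cmap


# Hypothetical toy chain of five residues (not taken from a real protein).
toy = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]
cmap = contact_map(toy, threshold=8.0, min_separation=2)

for row in cmap:
    print("".join("#" if in_contact else "." for in_contact in row))
```

Each `#` in the printed grid marks a residue pair treated as being in contact. Raising `min_separation` restricts the map to pairs that are far apart in the sequence, which are generally the harder and more informative ones to predict.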

A contact map predictor developed at the University of Nottingham has for a few years been one of the top predictors in the world. In this project, we thoroughly reassess all the sources of information used by the Nottingham method and evaluate some other sources that have recently appeared in the literature. A series of experiments, performed in two stages, allowed us to determine the contribution of each existing source and to evaluate the new ones.

Results were thoroughly assessed using the CASP evaluation rules and produced some remarkable findings, notably that solvent accessibility had a negative effect on prediction. The rule sets generated by the machine learning system were analysed to gain meaningful insight into the prediction process. Finally, we validated some new sources of information that have been reported in the literature to improve predictive power.

Item Type: Dissertation (University of Nottingham only)
Depositing User: Gonzalez-Orbegoso, Mrs Carolina
Date Deposited: 25 Nov 2015 11:49
Last Modified: 19 Oct 2017 15:06
