Statistical analysis of proteomic mass spectrometry data

Handley, Kelly (2007) Statistical analysis of proteomic mass spectrometry data. PhD thesis, University of Nottingham.

Abstract

This thesis considers the statistical modelling and analysis of proteomic mass spectrometry data. Proteomics is a relatively new field of study, and tried and tested methods of analysis do not yet exist. Mass spectrometry output is high-dimensional, so we first develop an algorithm to identify peaks in the spectra in order to reduce the dimensionality of the datasets. We use the resulting peaks, together with a variety of classification methods, to examine the classification of new spectra based on a training set. Another way to reduce the complexity of the problem is to fit a parametric model to the data. We model the data as a mixture of Gaussian peaks with parameters representing the peak locations, heights and variances, and apply a Bayesian Markov chain Monte Carlo (MCMC) algorithm to estimate them. The resulting estimates are used to identify m/z values at which differences between groups are apparent, where the m/z value of an ion is its mass divided by its charge. A multilevel modelling framework is also considered in order to incorporate the structure in the data, and locations exhibiting differences are again identified.
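As a rough illustration of the two dimension-reduction ideas mentioned in the abstract, the sketch below evaluates a spectrum as a sum of Gaussian-shaped peaks (each with a location, height and variance) and then picks out peaks as local maxima above a threshold. It is not the algorithm from the thesis: the function names, the unnormalised peak shape, the threshold rule and all numerical values are assumptions made for the example, and the MCMC estimation step is not shown.

```python
import math

def gaussian_peak_mixture(mz, locations, heights, variances):
    """Intensity at m/z value `mz` under a sum of Gaussian-shaped peaks.

    Assumed form: each peak contributes height * exp(-(mz - location)^2 / (2 * variance)).
    """
    return sum(
        h * math.exp(-((mz - mu) ** 2) / (2.0 * var))
        for mu, h, var in zip(locations, heights, variances)
    )

def local_maxima(intensities, threshold=0.0):
    """Indices whose intensity exceeds both neighbours and `threshold`.

    A deliberately simple stand-in for a peak-identification step.
    """
    return [
        i for i in range(1, len(intensities) - 1)
        if intensities[i] > threshold
        and intensities[i] > intensities[i - 1]
        and intensities[i] > intensities[i + 1]
    ]

# Hypothetical example: two peaks on a grid of m/z values from 1000 to 1100.
grid = [1000.0 + 0.5 * k for k in range(201)]
locations, heights, variances = [1020.0, 1070.0], [5.0, 3.0], [4.0, 9.0]
spectrum = [gaussian_peak_mixture(x, locations, heights, variances) for x in grid]

# Reduce the 201-point spectrum to the m/z values of its detected peaks.
peak_mz = [grid[i] for i in local_maxima(spectrum, threshold=0.5)]
```

In this toy setting the full spectrum is summarised by a handful of peak locations, which is the kind of reduced representation that could then be passed to a classifier.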

We consider two mass spectrometry datasets in detail. The first consists of mass spectra from breast cancer cells which either have or have not been treated with the chemotherapeutic agent Taxol. The second consists of mass spectra from melanoma cells classified as stage I or stage IV using the TNM system. Using the MCMC and multilevel techniques described above, we show that, in both datasets, small subsets of the available m/z values can be identified which exhibit significant differences in protein expression between groups. We also find that good classification of new data can be achieved using a small number of m/z values, and that the classification rate does not fall greatly compared with results from the complete spectra. For both datasets we compare our results with those in the literature which use other techniques on the same data. We conclude by discussing potential areas for further research.

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Dryden, Ian L.; Browne, William J.
Keywords: Markov chain Monte Carlo, MCMC, multilevel modelling, classification, high-dimensional, bioinformatics
Subjects: Q Science > QA Mathematics > QA276 Mathematical statistics
Q Science > QH Natural history. Biology > QH301 Biology (General)
Q Science > QP Physiology > QP501 Animal biochemistry
Faculties/Schools: UK Campuses > Faculty of Science > School of Mathematical Sciences
Item ID: 10287
Depositing User: EP, Services
Date Deposited: 22 Oct 2007
Last Modified: 19 Oct 2017 11:30
URI: https://eprints.nottingham.ac.uk/id/eprint/10287
