Soria, Daniele
(2010)
Novel methods to elucidate core classes in multi-dimensional biomedical data.
PhD thesis, University of Nottingham.
Abstract
Breast cancer, the most common cancer in women, is a complex disease characterised by multiple molecular alterations. Current routine clinical management relies on the availability of robust clinical and pathological prognostic and predictive factors, such as the Nottingham Prognostic Index, to support decision making. Recent advances in high-throughput molecular technologies have provided evidence of the biological heterogeneity of breast cancer.
This thesis is a multi-disciplinary work involving both computer scientists and molecular pathologists. It focuses on the development of advanced computational models for classifying breast cancer into sub-types of the disease based on protein expression levels of selected markers. A previous study conducted at the University of Nottingham suggested that immunohistochemical analysis may be used to identify distinct biological classes of breast cancer.
The objectives of this work related to both clinical and technical aspects. From a clinical point of view, the aim was to encourage a multi-technique approach when dealing with classification and clustering. From a technical point of view, one of the goals was to verify the stability of groups obtained from different unsupervised clustering algorithms applied to the same data, and to compare and combine the different solutions with those available from the previous study. These aims and objectives were pursued in an attempt to fill a number of gaps in the body of knowledge. Several research questions were raised, including how to combine the results obtained by a multi-technique approach to clustering and whether the medical decision-making process could be moved in the direction of personalised healthcare.
An original framework to identify core representative classes in a dataset was developed and is described in this thesis. Using different clustering algorithms and several validity indices to explore the best number of groups into which to split the data, a set of classes may be defined by considering those points that remain stable across different clustering techniques. This set of representative classes may then be characterised using standard statistical techniques and validated using supervised learning. Each step of this framework has been studied separately, resulting in different chapters of this thesis. The whole approach has been successfully applied to a novel set of histone markers for breast cancer provided by the School of Pharmacy at the University of Nottingham. Although further tests are needed to validate and improve the proposed framework, these results make it a good candidate for transfer to real-world medical decision making.
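The idea of keeping only the points that remain stable across clustering techniques can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `align_labels` and `core_classes`, the toy label vectors, and the brute-force relabelling step are all assumptions made for illustration. Because cluster identifiers are arbitrary, each solution is first relabelled to agree as closely as possible with a reference solution; a point is then kept in a core class only if every solution assigns it the same label.

```python
from itertools import permutations


def align_labels(reference, labels, k):
    """Relabel `labels` to agree with `reference` as much as possible.

    Assumption for illustration: every permutation of the k cluster ids
    is tried, which is only practical for a small number of clusters.
    """
    best_perm, best_agree = None, -1
    for perm in permutations(range(k)):
        agree = sum(1 for r, l in zip(reference, labels) if perm[l] == r)
        if agree > best_agree:
            best_perm, best_agree = perm, agree
    return [best_perm[l] for l in labels]


def core_classes(labelings, k):
    """Return one label per point, or None where the solutions disagree."""
    reference = labelings[0]
    aligned = [reference] + [align_labels(reference, lab, k) for lab in labelings[1:]]
    core = []
    for votes in zip(*aligned):
        # A point belongs to a core class only if all solutions agree on it.
        core.append(votes[0] if len(set(votes)) == 1 else None)
    return core


# Hypothetical output of three clustering algorithms on eight points (k = 2).
solution_a = [0, 0, 0, 0, 1, 1, 1, 1]
solution_b = [1, 1, 1, 0, 0, 0, 0, 0]  # ids swapped relative to a; point 3 differs
solution_c = [0, 0, 0, 0, 0, 1, 1, 1]  # point 4 differs

core = core_classes([solution_a, solution_b, solution_c], k=2)
print(core)  # [0, 0, 0, None, None, 1, 1, 1]
```

In this toy case, points 3 and 4 are left unassigned because the three solutions disagree on them, while the remaining six points form two core classes.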
This work makes two further contributions to knowledge. Firstly, six breast cancer subtypes have been identified using consensus clustering and characterised in terms of clinical outcome. Two of these classes were new to the literature. The second contribution is related to supervised learning. A novel method, based on the naive Bayes classifier, was developed to cope with the non-normality of covariates in many real-world problems. This algorithm was validated on known data sets and compared with traditional approaches, obtaining better results in two examples.
All these contributions, and especially the novel framework, may also have a clinical impact, as medical care as a whole is gradually moving towards personalisation. By training a small number of doctors, it may be possible for them to use the framework directly and find different sub-types of the disease they are investigating.