High dimensional low sample size data analysis: a probabilistic graphical model based approach for feature extraction and classification

Verghese, Sheena Leeza (2020) High dimensional low sample size data analysis: a probabilistic graphical model based approach for feature extraction and classification. PhD thesis, University of Nottingham.

PDF (Thesis - as examined) - Repository staff only
Available under Licence Creative Commons Attribution.

Abstract

High dimensional low sample size (HDLSS) datasets are datasets which contain many features but a limited number of samples. They are commonly found in microarray data and medical imaging (Hall et al.). Most algorithms were not designed with HDLSS data in mind, and as a result, predictions made on these datasets perform poorly. Classification algorithms select and discriminate between features to obtain good predictions, and most assume a Euclidean space. In Euclidean space, under HDLSS asymptotics, data points drawn independently from a distribution form an equidistant simplex. Hence, using Euclidean distance metrics, as is commonly done in metric learning and K-Nearest Neighbour, becomes difficult. Classifiers such as Support Vector Machines (SVM) suffer from data piling: support vectors tend to pile on one another, making it difficult to find the optimal hyperplane. Algorithms such as neural networks and manifold learning require reasonably large numbers of samples to model the data accurately. Furthermore, although neural networks can capture the intrinsic dimensionality of the data, they suffer from the black box problem, which reduces the interpretability of the model. In medical and biological fields, it is desirable to be able to map predictions back to the input features that contribute to them. Classifiers also tend to perform poorly because the structure of the data is misrepresented when the density estimation degenerates, which occurs due to the lack of samples. Therefore, the challenges faced by HDLSS data for classification are twofold: i) the high dimensionality of the data, and ii) the low sample size of the data.
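
As a quick illustration of the equidistance phenomenon described above, the following minimal sketch (a hypothetical simulation, not taken from the thesis) draws i.i.d. Gaussian points and shows the relative spread of pairwise Euclidean distances shrinking as the dimensionality grows while the sample size stays fixed:

    # Minimal sketch (assumes i.i.d. standard Gaussian data): as the number of
    # features d grows with the sample size fixed, pairwise Euclidean distances
    # concentrate around a common value, approaching an equidistant simplex.
    import numpy as np

    rng = np.random.default_rng(0)

    def distance_spread(n_samples, n_features):
        """Relative spread (std/mean) of all pairwise Euclidean distances."""
        X = rng.standard_normal((n_samples, n_features))
        dists = np.asarray([np.linalg.norm(X[i] - X[j])
                            for i in range(n_samples)
                            for j in range(i + 1, n_samples)])
        return dists.std() / dists.mean()

    for d in (10, 1_000, 100_000):
        print(f"d={d:>7}: relative spread = {distance_spread(20, d):.4f}")
    # The ratio shrinks toward 0, so Euclidean distances carry less and less
    # discriminative information between the 20 points.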

In our research, we deal only with the high dimensionality problem of HDLSS data; generating additional samples to increase the sample size is outside the scope of this research. To deal with high dimensionality for classification, common techniques in the literature include i) preprocessing the feature space through feature extraction or selection prior to classification, and ii) using models with latent structures to capture the intrinsic dimensionality of the data. Feature selection and extraction methods explored previously deal with high dimensional asymptotics, meaning that although the dimensionality is high, the number of samples is larger than the dimensionality. In Euclidean space, unlike points in HDLSS data, which become equidistant, such data points lie on a narrow strip on the Euclidean sphere. Classical feature extraction methods such as Principal Component Analysis (PCA), which work well under high dimensional asymptotics, become unstable in the HDLSS case, as the eigenvalues can become zero (Bishop, 2006). Because the data tend to have correlated features, finding higher order interactions may be fruitful. Brown et al. (2012) studied the application of information theory based feature selection methods and found that only some of them perform well in HDLSS circumstances. In light of this, we use Correlation Explanation (CorEx) (Ver Steeg and Galstyan, 2014a), which is reported to handle high dimensional low sample size data by using marginals, rather than the full probability distribution of the data, to estimate the probability density function.
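
To make the PCA degeneration concrete, here is a minimal sketch (assuming i.i.d. Gaussian data; not from the thesis) of the rank argument: the sample covariance of n points in d > n dimensions has rank at most n - 1, so all trailing eigenvalues are exactly zero and the corresponding principal directions are arbitrary:

    # Minimal sketch: with n samples in d > n dimensions, the d x d sample
    # covariance has rank at most n - 1, so PCA's trailing eigenvalues vanish.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 15, 500                           # HDLSS: far more features than samples
    X = rng.standard_normal((n, d))

    cov = np.cov(X, rowvar=False)            # d x d sample covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, sorted descending

    print("rank of covariance:", np.linalg.matrix_rank(cov))  # at most n - 1 = 14
    print("eigenvalues 13-17:", np.round(eigvals[12:17], 6))  # drops to ~0 past the rank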

Correlation Explanation is an unsupervised probabilistic graphical model based on information theory and, unlike neural networks, it is easily interpretable. Using CorEx, we address the high dimensionality challenge in HDLSS classification by i) proposing feature selection and extraction under the discriminant classification model, and ii) proposing an inference algorithm under the generative classification model. To realise these two methods, we needed to model the CorEx algorithm such that class labels can be incorporated; we call this model CorEx-C. The CorEx-C model is able to accurately determine features that are related to the class label in HDLSS data, provided its assumptions are not violated. Using the CorEx-C model, the proposed feature extraction and selection methods are competitive with existing feature selection methods that work with HDLSS data. In the literature, current feature selection and extraction methods cater mainly for high dimensional data; we show, through a comparison of some existing methods, that not all of them can be directly extended to HDLSS settings. The proposed feature selection and extraction method extends this line of research towards HDLSS settings, and it highlights that more research is required to circumvent the curse of dimensionality in the HDLSS setting. Finally, our proposed inference algorithm (CorEx-Ci) was constructed using mean field approximation. Current classification techniques generally require preprocessing the data with dimensionality reduction when the dimensionality is very high, and they are known to have many drawbacks in the HDLSS setting, as they were not built with HDLSS data in mind. The exceptions are Distance Weighted Discrimination and Maximal Data Piling, which were created for the HDLSS domain and were proposed in the literature as extensions of the SVM algorithm for HDLSS data. Our proposed inference algorithm tackles inference for HDLSS data from the vantage point of a probabilistic graphical model under the generative classification umbrella, and it performs comparably with other classification techniques. We ran controlled experiments using simulated data to test the capability of several classification models and CorEx-Ci as the sample size and the number of features vary. Increasing the sample space and feature space causes only slight changes in the performance of CorEx-Ci. However, CorEx-Ci has a slightly higher asymptotic error than other classification algorithms, possibly due to violation of the data generation assumption. Future work would be to relax this assumption and to add constraints to the training phase to accommodate it.
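
For context, CorEx is built around total correlation, TC(X) = sum_i H(X_i) - H(X), the multivariate dependence it explains through latent factors estimated from marginals. The sketch below (a hypothetical example on synthetic binary data, not the thesis implementation of CorEx-C or CorEx-Ci) estimates TC empirically; a latent factor Y that captures the hidden common cause would drive TC(X | Y) toward zero, which is the quantity a CorEx layer optimizes:

    # Minimal sketch of total correlation, TC(X) = sum_i H(X_i) - H(X), on
    # synthetic binary data where three features share one hidden cause z.
    import numpy as np
    from collections import Counter

    def entropy(labels):
        """Empirical Shannon entropy (in bits) of a sequence of discrete symbols."""
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def total_correlation(X):
        """TC(X) from empirical marginals and the empirical joint over rows."""
        marginal_sum = sum(entropy(X[:, j]) for j in range(X.shape[1]))
        joint = entropy(map(tuple, X))        # joint entropy of full rows
        return marginal_sum - joint

    rng = np.random.default_rng(0)
    z = rng.integers(0, 2, size=1000)         # hidden common cause
    X = np.column_stack([z ^ (rng.random(1000) < 0.1)   # noisy copies of z
                         for _ in range(3)]).astype(int)
    print(f"TC = {total_correlation(X):.3f} bits")  # > 0: features share information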

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Liao, Iman Yi
Maul, Tomas
Chong, Siang Yew
Keywords: high dimensional low sample size, classification, correlation explanation, probabilistic graphical model, dimensionality reduction
Subjects: Q Science > QA Mathematics
Faculties/Schools: University of Nottingham, Malaysia > Faculty of Science and Engineering — Science > School of Computer Science
Item ID: 61018
Depositing User: Verghese, Sheena
Date Deposited: 27 Jul 2020 09:03
Last Modified: 30 Dec 2021 04:30
URI: https://eprints.nottingham.ac.uk/id/eprint/61018
