Artificial Neural Network System for Cell Classification using Single Cell RNA Expression

We implemented an automated system for single-cell classification using artificial neural networks (ANN). Our system takes single-cell gene expression sparse matrices and trains ANN to classify cell types and subtypes. The assemblies of ANNs predict cell classes by voting. We tested the system in a case study where we trained ANNs with a dataset containing approximately 120,000 single cells and tested the resulting model using an independent data set of 13,000 single cells. The overall accuracy of the 5-class classification was 95%. We trained and tested a total of 100 ANNs in 10 cycles. The prediction system demonstrated excellent reproducibility. The analysis of misclassifications indicated that 2% were likely classification errors, while the remaining 3% were likely due to mislabeled types and subtypes in the test set.


INTRODUCTION
Single-cell transcriptomics (SCT) examines gene expression profiles in individual cells. It is used for biomarker discovering and studies of heterogeneity of cells involved in biological processes [1]. Bulk RNA sequencing (RNA-seq) provides information on average gene expression across thousands or even millions of cells from the same sample. Bulk RNA-seq methods cannot capture cellular heterogeneity. The profiles of cell subtypes present in the sample remain unknown. Single-cell RNA sequencing (scRNA-seq) produces a more detailed view of RNAseq data because it enables us to assign counts of expressed genes to individual cells [2].
Within the same cell type, RNA expression can be quite different between individuals. Numerous factors, both biological and technical, influence changes in gene expression [3,4]. Biological factors include cell development stages, interactions between cells, and cellular responses to biological or environmental stimuli. In addition to genetic inheritance factors, gene expression changes may be due to the genetic program embedded inside each cell (ontogeny), development stage, activation history, healthy or disease status, age, and other factors [5]. Technical factors include sample processing and storage conditions. For example, the extraction of peripheral blood mononuclear cells (PBMC), freezing, fluorescence-activated cell sorting (FACS) steps, or enrichment methods will influence gene expression relative to the previous sample processing step. The high variability of scRNA-seq data makes cell classification, biomarker discovery, and identification of developmental trajectories challenging tasks [6].
More than 3000 SCT sparse matrices data sets have been generated to date using 10x GemCode Technology [7] and made publicly available. However, the number of captured genes per cell is limited and, in our estimate, representing anywhere between 2 to 30% of genes expressed in most single cells. Thus, gene expression matrices generated by scRNAseq techniques are sparse and are excellent targets for machine learning and data analytics.
When we train a classification system using instances with known class labels, the learning is called supervised learning. If the instances are unlabeled, the instances are grouped by unsupervised learning. Unsupervised clustering algorithms are used to map items to common classes and discover new, meaningful classes [8].
Current data analytics methods of SCT rely on unsupervised clustering [6]. Unsupervised machine learning methods have a disadvantage that they do not scale up well. Each study requires a combination of manual annotation, and clustering algorithms that perform well on specific datasets may not perform well on datasets from different studies (lack of generalization) [9]. Machine learning requires that every instance (a single cell) in studied datasets is represented using the same set of features (genes in sparse matrices). It is essential that we have standardized data sets for unsupervised machine learning.
Our group has collected, cleaned, and standardized more than 1,000 human datasets [10] and more than 800 datasets of mouse SCT sparse matrices [11]. It is time-consuming to study such large datasets one-by-one, so we have developed a system for automated classification of SCT data. Here we report a system for automation of ANN training and testing using sparse matrices generated by 10x technology.We also report the prediction of single-cell classes from previously unseen data sets.
Our group previously implemented an ANN to classify PBMC into five major cell types: B cells, dendritic cells (DC), monocytes, natural killer (NK) cells, and T cells [10]. The previous study demonstrated that an ANN can be trained for classification of major PBMC cell types. The performance was decent -90% accuracy was achieved in 5-class classification. Here, we report an extended analysis, classification of PBMC data using a more extensive training data set. In this study, we focused on a) assessing the generalization properties of ANN classifiers for SCT classification tasks, b) identifying ways to improve generalizable accuracy of ANN classification, and c) exploring reproducibility of machine learning using assemblies of ANNs.

II. STUDY DESIGN
We defined a standardized string of gene names. It maps gene names from multiple genome assemblies [12] to a common name string. All SCT data used in our study were mapped onto this common list that defined standardized sparse matrix format [10]. Before performing classification and the analysis of classification results, we divided our data into training and testing data sets. The overall design of this study involved four steps: preparation of data sets, building ANN classifier that randomly selects initialization seeds to train multiple ANNs, validation of training, and result analysis ( Fig. 1).

A. Data
Data were extracted from three public SCT data sources, GEO database (GEOS data set) [13], Broad Institute database (BroadS1 and BroadS2 data sets) [14], and the 10x company demonstration data [15]. A total of 52 SCT data sets were collected and prepared for the analysis. Each data set has a metadata description. Metadata descriptions provide information about sample collection, processing, and the conditions of experiments. We labeled each data set to reflect their PBMC cell types and subtypes, T cells, B cells, DC, Monocytes, and NK cells. For classification, each data set was labeled with one of the five cell type labels. For analysis, the cell subtype was also used for the assessment of correct classifications and misclassifications. TABLE I shows the number of data sets used in this study, their cell types and sources. TABLE II shows the total number of cells by cell types and sources. Fig. 1. The overview of this study. In this study we focus mainly on the steps on the right side of the diagram. The improvement strategies will involve the addition of new data sets and cell subtypes, not yet explored in this report.

B. Quality control
All data used in this study passed our internal quality control. Cells that have 300 or more positive features and 670 or more total counts were selected. We determined the 300 and 670 thresholds empirically from observations of feature count distributions -these thresholds are mostly within the linear part of the S-curve representing the indexed list of counts. The high-end thresholds were not applied in this study (e.g. removing cells that have number of features or counts larger than some defined value).

C. Artificial Neural Networks
We used the same architecture, training algorithm, and stopping criteria, as reported earlier [10]. In short, the ANN architecture was 30698-10-5, representing the number of units in input, hidden, and output layers. The MLPClassifier function from Scikit-learn python library was employed to encode a multi-layer perceptron classifier. We used the following parameters: activation: rectified linear unit (ReLU), solver: adam, alpha: 0.0001, batch size: 200, initial learning rate: 0.001. The default values were used for other parameters.
Each training-testing-assessment run consisted of ten cycles. One ANN was trained in each cycle, using a randomly generated initialization seed. This ensured that the trained ANN models were different in each cycle. The results of each run were analyzed for reproducibility. To assess generalization, we performed ten runs of the system and compared the results of all individual cycles within each run. Reprodycibility between cycles was assessed by the comparison of composite results for each cycle. More details can be found in the IMPLEMENTATION section.

D. Assessment of performance
To compare the results of individual cycles, we calculated the values of Precision (PR), Recall (RE), and F1-value for each individual cell class, and the overall accuracy (ACC) across five classes: where TP, FP, and FN stand for true positives, false positives, and false negatives, respectively. ACC is calculated by using sums of TP and FN across the classes.

A. Data management system
Data management for SCT technology is challenging because sparse matrices make Big Data [16]. Because SCT technology is developing rapidly, there is a lack of standardized data sets and of general data standards [17]. In addition, unsupervised machine learning methods, that are methods of choice for SCT data analysis, do not generalize well [18]. Unsupervised methods, therefore, cannot be automatically applied for the analysis of data from multiple studies. The raw data are available in five formats (TXT, CSV, TSV, H5, and MTX). The raw data formats were converted into MTX format (math.nist.gov/MatrixMarket/formats.html) using an in-house software. In a sparse matrix, 95-99% of features are typically zero. In a file with >30,000 rows and >10,000 columns, one CSV file may exceed one Gb. MTX file for sparse matrices stores only non-zero values and their coordinates, thus reducing the memory requirements by approximately an order of magnitude. Files were named so that they inform the user about the cell type, data source, and the cell number in each sparse matrix.

B. ANN prediction system
Our system uses random initialization seeds for ANN to ensure the diversity of initial models. ANN prediction method with only one ANN model may produce a high accuracy model in one study, but it will not necessarily have similar accuracy in other studies. Our system trains multiple ANN models that are not mutually identical to ensure that the assessed accuracy is realistic, and the models generalize well. We performed a case study to test the system and demonstrate prediction capabilities. Training datasets were from 10x, GEO, and BroadS2, while the testing dataset was BroadS1. Our system repeats training and testing cycles 10 times by default. Each of the 10 ANNs was trained and tested using an identical procedure, so they are fully comparable. The sum of the signals from five output units (representing five PBMC classes) in a single ANN system is always one: Bo+DCo+MCo+NKo+To=1. The overall prediction score in our system (using 10 models per cycle) is 10. The final classification result of each 10-models cycle is presented as a set of 11 confusion matrices: one matrix for the overall classification result and one matrix for each individual model. The prediction result can be used for performance assessment and misclassification analysis. The activity diagram of our ANNs prediction system is shown in Fig. 2.

C. Case study
In this study, we predicted five main subtypes of peripheral blood mononuclear cells (PBMC). Our case study has extended the PBMC analysis reported in [10]. We performed ten cycles of analysis, where each cycle had ten training-testing iterations. A total of 100 ANNs were trained in ten prediction cycles, and the results of cycles were used to assess the reproducibility and generalizability of our prediction system.

A. ANN prediction results
The accuracy of individual ANN models ranged from Acc=0.902 to Acc=0.949. The average value of individual accuracy across the 100 ANN models was 0.939±0.010. The overall accuracy of the 10 cycles, based on voting strategy, was between Acc=0.943 and Acc=0.949. The average of all 10 cycle accuracies was Acc=0.947±0.002. The results for Cycle 1 are shown in Fig. 3. The overall accuracy of classification using voting in 10 iterations is higher than the average accuracy of ANNs in each cycle.  3.5% of T cells were misclassified as NK cells. All other classclass misclassifications were below 2.5%. The poor classification of DC can be explained by the small number of DC instances in the training set (only 270 of 124,431 or 0.22% of the total). Furthermore, DC in the training set were composed of a mix of two subtypes (conventional and plasmacytoid DCs), while the test set did not have subclass annotation. Misclassification of NK cells to T cells and vice versa can, at least partially, be explained by the existence of NK-like T cells [19]. Further analysis is needed to resolve classification discrepancies.

I. CONCLUSIONS AND DISCUSSION
We developed and implemented an ANN-based prediction system that deploys an assembly of ANNs that classify single cell types by assembly voting. Using a test case of PBMC classification, we demonstrated that ANN training, using large-scale data sets generated from different and unrelated single-cell studies (10x technology), are excellent multi-class predictors. The overall accuracy of ANN assembly voting is 95%, and the prediction results were reproducible across all training-testing cycles. A vast majority of correctly classified cells class show excellent agreements with individual predictors. The majority of misclassified cells also have uniform voting profiles. The assembly method shows a small but consistent improvement across all cycles and individual training-testing iterations. The assembly vote matches or exceeds the accuracy of the best individual ANN within a given cycle. In summary, we conclude that the proposed ANN method produces highly reproducible classification results. We estimate that 98% of correct predictions have uniform votes across assembly, and 2% have ambiguous votes. Approximately 80% of misclassified cells have uniform votes. We hypothesize that most of these represent subclasses that are not clearly defined in our data set, such as NK-like CD8+ T cells [19].
Further studies will focus on in-depth analysis of features that characterize misclassified cells and identify their true class. We estimate that, most likely, the true misclassification rate in our system is approximately 2%. We plan to deploy our system for the classification of other cell types with available SCT data sets.