Jiménez Morales, Manuel Alejandro
(2020)
Novel automated classification approaches for citizen science.
PhD thesis, University of Nottingham.
Abstract
Citizen science, traditionally understood as the engagement of amateur participants in research, shows great potential for large-scale data processing. In areas such as astronomy, ecology, or geosciences, where emerging technologies generate huge volumes of data, citizen science projects enable image classification at a rate that experts alone could not achieve. Using the power of the web, virtual communities of volunteers sharing a common goal are able to coordinate the classification of hundreds of thousands of images in a reasonable amount of time. However, expert evaluations usually reveal biases and uncertainty in the results, since the participants involved are typically inexperienced in the task and have varied skills and backgrounds. Consequently, the research community tends to distrust citizen science outcomes, citing a general lack of accuracy and validation, and most of the resulting data is left unused once the projects end.
Citizen science also offers a large amount of labelled data at a reduced cost for training machine learning classifiers. Nonetheless, current efforts to exploit citizen science outcomes with machine learning tools have ignored the inherent uncertainty in the results, as well as the potential of expert classifications to ameliorate this issue. The ultimate goal has mainly been to replicate the amateur endeavours, thus propagating their biases and limitations into the automated classification. Similarly, the potential of learning from unlabelled data to alleviate this uncertainty has also been disregarded. This situation calls for a solution that can take advantage of all levels of knowledge: expert classifications, citizen science data, and unlabelled data. However, the synergy between these sources of data remains unexplored, awaiting the development of new methodologies that may lead to enhanced automated classification.
This thesis focuses on the development of automated approaches for classification problems aided by citizen science projects on the web, aiming to leverage the inherent uncertainty in the results and all levels of knowledge available about the problem. As a case study, we select the longest-running implementation of a scientific problem aided by modern citizen science: the classification of galaxies from images. We exploit the results of the first edition of Galaxy Zoo, a citizen science project that currently represents the largest manually annotated galaxy image database. The research proceeds through three progressive stages. First, we introduce a novel multi-stage approach to handle the uncertainty within data labelled in the course of citizen science projects. Our method proposes a set of transformations that leverage the uncertainty in amateur classifications, in conjunction with a hybridisation strategy that provides the best aggregation of the transformed data to improve the quality of, and confidence in, the results. The second stage comprises a thorough study of machine learning methods for image classification, introducing the use of autoencoders to learn from unlabelled data, and exploring learning from amateur and expert classifications through the pre-training and fine-tuning of convolutional neural networks. Finally, in the third stage of the research, the previous findings are combined to propose a solution to the newly defined learning paradigm, one that is able to exploit data labelled by experts and by amateurs in the course of citizen science projects, together with unlabelled data.
In summary, the research conducted here introduces a set of novel mechanisms towards improved automated classification based on citizen science data, expert classifications, and raw data. The proposed method for handling uncertainty improves accuracy and classifies a larger number of images than previous approaches. This is accomplished by taking advantage of the uncertainty measured by the participants themselves. The use of autoencoders greatly speeds up feature extraction with respect to state-of-the-art methods, and also reveals the potential of exploiting amateur and expert classifications with deep learning-based classifiers. Finally, a novel approach leverages all the insights previously found and presents an innovative setting for learning from expert classifications, amateur classifications, and unlabelled data that surpasses the performance obtained using such label sets separately or jointly. Together, these results constitute a comprehensive study of the automated classification of galaxy images that, building on state-of-the-art approaches, contributes new methods at the boundary between citizen science, astroinformatics, and machine learning.