Sun, Ke
(2019)
An investigation into visual content understanding based on deep learning and natural language processing.
PhD thesis, University of Nottingham.
Abstract
A long-standing goal of artificial intelligence is to enable machines to perceive the visual world and interact with humans using natural language. To achieve this goal, many computer vision and natural language processing techniques have been proposed over the past decades, especially deep convolutional neural networks (CNNs). However, most previous work treats the two sides separately, and little has been done to connect the vision and language modalities. Hence, the semantic gap between the two modalities remains.
To address this, the overall objective of my PhD research is to design machine learning algorithms for visual content understanding that connect the vision and language modalities. Towards this goal, we have developed several deep learning models combined with natural language processing techniques to represent and analyze image/video and text data. We focus on a series of applications to demonstrate the effectiveness of the proposed models and obtain promising results.
Firstly, we show that tag-based image annotations exhibit many limitations for visual content representation, and we develop techniques to discover visual themes as an alternative by re-organizing the original image and tag set into a group of visual themes. More concretely, we extract visual features and semantic features from two trained deep learning models respectively, and design a method to effectively evaluate the similarity between a pair of tags both visually and semantically. We then cluster the tags into a set of visual themes based on their joint similarities. We conduct both human-based and machine-based evaluations to demonstrate the usefulness and rationality of the discovered visual themes and to indicate their potential for automatically managing user photos.
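A minimal sketch of the joint tag-similarity idea described above is given below. The visual and semantic tag vectors are random placeholders (in the thesis they come from two trained deep models), and names such as alpha and n_themes are illustrative assumptions rather than the thesis's settings.

```python
# Sketch: combine visual and semantic tag similarity, then cluster tags into themes.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
tags = ["beach", "sunset", "dog", "puppy", "skyline", "night"]
visual = rng.normal(size=(len(tags), 128))    # placeholder CNN-derived tag features
semantic = rng.normal(size=(len(tags), 300))  # placeholder word-embedding features

alpha = 0.5  # assumed weight balancing visual vs. semantic similarity
joint_sim = alpha * cosine_similarity(visual) + (1 - alpha) * cosine_similarity(semantic)
joint_sim = np.clip(joint_sim, 0, None)  # spectral clustering needs a non-negative affinity

n_themes = 3  # assumed number of visual themes
themes = SpectralClustering(n_clusters=n_themes, affinity="precomputed",
                            random_state=0).fit_predict(joint_sim)
for t in range(n_themes):
    print("theme", t, [tag for tag, c in zip(tags, themes) if c == t])
```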
Secondly, we develop a novel framework for understanding complex video scenes involving objects, humans, scene backgrounds and the interactions between them. In this work, we propose to automatically discover semantic information for these videos in an unsupervised manner. We do this by introducing a set of semantic attributes derived from a joint image and text corpus. We then re-train a deep convolutional neural network to produce visual and semantic features simultaneously. The trained model encodes complex information including the whole visual scene, object appearance/properties and motion. We apply the model to the video summarization problem by adopting a partially near-duplicate image discovery technique to cluster visually and semantically consistent video frames together. The experimental results demonstrate the effectiveness of the semantic attributes in assisting the visual features for video summarization, and our new technique (SASUM) achieves state-of-the-art performance.
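The following is a toy sketch of the summarization step only: frames described by concatenated visual and semantic-attribute features are grouped, and the frame closest to each group centre is kept as a keyframe. All features, dimensions and the clustering choice are placeholder assumptions, not the SASUM pipeline itself.

```python
# Sketch: cluster jointly-described frames and pick one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_frames = 200
visual_feats = rng.normal(size=(n_frames, 512))    # placeholder CNN features
semantic_attrs = rng.random(size=(n_frames, 50))   # placeholder attribute scores
frames = np.hstack([visual_feats, semantic_attrs]) # joint representation per frame

n_segments = 8  # assumed summary length
km = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit(frames)

keyframes = []
for c in range(n_segments):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(frames[members] - km.cluster_centers_[c], axis=1)
    keyframes.append(int(members[np.argmin(dists)]))  # frame nearest the cluster centre
print("selected keyframe indices:", sorted(keyframes))
```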
Thirdly, we depict and interpret traffic scenes using vehicle objects. We first collect an hour-long traffic video with a resolution of 3840×2160 at 5 busy intersections of a megacity by flying a UAV during rush hours. We then build a UavCT dataset containing over 64K annotated vehicles in 17K 512×512 images. Next, we design and train a deep convolutional neural network from scratch to detect and localize different types of road vehicles (i.e. cars, buses and trucks), and propose a fast online tracking method to track and count vehicles in consecutive video frames. We design vehicle counting experiments on both image and video data to demonstrate the effectiveness of the proposed method.
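To illustrate the tracking-and-counting idea, here is a much simplified IoU-based online tracker and counter. The detections, box format and threshold are illustrative assumptions; this is not the thesis's tracking method.

```python
# Sketch: link per-frame detections greedily by IoU and count distinct vehicles.
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def count_vehicles(frame_detections, iou_thresh=0.3):
    """Each unmatched detection starts a new track; the count is all tracks ever started."""
    tracks, total = [], 0
    for boxes in frame_detections:
        matched, new_tracks = set(), []
        for box in boxes:
            best, best_iou = None, iou_thresh
            for i, tb in enumerate(tracks):
                if i not in matched and iou(box, tb) > best_iou:
                    best, best_iou = i, iou(box, tb)
            if best is None:
                total += 1           # new vehicle enters the scene
            else:
                matched.add(best)    # continuation of an existing track
            new_tracks.append(box)
        tracks = new_tracks
    return total

# toy detections over three frames of a single slowly moving vehicle
frames = [[(10, 10, 50, 40)], [(14, 10, 54, 40)], [(18, 11, 58, 41)]]
print(count_vehicles(frames))  # -> 1
```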
Fourthly, we extend the work of the previous stage and explore further applications of our deep model for traffic scene understanding. Specifically, we first track all the target vehicles in the original video, and then recognize vehicle behaviors based on nearest neighbor search (clustering) and a bidirectional long short-term memory network (classification). Through comparative studies, we further demonstrate the effectiveness and versatility of our approach for object-based visual scene understanding.
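Below is a minimal PyTorch sketch of the sequence-classification idea: a bidirectional LSTM maps a vehicle trajectory (represented here, purely as an assumed input format, by a sequence of 2-D positions) to a behavior label. The architecture sizes and class count are illustrative, not the thesis's model.

```python
# Sketch: BiLSTM over a trajectory sequence for behavior classification.
import torch
import torch.nn as nn

class BehaviourBiLSTM(nn.Module):
    def __init__(self, in_dim=2, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # 2x for the two directions

    def forward(self, x):              # x: (batch, time, in_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # classify from the final time step

model = BehaviourBiLSTM()
trajectories = torch.randn(8, 30, 2)   # 8 toy trajectories, 30 time steps each
logits = model(trajectories)
print(logits.shape)                    # torch.Size([8, 4])
```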
Lastly, we focus on TV series video understanding and develop techniques to recognize characters in these videos. Using label-level supervision, we transform the problem into multi-label classification and design a novel semantic projection network (SPNet) consisting of two stacked subnetworks with specially designed constraints. The first subnetwork aims to reconstruct the input feature activations from a trained single-label CNN, and the second functions as a multi-label classifier which predicts the character labels while also reconstructing the input visual features from the mapped semantic label space. Through experiments on three popular TV series video datasets, we show that this kind of mutual projection significantly benefits character recognition. We also show that a region-based prediction strategy can further improve the overall performance.
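As a rough illustration of coupling multi-label prediction with reconstruction of the visual features from the label space, here is a toy PyTorch sketch. It is not the thesis's SPNet; the network name, dimensions and loss weighting are all hypothetical.

```python
# Sketch: multi-label prediction plus feature reconstruction from the label space.
import torch
import torch.nn as nn

class TinyProjectionNet(nn.Module):
    def __init__(self, feat_dim=2048, n_labels=10):
        super().__init__()
        self.to_labels = nn.Linear(feat_dim, n_labels)  # visual features -> label space
        self.to_feats = nn.Linear(n_labels, feat_dim)   # label space -> visual features

    def forward(self, feats):
        logits = self.to_labels(feats)
        recon = self.to_feats(torch.sigmoid(logits))
        return logits, recon

net = TinyProjectionNet()
feats = torch.randn(4, 2048)                    # placeholder CNN activations
targets = torch.randint(0, 2, (4, 10)).float()  # placeholder multi-hot character labels
logits, recon = net(feats)
loss = nn.BCEWithLogitsLoss()(logits, targets) + 0.1 * nn.MSELoss()(recon, feats)
loss.backward()
print(float(loss))
```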
Our work contributes to a developing research field that demonstrates the power of deep learning techniques in solving different visual recognition problems by connecting the vision and language modalities, and advances the state of the art on various tasks.