Video summarisation techniques on user-generated videos and character-oriented TV series

Lei, Zhuo (2020) Video summarisation techniques on user-generated videos and character-oriented TV series. PhD thesis, University of Nottingham.

Abstract

Video summarisation techniques convey the significant content of an original video by sampling frames to reduce its length. Traditional research focuses on edited videos, but the exponential growth of user-generated videos makes traditional methods ineffective, since such videos are often unedited and contain only one shot. Moreover, it is difficult to establish universal criteria for generating summaries: even a single user may give different definitions for the same video. Thus, user-generated video summarisation is a subjective task. In recent years, some researchers have summarised videos by concentrating on specified persons or objects. Hence, TV episodes, a rich source of varied and abundant information, are chosen as the experimental material for finding all occurrences of specified characters.

Therefore, one objective of this research is to temporally segment user-generated videos to facilitate video summarisation, and to select significant content to generate summaries. The other goal is to identify all occurrences of specified characters in TV episodes, which is a preliminary step towards future work on person-oriented summarisation for user-generated videos. Overall, several methods for user-generated video summarisation and a character identification method for TV series have been developed, which are presented as the following four novel contributions:

The first contribution is to divide user-generated videos into segments in order to facilitate video summarisation. First, a tree partitioning min-hash method is used to find similar frames. Second, a full affinity graph is hierarchically split into several sparse, time-constrained subgraphs. Third, frames are clustered into short segments by identifying the local bundling centre. Finally, these short segments are merged into long segments. Multiple participants were asked to annotate human perceptual cuts for each video in the experimental dataset. Comprehensive evaluation demonstrates the superiority of the proposed method over state-of-the-art methods and shows that it attains performance close to the average human level. (Chapter 3)
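The similar-frame search in the first step can be illustrated with a minimal sketch of plain min-hashing. This is not the tree partitioning variant used in the thesis: the frame representation (a set of integer "visual word" IDs), the linear hash family, and all function names below are assumptions made for illustration only.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, params):
    """Keep, for each hash function, the smallest hash over the feature set."""
    return [min((a * x + b) % PRIME for x in features) for a, b in params]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


# Hypothetical frames described by sets of visual-word IDs (assumed input format).
params = make_hash_params(128)
frame_a = {1, 2, 3, 4, 5, 6, 7, 8}
frame_b = {1, 2, 3, 4, 5, 6, 7, 9}   # shares most words with frame_a
frame_c = {101, 102, 103, 104}       # shares no words with frame_a

sig_a = minhash_signature(frame_a, params)
sig_b = minhash_signature(frame_b, params)
sig_c = minhash_signature(frame_c, params)

sim_ab = estimated_similarity(sig_a, sig_b)
sim_ac = estimated_similarity(sig_a, sig_c)
```

Frames whose signatures agree in many slots are candidates for an edge in the affinity graph; comparing short signatures is cheaper than comparing the full feature sets.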

The second contribution extends the first framework by combining deep visual features with semantic features. First, a pre-trained CNN architecture is used to extract features and construct a visual affinity graph. Second, a semantic affinity graph is constructed via a trained word embedding model. Third, the two graphs are merged into a joint matrix, to which a temporal constraint matrix is applied. Finally, frames are clustered into segments by finding the local bundling centre. Experimental results show that the proposed framework outperforms all the state-of-the-art methods and exhibits high consistency. (Chapter 4)
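The merging of the two graphs under a temporal constraint can be sketched as follows. The abstract does not state how the graphs are combined, so the weighted average (`alpha`), the hard temporal window (`window`), the use of cosine similarity, and the toy feature vectors are all assumptions, not the thesis's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def affinity_matrix(features):
    """Pairwise cosine affinities between per-frame feature vectors."""
    n = len(features)
    return [[cosine(features[i], features[j]) for j in range(n)]
            for i in range(n)]


def joint_affinity(visual, semantic, alpha=0.5, window=2):
    """Weighted average of two affinity matrices, kept only for frame pairs
    at most `window` frames apart (an assumed form of temporal constraint)."""
    n = len(visual)
    joint = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= window:
                joint[i][j] = alpha * visual[i][j] + (1 - alpha) * semantic[i][j]
    return joint


# Toy stand-ins for CNN features and word-embedding features of four frames.
visual = affinity_matrix([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
semantic = affinity_matrix([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
joint = joint_affinity(visual, semantic, alpha=0.5, window=2)
```

In this toy case, frames 0 and 3 are more than `window` frames apart, so their joint affinity is zero regardless of how similar they look.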

For the third contribution, a new benchmark dataset, UGSum52, is introduced for user-generated video summarisation. To the best of the author’s knowledge, it is the largest dataset with multiple human-generated video summaries for each video. Moreover, a new unsupervised method for user-generated video summarisation is developed. First, the input video is partitioned into segments based on deep semantic similarity, using a dense-neighbour-based clustering method. Second, a graph-based ranking method - FrameRank - is built to rank these segments. Finally, segments with high information scores are sampled to generate the final video summary. On the proposed UGSum52 and two other existing datasets, the proposed FrameRank method achieves better performance than the state-of-the-art methods. (Chapter 5)
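Graph-based ranking of segments can be illustrated with a generic PageRank-style power iteration. This is a stand-in, not the FrameRank algorithm itself, whose details the abstract does not give; the damping factor, the similarity values, and the function names are assumptions.

```python
def rank_segments(similarity, damping=0.85, iterations=50):
    """PageRank-style scores on a weighted segment-similarity graph."""
    n = len(similarity)
    # Drop self-similarity so a segment cannot vote for itself.
    weights = [[similarity[i][j] if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = sum(scores[j] * weights[j][i] / out_sums[j]
                           for j in range(n) if out_sums[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def select_summary(scores, budget):
    """Pick the `budget` highest-scoring segments, returned in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)


# Hypothetical similarities between four segments; segment 3 is an outlier.
similarity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
scores = rank_segments(similarity)
summary = select_summary(scores, budget=2)
```

Segments that are strongly connected to many other segments accumulate higher scores, so a weakly connected outlier is ranked last and left out of a short summary.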

The fourth contribution explores the problem of finding all occurrences of specified characters in TV episodes. Three TV series datasets are proposed for the character identification task. An automatic identification system for finding all the frames containing the specified characters is also developed. First, an input video is segmented into shots. Second, all the persons in each shot are detected to generate recognised person-tracks, and labels are assigned to the corresponding frames using the single query image. Third, new query seeds are created from the identified images to re-rank the remaining unlabelled images, in which faces may be difficult to detect or recognise. In addition, a dense-neighbour-based clustering method and a keyframe selection method are used to reduce the computation cost and time. Different ranking strategies are compared, and the experimental results show that the proposed re-ranking process substantially improves on methods that can only locate frames with faces. Furthermore, although the clustering and keyframe selection steps reduce the accuracy of the system, they still achieve comparable results. (Chapter 6)
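The re-ranking step, in which already identified images act as new query seeds, can be sketched as a simple query-expansion scheme. The scoring rule (maximum cosine similarity to any seed), the feature vectors, and the frame names are illustrative assumptions; the abstract does not specify the ranking strategy used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank_unlabelled(seed_features, unlabelled):
    """Rank unlabelled images by their best similarity to any seed image.

    seed_features: feature vectors of images already identified as the character.
    unlabelled: mapping from a frame name to its feature vector.
    Returns (name, score) pairs, most similar first.
    """
    scored = [(name, max(cosine(vec, seed) for seed in seed_features))
              for name, vec in unlabelled.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical appearance features: seeds come from frames labelled via the query image.
seeds = [[1.0, 0.0], [0.9, 0.1]]
unlabelled = {
    "frame_12": [0.8, 0.2],   # no detectable face, but similar appearance
    "frame_40": [0.0, 1.0],   # dissimilar appearance
}
ranking = rerank_unlabelled(seeds, unlabelled)
```

Using several seeds instead of the single query image lets frames without a recognisable face still be retrieved when they resemble any confidently identified frame.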

Future work could address person-oriented or object-oriented summaries for user-generated videos. As mentioned above, Chapter 6 is a preliminary step towards this goal. In detail, important frames can be selected to focus on some specified persons according to their detected

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Qiu, Guoping
Shen, Linlin
Valstar, Michel
Subjects: Q Science > QA Mathematics
Faculties/Schools: UNNC Ningbo, China Campus > Faculty of Science and Engineering > School of Computer Science
Item ID: 59938
Depositing User: LEI, Zhuo
Date Deposited: 20 Feb 2020 02:30
Last Modified: 06 May 2020 11:01
URI: https://eprints.nottingham.ac.uk/id/eprint/59938
