New method for mathematical modelling of human visual speech

Sadaghiani, Mohammad Hossein/Mr. (2015) New method for mathematical modelling of human visual speech. PhD thesis, University of Nottingham.

PDF (Thesis - as examined) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Download (23MB) | Preview


Audio-visual speech recognition and visual speech synthesisers are used as interfaces between humans and machines. Such interactions specifically rely on the analysis and synthesis of both audio and visual information, which humans use for face-to-face communication. Currently, there is no global standard to describe these interactions nor is there a standard mathematical tool to describe lip movements. Furthermore, the visual lip movement for each phoneme is considered in isolation rather than a continuation from one to another. Consequently, there is no globally accepted standard method for representing lip movement during articulation. This thesis addresses these issues by designing a transcribed group of words, by mathematical formulas, and so introducing the concept of a visual word, allocating signatures to visual words and finally building a visual speech vocabulary database. In addition, visual speech information has been analysed in a novel way by considering both lip movements and phonemic structure of the English language. In order to extract the visual data, three visual features on the lip have been chosen; these are on the outer upper, lower and corner of the lip. The extracted visual data during articulation is called the visual speech sample set. The final visual data is obtained after processing the visual speech sample sets to correct experimented artefacts such as head tilting, which happened during articulation and visual data extraction. The ‘Barycentric Lagrange Interpolation’ (BLI) formulates the visual speech sample sets into visual speech signals. The visual word is defined in this work and consists of the variation of three visual features. Further processing on relating the visual speech signals to the uttered word leads to the allocation of signatures that represent the visual word. This work suggests the visual word signature can be used either as a ‘visual word barcode’, a ‘digital visual word’ or a ‘2D/3D representations’. The 2D version of the visual word provides a unique signature that allows the identification of the words being uttered. In addition, identification of visual words has also been performed using a technique called ‘volumetric representations of the visual words’. Furthermore, the effect of altering the amplitudes and sampling rate for BLI has been evaluated. In addition, the performance of BLI in reconstructing the visual speech sample sets has been considered. Finally, BLI has been compared to signal reconstruction approach by RMSE and correlation coefficients. The results show that the BLI is the more reliable method for the purpose of this work according to Section 7.7.

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Harrison, Ian
Subjects: T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7800 Electronics
Faculties/Schools: University of Nottingham, Malaysia > Faculty of Science and Engineering — Engineering > Department of Electrical and Electronic Engineering
Item ID: 29159
Date Deposited: 08 Apr 2016 10:49
Last Modified: 15 Jul 2021 14:06

Actions (Archive Staff Only)

Edit View Edit View