Combining residual networks with LSTMs for lipreading

Stafylakis, Themos and Tzimiropoulos, Georgios (2017) Combining residual networks with LSTMs for lipreading. In: Interspeech 2017, 20-24 August 2017, Stockholm, Sweden. (In Press)

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary, consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art.
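
To put the reported figures in context, the short Python sketch below works out what they imply. It uses only the numbers quoted in the abstract (83.0% accuracy, 6.8% absolute gain, 500-word vocabulary). The variable names are illustrative, and the chance-level figure assumes a uniform guess over the vocabulary, which the abstract does not state.

```python
# Figures quoted in the abstract, in percent.
proposed_accuracy = 83.0
absolute_gain = 6.8
vocabulary_size = 500

# Accuracy of the previous best system implied by the absolute gain.
baseline_accuracy = round(proposed_accuracy - absolute_gain, 1)

# Word error rates implied by the two accuracies.
proposed_error = round(100.0 - proposed_accuracy, 1)
baseline_error = round(100.0 - baseline_accuracy, 1)

# Relative reduction in word error rate, in percent.
relative_error_reduction = round(
    100.0 * (baseline_error - proposed_error) / baseline_error, 1
)

# Assumption: chance level for a uniform random guess over the vocabulary.
chance_accuracy = 100.0 / vocabulary_size

print(f"implied previous accuracy: {baseline_accuracy}%")
print(f"word error rate: {baseline_error}% -> {proposed_error}%")
print(f"relative error reduction: {relative_error_reduction}%")
print(f"chance-level accuracy (uniform guess): {chance_accuracy}%")
```

Read this way, the 6.8-point absolute gain corresponds to removing a little over a quarter of the word errors made by the previous best system.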

Item Type: Conference or Workshop Item (Paper)
Keywords: visual speech recognition, lipreading, deep learning
Schools/Departments: University of Nottingham, UK > Faculty of Science > School of Computer Science
Depositing User: Tzimiropoulos, Yorgos
Date Deposited: 10 Aug 2017 11:09
Last Modified: 11 Aug 2017 03:35
