Combining residual networks with LSTMs for lipreading

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state-of-the-art.
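The abstract describes a pipeline of a spatiotemporal (3D) convolutional front-end, a per-frame residual network, and a bidirectional LSTM feeding a 500-way word classifier. A minimal shape-trace sketch of that data flow is below; the kernel, stride, padding, frame count, and hidden sizes are illustrative assumptions, not values taken from the paper (only the 500-word vocabulary comes from the abstract).

```python
def conv3d_out(t, h, w, kernel=(5, 7, 7), stride=(1, 2, 2), pad=(2, 3, 3)):
    """Output (T, H, W) of a 3D convolution, standard shape arithmetic."""
    return tuple((x + 2 * p - k) // s + 1
                 for x, k, s, p in zip((t, h, w), kernel, stride, pad))

def lipreading_shapes(frames=29, height=112, width=112,
                      resnet_channels=512, lstm_hidden=256, vocab=500):
    """Trace tensor shapes: 3D conv -> per-frame ResNet -> BiLSTM -> logits."""
    t, h, w = conv3d_out(frames, height, width)  # spatiotemporal front-end
    # The 2D residual network collapses each frame's spatial dimensions
    # (e.g. via global average pooling), leaving one vector per time step.
    per_frame = (t, resnet_channels)
    # A bidirectional LSTM concatenates forward and backward hidden states.
    bilstm = (t, 2 * lstm_hidden)
    # Word-level classification: one score per vocabulary entry.
    logits = (vocab,)
    return {"frontend": (t, h, w), "resnet": per_frame,
            "bilstm": bilstm, "logits": logits}
```

With these assumed settings, a 29-frame 112x112 mouth-region clip keeps its 29 time steps throughout (the temporal stride is 1), so the recurrent back-end sees one feature vector per video frame.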

Bibliographic Details
Main Authors: Stafylakis, Themos, Tzimiropoulos, Georgios
Format: Conference or Workshop Item
Published: 2017
Subjects: visual speech recognition; lipreading; deep learning
Online Access:https://eprints.nottingham.ac.uk/44756/
Citation: Stafylakis, Themos and Tzimiropoulos, Georgios (2017) Combining residual networks with LSTMs for lipreading. In: Interspeech 2017, 20-24 August 2017, Stockholm, Sweden. (Peer reviewed; deposited 2017-05-22.)
Keywords: visual speech recognition; lipreading; deep learning
Archive: http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0085.html