Combining residual networks with LSTMs for lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state-of-the-art.
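The abstract outlines the pipeline at a high level; the sketch below illustrates one plausible PyTorch realisation of it: a spatiotemporal (3D) convolutional front-end, a 2D residual network applied frame by frame, a bidirectional LSTM over the resulting feature sequence, and a 500-way word classifier. The layer sizes, ResNet depth (ResNet-18 here), kernel shapes, and temporal pooling are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the kind of architecture the abstract describes.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class LipreadingNet(nn.Module):
    def __init__(self, num_words=500, lstm_hidden=256):
        super().__init__()
        # Spatiotemporal front-end: 3D convolution over (time, height, width).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D ResNet trunk applied per frame (ResNet-18 here as a stand-in;
        # conv1 and the final fc layer are dropped, keeping bn1 through avgpool).
        resnet = models.resnet18(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[1:-1])
        # Bidirectional LSTM over the per-frame feature sequence.
        self.blstm = nn.LSTM(512, lstm_hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_words)

    def forward(self, x):
        # x: (batch, 1, frames, H, W) grayscale mouth-region crops.
        b = x.size(0)
        x = self.frontend(x)                   # (b, 64, T, H', W')
        t = x.size(2)
        x = x.transpose(1, 2).reshape(b * t, 64, x.size(3), x.size(4))
        x = self.trunk(x).flatten(1)           # (b*T, 512)
        x = x.view(b, t, 512)
        x, _ = self.blstm(x)                   # (b, T, 2*lstm_hidden)
        return self.classifier(x.mean(dim=1))  # average over time, then classify


# Example: a batch of 2 clips, 29 grayscale frames of 112x112 pixels.
logits = LipreadingNet()(torch.randn(2, 1, 29, 112, 112))
print(logits.shape)  # torch.Size([2, 500])
```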
| Main Authors: | Stafylakis, Themos; Tzimiropoulos, Georgios |
|---|---|
| Format: | Conference or Workshop Item (peer reviewed) |
| Published: | 2017 (Interspeech 2017, 20-24 August 2017, Stockholm, Sweden) |
| Subjects: | visual speech recognition; lipreading; deep learning |
| Online Access: | https://eprints.nottingham.ac.uk/44756/ |
| | http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0085.html |