Deep word embeddings for visual speech recognition

In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and...


Bibliographic Details
Main Authors: Stafylakis, Themos; Tzimiropoulos, Georgios
Format: Conference or Workshop Item
Language: English
Published: 2018
Subjects: Visual Speech Recognition; Lipreading; Word Embeddings; Deep Learning; Low-shot Learning
Online Access: https://eprints.nottingham.ac.uk/51133/
_version_ 1848798424524128256
author Stafylakis, Themos
Tzimiropoulos, Georgios
author_facet Stafylakis, Themos
Tzimiropoulos, Georgios
author_sort Stafylakis, Themos
building Nottingham Research Data Repository
collection Online Access
description In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system is comprised of a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs and is trained on the Lipreading in-the-wild database. We first show that the proposed architecture goes beyond state-of-the-art on closed-set word identification, by attaining 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings in modelling words unseen during training. We deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even in cases where the target words are not included in the training set.
first_indexed 2025-11-14T20:19:33Z
format Conference or Workshop Item
id nottingham-51133
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T20:19:33Z
publishDate 2018
recordtype eprints
repository_type Digital Repository
spelling nottingham-511332018-04-15T05:12:15Z https://eprints.nottingham.ac.uk/51133/ Deep word embeddings for visual speech recognition Stafylakis, Themos Tzimiropoulos, Georgios In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system is comprised of a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs and is trained on the Lipreading in-the-wild database. We first show that the proposed architecture goes beyond state-of-the-art on closed-set word identification, by attaining 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings in modelling words unseen during training. We deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even in cases where the target words are not included in the training set. 2018-04-15 Conference or Workshop Item PeerReviewed application/pdf en https://eprints.nottingham.ac.uk/51133/1/av_speech2.pdf Stafylakis, Themos and Tzimiropoulos, Georgios (2018) Deep word embeddings for visual speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 15-20 April 2018, Calgary, Alberta, Canada. Visual Speech Recognition Lipreading Word Embeddings Deep Learning Low-shot Learning
spellingShingle Visual Speech Recognition
Lipreading
Word Embeddings
Deep Learning
Low-shot Learning
Stafylakis, Themos
Tzimiropoulos, Georgios
Deep word embeddings for visual speech recognition
title Deep word embeddings for visual speech recognition
title_full Deep word embeddings for visual speech recognition
title_fullStr Deep word embeddings for visual speech recognition
title_full_unstemmed Deep word embeddings for visual speech recognition
title_short Deep word embeddings for visual speech recognition
title_sort deep word embeddings for visual speech recognition
topic Visual Speech Recognition
Lipreading
Word Embeddings
Deep Learning
Low-shot Learning
url https://eprints.nottingham.ac.uk/51133/