Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks



Bibliographic Details
Main Authors: Eyben, F., Petridis, S., Schuller, Björn, Tzimiropoulos, Georgios, Zafeiriou, Stefanos, Pantic, Maja
Format: Conference or Workshop Item
Published: 2011
Online Access: https://eprints.nottingham.ac.uk/31428/
Description
Summary: We investigate the classification of non-linguistic vocalisations with a novel audiovisual approach and Long Short-Term Memory (LSTM) Recurrent Neural Networks, which are highly successful dynamic sequence classifiers. The Audiovisual Interest Corpus of natural human-to-human conversation from this year's Paralinguistic Challenge serves as the evaluation database. For video-based analysis we compare shape-based and appearance-based features. These are fused with typical audio descriptors in an early, feature-level manner. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More importantly, we show a significant gain in performance when fusing audio and visual shape features.
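
The early (feature-level) fusion mentioned in the summary amounts to concatenating the frame-aligned audio and visual feature vectors into one joint vector per frame before sequence classification. A minimal sketch of this idea, with purely illustrative feature dimensions (the function name and all sizes are assumptions, not taken from the paper):

```python
import numpy as np

def early_fuse(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Frame-wise early (feature-level) fusion: concatenate the audio and
    visual feature vectors of each frame along the feature axis."""
    # Both streams must be aligned to the same number of frames.
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=1)

# Hypothetical example: 50 frames, 39 audio descriptors, 20 visual shape
# parameters -- illustrative values only.
audio = np.random.randn(50, 39)
video = np.random.randn(50, 20)
fused = early_fuse(audio, video)  # one joint (39 + 20)-dimensional vector per frame
```

The fused sequence would then be fed frame by frame into a sequence classifier such as an LSTM network, which can exploit temporal context that a static per-frame classifier cannot.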