New method for mathematical modelling of human visual speech

Full description

Audio-visual speech recognition and visual speech synthesisers are used as interfaces between humans and machines. Such interactions rely on the analysis and synthesis of both the audio and the visual information that humans use in face-to-face communication. Currently, there is no global standard for describing these interactions, nor is there a standard mathematical tool for describing lip movements. Furthermore, the visual lip movement for each phoneme is treated in isolation rather than as a continuation from one phoneme to the next. Consequently, there is no globally accepted method for representing lip movement during articulation. This thesis addresses these issues by transcribing a group of words into mathematical formulas, thereby introducing the concept of a visual word, allocating signatures to visual words and finally building a visual speech vocabulary database. In addition, visual speech information has been analysed in a novel way that considers both the lip movements and the phonemic structure of the English language. To extract the visual data, three visual features on the lip were chosen: the outer upper lip, the outer lower lip and the corner of the lip. The visual data extracted during articulation is called the visual speech sample set. The final visual data is obtained after processing the visual speech sample sets to correct experimental artefacts, such as head tilting, that occurred during articulation and visual data extraction. ‘Barycentric Lagrange Interpolation’ (BLI) formulates the visual speech sample sets into visual speech signals. The visual word, as defined in this work, consists of the variation of the three visual features. Further processing, relating the visual speech signals to the uttered word, leads to the allocation of signatures that represent the visual word. This work suggests the visual word signature can be used as a ‘visual word barcode’, a ‘digital visual word’ or a ‘2D/3D representation’. The 2D version of the visual word provides a unique signature that allows the word being uttered to be identified. Identification of visual words has also been performed using a technique called ‘volumetric representation of the visual words’. Furthermore, the effect of altering the amplitudes and the sampling rate on BLI has been evaluated, and the performance of BLI in reconstructing the visual speech sample sets has been considered. Finally, BLI has been compared with a signal reconstruction approach using RMSE and correlation coefficients; as reported in Section 7.7, the results show that BLI is the more reliable method for the purposes of this work.
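
The two concrete numerical steps named in the abstract, barycentric Lagrange interpolation and the RMSE/correlation comparison, are standard techniques, and a minimal sketch may help make them concrete. The sketch below uses the usual second (true) barycentric form, with weights w_j = 1 / prod_{k != j} (x_j - x_k) and p(t) = sum_j (w_j y_j / (t - x_j)) / sum_j (w_j / (t - x_j)). The lip-feature trajectory, frame times and function names are illustrative assumptions, not the thesis's actual data or code.

```python
import numpy as np

def barycentric_weights(x):
    """Barycentric weights w_j = 1 / prod_{k != j} (x_j - x_k)."""
    n = len(x)
    w = np.ones(n)
    for j in range(n):
        for k in range(n):
            if k != j:
                w[j] /= (x[j] - x[k])
    return w

def barycentric_interpolate(x, y, w, t):
    """Evaluate p(t) = sum_j (w_j y_j / (t - x_j)) / sum_j (w_j / (t - x_j))."""
    t = np.asarray(t, dtype=float)
    p = np.empty_like(t)
    for i, ti in enumerate(t):
        diff = ti - x
        hit = np.nonzero(diff == 0.0)[0]
        if hit.size:                      # t falls exactly on a sample node
            p[i] = y[hit[0]]
        else:
            q = w / diff
            p[i] = np.dot(q, y) / np.sum(q)
    return p

# Hypothetical 'true' trajectory of one lip feature over a word (illustrative).
t_dense = np.linspace(0.0, 1.0, 200)
true_signal = np.sin(np.pi * t_dense) ** 2        # lips open, then close

# The 'visual speech sample set': the feature sampled at a few video frames.
frames = np.linspace(0.0, 1.0, 9)
samples = np.sin(np.pi * frames) ** 2

# BLI turns the discrete sample set into a continuous visual speech signal.
w = barycentric_weights(frames)
reconstruction = barycentric_interpolate(frames, samples, w, t_dense)

# Evaluate the reconstruction by RMSE and the Pearson correlation coefficient,
# the two measures the abstract says were used to compare methods.
rmse = np.sqrt(np.mean((reconstruction - true_signal) ** 2))
corr = np.corrcoef(reconstruction, true_signal)[0, 1]
print(f"RMSE = {rmse:.3g}, correlation = {corr:.4f}")
```

In this sketch, the abstract's experiments on altering amplitudes and sampling rate would correspond to perturbing the values in `samples` or changing the number of points in `frames` and observing the effect on RMSE and correlation.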

Bibliographic Details
Main Author: Sadaghiani, Mohammad Hossein
Format: Thesis (University of Nottingham only)
Language: English
Published: 2015
Citation: Sadaghiani, Mohammad Hossein (2015) New method for mathematical modelling of human visual speech. PhD thesis, University of Nottingham.
Online Access: https://eprints.nottingham.ac.uk/29159/
Full text (PDF): https://eprints.nottingham.ac.uk/29159/1/New%20Method%20for%20Mathematical%20Modelling%20of%20Human%20Visual%20Speech.pdf