Towards vocal tract MRI synthesis from facial signals using external to internal correlation modelling

Bibliographic Details
Main Author: Shahid, Muhammad Suhaib
Format: Thesis (University of Nottingham only)
Language: English
Published: 2025
Online Access: https://eprints.nottingham.ac.uk/81139/
Description
Summary: Oral health underpins everyday functions such as speech, mastication and swallowing, yet acquiring detailed kinematic data on the vocal tract remains technically and financially demanding. Ultrasound and electromagnetic articulography offer only partial coverage, while Real-Time Magnetic Resonance Imaging (RtMRI) delivers richer information but requires expensive scanners and bespoke acquisition protocols. These constraints limit large-scale studies and the routine use of dynamic vocal-tract models in both research and clinical practice. Motivated by the need for an affordable, non-invasive alternative, this thesis introduces External to Internal Correlation Modelling (E2ICM), a novel framework that learns correlations between external facial signals and internal articulator motion, enabling vocal-tract modelling without direct imaging. The work pursues four objectives: (i) advanced segmentation of RtMRI sequences, (ii) quantification of articulator interdependencies, (iii) prediction of internal motion from purely external observations, and (iv) ethical evaluation of AI-driven approaches in oral healthcare. Both static and temporal segmentation pipelines are developed for RtMRI data. Generative adversarial networks and diffusion models are then employed to synthesise internal views from facial video, addressing data scarcity through tailored augmentation strategies. A thematic analysis of professional interviews highlights concerns around privacy, security and algorithmic bias, informing an ethical framework for clinical deployment. A key contribution is a dual-view dataset comprising synchronised high-resolution RtMRI and external video captured during controlled speech and chewing tasks. Experimental results demonstrate that E2ICM can predict vocal-tract configurations with promising accuracy while reducing reliance on costly imaging. Improved segmentation techniques and a deeper understanding of articulator dynamics further advance the state of the art in non-invasive oral-movement modelling.
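
The abstract names GAN- and diffusion-based synthesis of internal views from facial video but does not describe an architecture. As a purely illustrative sketch, the conditional image-to-image mapping at the heart of such a pipeline could resemble the PyTorch module below, which encodes an external facial frame and decodes a corresponding midsagittal RtMRI-style frame; the class name, layer sizes and frame resolution are assumptions for illustration, not the thesis's actual E2ICM implementation.

    # Hypothetical sketch of an external-to-internal generator: the
    # architecture, names and sizes are illustrative assumptions and
    # do not reflect the E2ICM implementation described in the thesis.
    import torch
    import torch.nn as nn

    class FacialToVocalTractGenerator(nn.Module):
        """Encoder-decoder: 3-channel facial frame -> 1-channel MRI-style frame."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(  # downsample the external view
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(  # upsample to the internal view
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh(),
            )

        def forward(self, facial_frames: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(facial_frames))

    if __name__ == "__main__":
        g = FacialToVocalTractGenerator()
        batch = torch.randn(2, 3, 128, 128)  # two synthetic facial frames
        print(g(batch).shape)                # torch.Size([2, 1, 128, 128])

In a GAN setting, a generator of this kind would be trained adversarially against a discriminator on the synchronised dual-view recordings the thesis describes, with temporal context added to handle video rather than single frames.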