Self-supervised learning for automatic speech recognition in low-resource environments

Bibliographic Details
Main Author: Fatehi, Kavan
Format: Thesis (University of Nottingham only)
Language: English
Published: 2024
Subjects:
Online Access: https://eprints.nottingham.ac.uk/77884/
_version_ 1848801031138312192
author Fatehi, Kavan
author_facet Fatehi, Kavan
author_sort Fatehi, Kavan
building Nottingham Research Data Repository
collection Online Access
description Supervised deep neural networks trained on substantial amounts of annotated speech data have demonstrated impressive performance across a spectrum of spoken language processing applications, frequently establishing themselves as the leading models in their respective competitions. Nonetheless, a significant challenge arises from the heavy reliance on extensive annotated data for training these systems. This reliance imposes a scalability limitation, hindering the continual enhancement of state-of-the-art performance. Moreover, it presents a more fundamental obstacle to deploying deep neural networks in speech-related domains where acquiring labeled data is inherently arduous, expensive, or time-intensive; we treat these as low-resource ASR problems in this thesis. Unlike annotated speech data, untranscribed audio is typically far cheaper to collect. In this thesis, we investigate the application of self-supervised learning to low-resource tasks, a learning approach in which the training objective is derived directly from the input data itself. We employ this method to harness the scalability and affordability of untranscribed audio in problems where training data is scarce, with the goal of enhancing the performance of spoken language technology. In particular, we propose three self-supervised methodologies: one model is based on the concept of two fine-tuning steps, while the other two revolve around identifying improved hidden units. These approaches are designed to learn contextualized speech representations from unannotated speech data. We demonstrate the capacity of our self-supervised techniques to learn representations that capture the higher-level characteristics of speech signals more effectively than conventional acoustic features. Additionally, we show how these representations improve the performance of deep neural networks on ASR tasks with limited resources. 
Beyond introducing novel learning algorithms, we conduct in-depth analyses to understand the properties of the learned self-supervised representations and to elucidate the design elements that distinguish one self-supervised model from another.
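The abstract describes self-supervised objectives derived from the input data itself, two of them built around discrete "hidden units". The record does not spell out the thesis's actual objective, but the general idea (as in HuBERT-style masked prediction of clustered pseudo-labels) can be sketched roughly as follows; the function names, the toy k-means, and the zero-vector masking are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def derive_hidden_units(frames, k, iters=10, seed=0):
    """Toy k-means over acoustic frames; the cluster ids serve as
    discrete pseudo-labels ("hidden units") for a self-supervised target."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to its nearest centroid
        dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for j in range(k):
            members = frames[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels

def make_masked_prediction_pairs(frames, labels, mask_prob=0.3, seed=0):
    """Mask a random subset of frames; the self-supervised objective is to
    predict each masked frame's hidden unit from the surrounding context."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(frames)) < mask_prob
    inputs = frames.copy()
    inputs[mask] = 0.0                    # masked frames become a zero embedding
    targets = np.where(mask, labels, -1)  # -1 marks positions ignored by the loss
    return inputs, targets, mask
```

Under this sketch, a context network would be trained with cross-entropy on the positions where `targets != -1`, and then fine-tuned on the limited transcribed data of the low-resource task.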
first_indexed 2025-11-14T21:00:59Z
format Thesis (University of Nottingham only)
id nottingham-77884
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T21:00:59Z
publishDate 2024
recordtype eprints
repository_type Digital Repository
spelling nottingham-77884 2024-07-23T04:40:25Z https://eprints.nottingham.ac.uk/77884/ Self-supervised learning for automatic speech recognition in low-resource environments Fatehi, Kavan Supervised deep neural networks trained on substantial amounts of annotated speech data have demonstrated impressive performance across a spectrum of spoken language processing applications, frequently establishing themselves as the leading models in their respective competitions. Nonetheless, a significant challenge arises from the heavy reliance on extensive annotated data for training these systems. This reliance imposes a scalability limitation, hindering the continual enhancement of state-of-the-art performance. Moreover, it presents a more fundamental obstacle to deploying deep neural networks in speech-related domains where acquiring labeled data is inherently arduous, expensive, or time-intensive; we treat these as low-resource ASR problems in this thesis. Unlike annotated speech data, untranscribed audio is typically far cheaper to collect. In this thesis, we investigate the application of self-supervised learning to low-resource tasks, a learning approach in which the training objective is derived directly from the input data itself. We employ this method to harness the scalability and affordability of untranscribed audio in problems where training data is scarce, with the goal of enhancing the performance of spoken language technology. In particular, we propose three self-supervised methodologies: one model is based on the concept of two fine-tuning steps, while the other two revolve around identifying improved hidden units. These approaches are designed to learn contextualized speech representations from unannotated speech data. 
We demonstrate the capacity of our self-supervised techniques to learn representations that capture the higher-level characteristics of speech signals more effectively than conventional acoustic features. Additionally, we show how these representations improve the performance of deep neural networks on ASR tasks with limited resources. Beyond introducing novel learning algorithms, we conduct in-depth analyses to understand the properties of the learned self-supervised representations and to elucidate the design elements that distinguish one self-supervised model from another. 2024-07-23 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en cc_by https://eprints.nottingham.ac.uk/77884/1/Fatehi%2CKavan%2C20167617%2Ccorrections.pdf Fatehi, Kavan (2024) Self-supervised learning for automatic speech recognition in low-resource environments. PhD thesis, University of Nottingham. Automatic Speech Recognition Low-resource Environment Self-Supervised Learning
spellingShingle Automatic Speech Recognition
Low-resource Environment
Self-Supervised Learning
Fatehi, Kavan
Self-supervised learning for automatic speech recognition in low-resource environments
title Self-supervised learning for automatic speech recognition in low-resource environments
title_full Self-supervised learning for automatic speech recognition in low-resource environments
title_fullStr Self-supervised learning for automatic speech recognition in low-resource environments
title_full_unstemmed Self-supervised learning for automatic speech recognition in low-resource environments
title_short Self-supervised learning for automatic speech recognition in low-resource environments
title_sort self-supervised learning for automatic speech recognition in low-resource environments
topic Automatic Speech Recognition
Low-resource Environment
Self-Supervised Learning
url https://eprints.nottingham.ac.uk/77884/