Deep learning-based audio-visual speech recognition for Bosnian digits

This study presents a deep learning-based solution for audio-visual speech recognition of Bosnian digits. The task posed a challenge due to the lack of an appropriate Bosnian language dataset, and this study outlines the approach to building a new dataset. The proposed solution includes two comp...


Bibliographic Details
Main Authors: Husein Fazlić, Ali Abd Almisreb, Nooritawati Md Tahir
Format: Article
Language: English
Published: Penerbit Universiti Kebangsaan Malaysia 2024
Online Access: http://journalarticle.ukm.my/25132/
http://journalarticle.ukm.my/25132/1/14.pdf
_version_ 1848816277970223104
author Husein Fazlić
Ali Abd Almisreb
Nooritawati Md Tahir
author_facet Husein Fazlić
Ali Abd Almisreb
Nooritawati Md Tahir
author_sort Husein Fazlić
building UKM Institutional Repository
collection Online Access
description This study presents a deep learning-based solution for audio-visual speech recognition of Bosnian digits. Because no suitable Bosnian-language dataset existed, the study also outlines how a new dataset was built. The proposed solution has two components: visual speech recognition (lip reading) and audio speech recognition. For visual speech recognition, a combined CNN-RNN architecture was used, with an LSTM as the RNN component and two CNN variants, GoogLeNet and ResNet-50, compared as the CNN component. ResNet-50 achieved 72% accuracy and GoogLeNet achieved 63%. For audio speech recognition, an FFT was applied to the input speech signal to obtain spectrograms, which were then classified using a CNN; this component achieved 100% accuracy. The dataset was split into training, validation, and test sets containing 80%, 10%, and 10% of the data, respectively. Combining the predictions of the visual and audio models yielded 100% accuracy on the developed dataset. These findings suggest that deep learning-based methods are promising for audio-visual speech recognition of Bosnian digits, despite the limited availability of Bosnian-language datasets.
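The audio front end, the 80/10/10 split, and the combination of audio and visual predictions described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: this record does not state the frame length, hop size, window function, or how the two models' predictions were combined, so `frame_len`, `hop`, the Hann window, and the weighted-average rule in `fuse` are all assumptions chosen for illustration. The CNN and LSTM classifiers are omitted; `fuse` only takes the per-class probabilities such models might output.

```python
import cmath
import math
import random


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one row per windowed frame, frame_len // 2 + 1 bins.

    frame_len, hop, and the Hann window are illustrative assumptions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        # Keep only the non-negative frequencies of the real-valued input.
        rows.append([abs(c) for c in fft(frame)[: frame_len // 2 + 1]])
    return rows


def split_80_10_10(items, seed=0):
    """Shuffle and split into training (80%), validation (10%), and test (rest)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = (8 * len(items)) // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def fuse(audio_probs, visual_probs, audio_weight=0.5):
    """Hypothetical late fusion: weighted average of class probabilities.

    Returns the index of the most probable class after averaging.
    """
    fused = [audio_weight * a + (1.0 - audio_weight) * v
             for a, v in zip(audio_probs, visual_probs)]
    return max(range(len(fused)), key=fused.__getitem__)
```

Each row of the spectrogram would be one time step of the image passed to the audio classifier. With `audio_weight` raised towards 1.0, the fused decision follows the audio model more closely, which is one plausible way a combined system could match the accuracy of its stronger component.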
first_indexed 2025-11-15T01:03:20Z
format Article
id oai:generic.eprints.org:25132
institution Universiti Kebangsaan Malaysia
institution_category Local University
language English
last_indexed 2025-11-15T01:03:20Z
publishDate 2024
publisher Penerbit Universiti Kebangsaan Malaysia
recordtype eprints
repository_type Digital Repository
spelling oai:generic.eprints.org:25132 2025-05-26T07:51:25Z http://journalarticle.ukm.my/25132/ Deep learning-based audio-visual speech recognition for Bosnian digits. Husein Fazlić, Ali Abd Almisreb, Nooritawati Md Tahir. Penerbit Universiti Kebangsaan Malaysia 2024 Article PeerReviewed application/pdf en http://journalarticle.ukm.my/25132/1/14.pdf Husein Fazlić, Ali Abd Almisreb and Nooritawati Md Tahir (2024) Deep learning-based audio-visual speech recognition for Bosnian digits. Jurnal Kejuruteraan, 36 (1). pp. 147-154. ISSN 0128-0198 https://www.ukm.my/jkukm/volume-3601-2024
spellingShingle Husein Fazlić
Ali Abd Almisreb
Nooritawati Md Tahir
Deep learning-based audio-visual speech recognition for Bosnian digits
title Deep learning-based audio-visual speech recognition for Bosnian digits
title_full Deep learning-based audio-visual speech recognition for Bosnian digits
title_fullStr Deep learning-based audio-visual speech recognition for Bosnian digits
title_full_unstemmed Deep learning-based audio-visual speech recognition for Bosnian digits
title_short Deep learning-based audio-visual speech recognition for Bosnian digits
title_sort deep learning-based audio-visual speech recognition for bosnian digits
url http://journalarticle.ukm.my/25132/
http://journalarticle.ukm.my/25132/1/14.pdf