Deep learning-based audio-visual speech recognition for Bosnian digits
This study presents a deep learning-based solution for audio-visual speech recognition of Bosnian digits. The task was challenging because no suitable Bosnian-language dataset existed, so the study also describes how a new dataset was built. The proposed solution has two components: visual speech recognition, i.e. lip reading, and audio speech recognition. For visual speech recognition, a combined CNN-RNN architecture was used, with two CNN variants, GoogLeNet and ResNet-50, compared on performance; ResNet-50 achieved 72% accuracy and GoogLeNet 63%. The RNN component was an LSTM. For audio speech recognition, the FFT was applied to obtain spectrograms from the input speech signal, which were then classified with a CNN; this component achieved 100% accuracy. The dataset was split into training, validation, and test sets with 80%, 10%, and 10% of the data, respectively. Combining the predictions of the visual and audio models yielded 100% accuracy on the developed dataset. These findings demonstrate that deep learning-based methods are promising for audio-visual speech recognition of Bosnian digits, despite the scarcity of Bosnian-language datasets.
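The visual branch described above (a CNN front end feeding an RNN) can be illustrated with a short sketch. This is not the authors' implementation, only a minimal PyTorch example in which a ResNet-50 backbone extracts per-frame features from mouth-region crops and a single-layer LSTM classifies the frame sequence into one of the ten digits; the 224x224 input size, 256-unit hidden state, random (non-pretrained) weights, and ten-class output are illustrative assumptions.

```python
# Minimal sketch of a CNN-RNN lip-reading classifier (illustrative, not the paper's code).
import torch
import torch.nn as nn
from torchvision import models

class LipReadingCNNRNN(nn.Module):
    def __init__(self, num_classes=10, hidden_size=256):
        super().__init__()
        # weights=None keeps the example self-contained; pretrained ImageNet
        # weights are a common choice for this kind of backbone.
        resnet = models.resnet50(weights=None)
        # Drop the final fully connected layer; keep the 2048-dim pooled features.
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.rnn = nn.LSTM(input_size=2048, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 3, 224, 224) mouth-region crops
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w))   # (b*t, 2048, 1, 1)
        feats = feats.view(b, t, -1)                    # (b, t, 2048)
        _, (h_n, _) = self.rnn(feats)                   # last hidden state
        return self.classifier(h_n[-1])                 # (b, num_classes)

# Example: one clip of 8 frames -> digit logits
logits = LipReadingCNNRNN()(torch.randn(1, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```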
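The audio branch and the combination step can be sketched in the same spirit. This is an assumption-laden illustration, not the paper's pipeline: a magnitude spectrogram is computed with the FFT over short windowed frames and classified by a small CNN, and the class probabilities of the two branches are averaged as one simple late-fusion scheme (the abstract does not specify the fusion rule); the frame length, hop size, and network layout are illustrative.

```python
# Minimal sketch of FFT spectrogram -> CNN classification with late fusion (illustrative).
import numpy as np
import torch
import torch.nn as nn

def spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude spectrogram via short-time FFT over Hann-windowed frames."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))   # (time, freq)
    return np.log1p(spec).T.astype(np.float32)             # (freq, time)

class AudioCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, spec):                 # spec: (batch, 1, freq, time)
        return self.classifier(self.features(spec).flatten(1))

# Example: 1 s of 16 kHz audio -> spectrogram -> digit probabilities -> fused decision
spec = torch.from_numpy(spectrogram(np.random.randn(16000))).unsqueeze(0).unsqueeze(0)
audio_probs = AudioCNN()(spec).softmax(dim=-1)
visual_probs = torch.rand(1, 10).softmax(dim=-1)   # stand-in for the visual branch
fused_digit = ((audio_probs + visual_probs) / 2).argmax(dim=-1)
print(fused_digit)
```
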
| Main Authors: | Husein Fazlić, Ali Abd Almisreb, Nooritawati Md Tahir |
|---|---|
| Format: | Article (peer reviewed) |
| Language: | English |
| Published: | Penerbit Universiti Kebangsaan Malaysia, 2024 |
| Online Access: | http://journalarticle.ukm.my/25132/ http://journalarticle.ukm.my/25132/1/14.pdf |
| Repository: | UKM Institutional Repository |
| Institution: | Universiti Kebangsaan Malaysia (Local University) |
| Citation: | Husein Fazlić, Ali Abd Almisreb and Nooritawati Md Tahir (2024). Deep learning-based audio-visual speech recognition for Bosnian digits. Jurnal Kejuruteraan, 36 (1), pp. 147-154. ISSN 0128-0198. https://www.ukm.my/jkukm/volume-3601-2024 |