Speech features analysis of the joint speech separation and automatic speech recognition model / Tawseef Khan

Bibliographic Details
Main Author: Tawseef, Khan
Format: Thesis
Published: 2021
Online Access:http://studentsrepo.um.edu.my/12942/
http://studentsrepo.um.edu.my/12942/1/Tawseef_Khan.pdf
http://studentsrepo.um.edu.my/12942/2/Tawseef_Khan.pdf
Description
Summary: Speech recognition of a target speaker from a single-channel mixture containing voiced noise from interfering speakers is a complex task. This is because the speech signal patterns of the target and interfering speakers are similar and can be challenging to distinguish from one another. If the target speaker’s speech can be correctly identified, such a system can be used in interviews, courtrooms, transcribing video subtitles, etc. During conversations between multiple speakers, it is common for voices to overlap. In such cases, it is important to separate the speech of the target speaker from a single audio signal. To date, automatic speech recognition (ASR) models are good at recognizing lexical data in white/background noise, but they are unable to perform well with other voiced noises. Recently, a joint speech separation and ASR model was proposed that handles both speech separation and recognition in one component in an end-to-end fashion. Two key factors affecting the accuracy of ASR models are the type of features used to build the model and the signal-to-noise ratio (SNR) of the target signal. This research compares different features to find the optimum features for the joint speech separation and ASR model at different SNR levels. Ten features that were used for speech separation of voiced noise in previous studies were evaluated: STFT, LOG-POW, LOG-MEL, LOG-MAG, GF, GFCC, MFCC, PNCC, RASTA-PLP (Relative Spectral - Perceptual Linear Prediction), and AMS. The accuracy of the model was tested at SNR levels of -10, -5, 0, 5, and 10 dB. The experiment evaluates the Word Error Rate (WER) of speech separation and ASR separately within the joint speech separation and ASR model. At SNR level -10 dB, GF and GFCC were found to have the lowest WER. For SNR levels -5, 0, 5, and 10 dB, the lowest WER was achieved by GF, PNCC, STFT, and GF, respectively.
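To illustrate the mixing setup the summary describes, the following is a minimal sketch (not the thesis's code) of scaling an interfering-speaker signal so that the single-channel mixture has a chosen SNR in dB; the helper name `mix_at_snr` and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio equals
    `snr_db`, then return the single-channel mixture (hypothetical helper)."""
    # Truncate both signals to a common length.
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    # Average power of each signal.
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    # Solve 10*log10(p_target / (gain**2 * p_interf)) = snr_db for the gain.
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + gain * interferer

# Example: build mixtures at the SNR levels used in the study.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)      # placeholder 1-second signal at 16 kHz
interferer = rng.standard_normal(16000)  # placeholder interfering speech
mixtures = {snr: mix_at_snr(target, interferer, snr) for snr in (-10, -5, 0, 5, 10)}
```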
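Several of the listed features have standard implementations in the librosa library (STFT, log-magnitude, LOG-MEL, MFCC); the sketch below shows how they are commonly extracted. The file path, sample rate, and window parameters are placeholders, not values taken from the thesis, and features such as GF, GFCC, PNCC, RASTA-PLP, and AMS typically require other toolkits.

```python
import numpy as np
import librosa

# Load a mixture waveform (path is a placeholder).
y, sr = librosa.load("mixture.wav", sr=16000)

# STFT magnitude and log-magnitude (LOG-MAG) features.
stft_mag = np.abs(librosa.stft(y, n_fft=512, hop_length=160))
log_mag = np.log(stft_mag + 1e-8)  # small offset avoids log(0)

# Log-mel spectrogram (LOG-MEL) and MFCCs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```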
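The WER metric used to score the model is conventionally computed as the word-level Levenshtein distance between the reference and hypothesis transcripts, normalized by the reference length. Below is a minimal sketch of that standard definition; it is not the thesis's exact scoring script.

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```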