Minimum Word Error Rate Training for Speech Separation
MetadataShow full item record
- Master's theses (TN-IDE) 
The cocktail party problem, also known as a single-channel multi-talker problem, is a significant challenge to enhance the performance of automatic speech recognition (ASR) systems. Most existing speech separation model only concerns the signal-level performance, i.e., source-to-distortion ratio (SDR), via their cost/loss function, not a transcription-level performance. However, transcription-level measurement, such as word error rate (WER) is the ultimate measurement that can be used in the performance of ASR. Therefore we propose a new loss function that can directly consider both signal and transcription level performance with integrating both speech separation and speech recognition system. Moreover, we suggest the generalized integration architecture that can be applied to any combination of speech recognition/separation system regardless of their system environment. In this thesis, first, we review the techniques from the primary signal processing knowledge to deep learning techniques and introduce the detailed target and challenge in speech separation problem. Moreover, we analyze the several famous speech separation models derived from a deep learning approach. Then we introduce the new loss function with our detailed system architecture, including the step-by-step process from pre-processing to evaluation. We improve the performance of the existing model using our training approach. Our solution enhances average SDR from 0.10dB to 4.09dB and average WER from 92.7% to 55.7% using LibriSpeech dataset.
Master's thesis in Computer science