The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge
Authors:
Meng Ge,
Yizhou Peng,
Yidi Jiang,
Jingru Lin,
Junyi Ao,
Mehmet Sinan Yildirim,
Shuai Wang,
Haizhou Li,
Mengling Feng
Abstract:
This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for ICMC-ASR Challenge include the multi-channel front-end enhancement and diarization, training data augmentation, speech recognition modeling with multi-channel branches. Tested on the offical Eval1 and Eval2 set, our best system achieves…
▽ More
This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for ICMC-ASR Challenge include the multi-channel front-end enhancement and diarization, training data augmentation, speech recognition modeling with multi-channel branches. Tested on the offical Eval1 and Eval2 set, our best system achieves a relative 34.3% improvement in CER and 56.5% improvement in cpCER, compared to the offical baseline system.
△ Less
Submitted 26 December, 2023;
originally announced December 2023.
USED: Universal Speaker Extraction and Diarization
Authors:
Junyi Ao,
Mehmet Sinan Yıldırım,
Ruijie Tao,
Meng Ge,
Shuai Wang,
Yanmin Qian,
Haizhou Li
Abstract:
Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating `who spoke when'. Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to hav…
▽ More
Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating `who spoke when'. Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to have knowledge about `who spoke what and when', which is captured by the two tasks. The two tasks share a similar objective of disentangling speakers. Speaker extraction operates in the frequency domain, whereas diarization is in the temporal domain. It is logical to believe that speaker activities obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker activity detection than the speech mixture. In this paper, we propose a unified model called Universal Speaker Extraction and Diarization (USED) to address output inconsistency and scenario mismatch issues. It is designed to manage speech mixture with varying overlap ratios and variable number of speakers. We show that the USED model significantly outperforms the competitive baselines for speaker extraction and diarization tasks on LibriMix and SparseLibriMix datasets. We further validate the diarization performance on CALLHOME, a dataset based on real recordings, and experimental results indicate that our model surpasses recently proposed approaches.
△ Less
Submitted 9 May, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.