Search | arXiv e-print repository

Showing 1–28 of 28 results for author: Vu, N T

Searching in archive eess. Search in all archives.
  1. arXiv:2407.02937  [pdf, other]

    cs.CL cs.SD eess.AS

    Probing the Feasibility of Multilingual Speaker Anonymization

    Authors: Sarina Meyer, Florian Lux, Ngoc Thang Vu

    Abstract: In speaker anonymization, speech recordings are modified in a way that the identity of the speaker remains hidden. While this technology could help to protect the privacy of individuals around the globe, current research restricts this by focusing almost exclusively on English data. In this study, we extend a state-of-the-art anonymization system to nine languages by transforming language-dependen…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: accepted at Interspeech 2024

  2. arXiv:2406.06406  [pdf, other]

    cs.CL cs.SD eess.AS

    Controlling Emotion in Text-to-Speech with Natural Language Prompts

    Authors: Thomas Bott, Florian Lux, Ngoc Thang Vu

    Abstract: In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points wi…

    Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: accepted at Interspeech 2024

  3. arXiv:2406.06403  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Meta Learning Text-to-Speech Synthesis in over 7000 Languages

    Authors: Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

    Abstract: In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech syn…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: accepted at Interspeech 2024

  4. arXiv:2404.10922  [pdf, other]

    cs.CL cs.SD eess.AS

    Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness th…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: NAACL Findings 2024

  5. Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions

    Authors: Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu

    Abstract: Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intui…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: Published at ISCA Interspeech 2023 https://www.isca-speech.org/archive/interspeech_2023/lux23_interspeech.html

  6. arXiv:2310.17499  [pdf, other]

    cs.CL cs.LG eess.AS

    The IMS Toucan System for the Blizzard Challenge 2023

    Authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu

    Abstract: For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synt…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: Published at the Blizzard Challenge Workshop 2023, colocated with the Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023

  7. arXiv:2310.06103  [pdf, other]

    cs.CL cs.SD eess.AS

    Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models, however their evaluation often lacks multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2023

  8. VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

    Authors: Sarina Meyer, Xiaoxiao Miao, Ngoc Thang Vu

    Abstract: Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic is continually increasing. However, the comparison and combination of different anonymization approaches remains challenging due to the complexity…

    Submitted 21 December, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by OJSP-ICASSP 2024 https://ieeexplore.ieee.org/document/10365329

  9. arXiv:2304.04478  [pdf, other]

    cs.CL cs.SD eess.AS

    Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions

    Authors: Daniel Ortega, Chia-Yu Li, Ngoc Thang Vu

    Abstract: This paper presents our latest investigation on modeling backchannel in conversations. Motivated by a proactive backchanneling theory, we aim at developing a system which acts as a proactive listener by inserting backchannels, such as continuers and assessment, to influence speakers. Our model takes into account not only lexical and acoustic cues, but also introduces the simple and novel idea of u…

    Submitted 10 April, 2023; originally announced April 2023.

    Comments: Published in ICASSP 2020

  10. arXiv:2210.12223  [pdf, other]

    cs.CL cs.SD eess.AS

    Low-Resource Multilingual and Zero-Shot Multispeaker TTS

    Authors: Florian Lux, Julia Koch, Ngoc Thang Vu

    Abstract: While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language agnosti…

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: Accepted to AACL 2022

  11. arXiv:2210.11642  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Semi-supervised End-to-end Automatic Speech Recognition using CycleGAN and Inter-domain Losses

    Authors: Chia-Yu Li, Ngoc Thang Vu

    Abstract: We propose a novel method that combines CycleGAN and inter-domain losses for semi-supervised end-to-end automatic speech recognition. Inter-domain loss targets the extraction of an intermediate shared representation of speech and text inputs using a shared network. CycleGAN uses cycle-consistent loss and the identity mapping loss to preserve relevant characteristics of the input feature after conv…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: 6 pages + 2 references, 6 figures, accepted by SLT2022

  12. arXiv:2210.07002  [pdf, other]

    cs.SD cs.CL eess.AS

    Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy

    Authors: Sarina Meyer, Pascal Tilli, Pavel Denisov, Florian Lux, Julia Koch, Ngoc Thang Vu

    Abstract: In order to protect the privacy of speech data, speaker anonymization aims for hiding the identity of a speaker by changing the voice in speech recordings. This typically comes with a privacy-utility trade-off between protection of individuals and usability of the data for downstream applications. One of the challenges in this context is to create non-existent voices that sound as natural as possi…

    Submitted 20 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: IEEE Spoken Language Technology Workshop 2022

  13. arXiv:2207.05549  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PoeticTTS -- Controllable Poetry Reading for Literary Studies

    Authors: Julia Koch, Florian Lux, Nadja Schauffler, Toni Bernhart, Felix Dieterle, Jonas Kuhn, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu

    Abstract: Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human like naturalness in order to enable literary scholars to systematically examine hypotheses on the interplay between text, spoken realisation, and the listener's perception of poems. To meet these special requirements fo…

    Submitted 18 October, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: Presented at Interspeech 2022

  14. arXiv:2207.04834  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Speaker Anonymization with Phonetic Intermediate Representations

    Authors: Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli, Ngoc Thang Vu

    Abstract: In this work, we propose a speaker anonymization pipeline that leverages high quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings. Using phones as the intermediate representation ensures near complete elimination of speaker identity information from the input while preserving the original phonetic co…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  15. arXiv:2206.12229  [pdf, other]

    cs.SD cs.CL eess.AS

    Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

    Authors: Florian Lux, Julia Koch, Ngoc Thang Vu

    Abstract: The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance level normalization in conjunction with an utterance level spe…

    Submitted 21 October, 2022; v1 submitted 24 June, 2022; originally announced June 2022.

    Comments: Accepted to IEEE SLT 2022

  16. arXiv:2203.03191  [pdf, other]

    cs.CL cs.LG eess.AS

    Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features

    Authors: Florian Lux, Ngoc Thang Vu

    Abstract: While neural text-to-speech systems perform remarkably well in high-resource scenarios, they cannot be applied to the majority of the over 6,000 spoken languages in the world due to a lack of appropriate training data. In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages.…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: Accepted for the ACL 2022 main conference

  17. arXiv:2112.10202  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching

    Authors: Chia-Yu Li, Ngoc Thang Vu

    Abstract: Code-Switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech. We analyse different CS specific issues such as the properties mismatches between languages in a CS language pair, the unpredictable nature of switching…

    Submitted 19 December, 2021; originally announced December 2021.

    Comments: The 2019 International Conference on Asian Language Processing (IALP)

  18. arXiv:2112.10108  [pdf, other]

    cs.CL cs.LG eess.AS

    Investigation of Densely Connected Convolutional Networks with Domain Adversarial Learning for Noise Robust Speech Recognition

    Authors: Chia Yu Li, Ngoc Thang Vu

    Abstract: We investigate densely connected convolutional networks (DenseNets) and their extension with domain adversarial training for noise robust speech recognition. DenseNets are very deep, compact convolutional neural networks which have demonstrated incredible improvements over the state-of-the-art results in computer vision. Our experimental results reveal that DenseNets are more robust against noise…

    Submitted 19 December, 2021; originally announced December 2021.

    Comments: 7 pages, 5 figures, The 30th Conference on Electronic Speech Signal Processing (ESSV2019)

  19. arXiv:2112.06309  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN

    Authors: Chia-Yu Li, Ngoc Thang Vu

    Abstract: This paper presents our latest investigations on improving automatic speech recognition for noisy speech via speech enhancement. We propose a novel method named Multi-discriminators CycleGAN to reduce noise of input speech and therefore improve the automatic speech recognition performance. Our proposed method leverages the CycleGAN framework for speech enhancement without any parallel data and imp…

    Submitted 12 December, 2021; originally announced December 2021.

    Comments: 6 pages, 9 figures, ASRU 2021

  20. arXiv:2111.14706  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

    Authors: Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

    Abstract: As Automatic Speech Recognition (ASR) systems are getting better, there is an increasing interest in using the ASR output to do downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open source standard that can b…

    Submitted 3 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

    Comments: Accepted at ICASSP 2022 (5 pages)

  21. arXiv:2106.16055  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    IMS' Systems for the IWSLT 2021 Low-Resource Speech Translation Task

    Authors: Pavel Denisov, Manuel Mager, Ngoc Thang Vu

    Abstract: This paper describes the submission to the IWSLT 2021 Low-Resource Speech Translation Shared Task by IMS team. We utilize state-of-the-art models combined with several data augmentation, multi-task and transfer learning approaches for the automatic speech recognition (ASR) and machine translation (MT) steps of our cascaded system. Moreover, we also explore the feasibility of a full end-to-end spee…

    Submitted 30 June, 2021; originally announced June 2021.

    Comments: IWSLT 2021

  22. arXiv:2103.01894  [pdf, other]

    cs.SD cs.CL eess.AS

    Investigations on Audiovisual Emotion Recognition in Noisy Conditions

    Authors: Michael Neumann, Ngoc Thang Vu

    Abstract: In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features. We attempt to answer the following research questions: (i) How does speech emotion recognition perform on noisy data? and (ii) To what extent does a multimodal approach improve the accuracy and compensate for potential performance degradation at different noise levels? We prese…

    Submitted 2 March, 2021; originally announced March 2021.

    Comments: Published at the IEEE workshop on Spoken Language Technology (SLT) 2021

  23. arXiv:2102.12624  [pdf, other]

    eess.AS cs.SD

    Meta-Learning for improving rare word recognition in end-to-end ASR

    Authors: Florian Lux, Ngoc Thang Vu

    Abstract: We propose a new method of generating meaningful embeddings for speech, changes to four commonly used meta learning approaches to enable them to perform keyword spotting in continuous signals and an approach of combining their outcomes into an end-to-end automatic speech recognition system to improve rare word recognition. We verify the functionality of each of our three contributions in two exper…

    Submitted 24 February, 2021; originally announced February 2021.

    Comments: Revised version to be published in the proceedings of ICASSP 2021

  24. arXiv:2102.10663  [pdf, other]

    eess.IV cs.CV cs.LG

    MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation

    Authors: Yen Nhi Truong Vu, Richard Wang, Niranjan Balachandar, Can Liu, Andrew Y. Ng, Pranav Rajpurkar

    Abstract: Self-supervised contrastive learning between pairs of multiple views of the same image has been shown to successfully leverage unlabeled data to produce meaningful visual representations for both natural and medical images. However, there has been limited work on determining how to select pairs for medical images, where availability of patient metadata can be leveraged to improve representations.…

    Submitted 17 October, 2021; v1 submitted 21 February, 2021; originally announced February 2021.

  25. arXiv:2007.01836  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. These components are optimized independently to allow usage of available data, but the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings to process acoustic feat…

    Submitted 11 August, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

    Comments: Interspeech 2020

  26. arXiv:1908.04743  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    IMS-Speech: A Speech to Text Tool

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: We present the IMS-Speech, a web based tool for German and English speech transcription aiming to facilitate research in various disciplines which require accesses to lexical information in spoken language materials. This tool is based on modern open source software stack, advanced speech recognition methods and public data resources and is freely available for academic researchers. The utilized m…

    Submitted 13 August, 2019; originally announced August 2019.

    Comments: ESSV 2019

  27. arXiv:1908.04737  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from clean speech. This proposed framework does not require any parallel non-overlapped speech materials and is independent of the number of speakers. Our experimenta…

    Submitted 13 August, 2019; originally announced August 2019.

    Comments: Interspeech 2019

  28. arXiv:1807.11284  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Unsupervised Domain Adaptation by Adversarial Learning for Robust Speech Recognition

    Authors: Pavel Denisov, Ngoc Thang Vu, Marc Ferras Font

    Abstract: In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically, single microphone far-field speech. We adapt neural networks based acoustic models trained with close-talk clean speech to the new recording conditions using untranscribed adaptation data. Our experimental results on Italian SPEECON data set show that our pro…

    Submitted 30 July, 2018; originally announced July 2018.

    Comments: 5 pages, 2 figures, the 13th ITG conference on Speech Communication