(Translated by https://www.hiragana.jp/)
Search | arXiv e-print repository
Skip to main content

Showing 1–29 of 29 results for author: Weiss, J

Searching in archive eess. Search in all archives.
.
  1. arXiv:2407.09434  [pdf, other

    cs.LG cs.AI cs.CE eess.SY

    A Perspective on Foundation Models for the Electric Power Grid

    Authors: Hendrik F. Hamann, Thomas Brunschwiler, Blazhe Gjorgiev, Leonardo S. A. Martins, Alban Puech, Anna Varbella, Jonas Weiss, Juan Bernabe-Moreno, Alexandre Blondin Massé, Seong Choi, Ian Foster, Bri-Mathias Hodge, Rishabh Jain, Kibaek Kim, Vincent Mai, François Mirallès, Martin De Montigny, Octavio Ramos-Leaños, Hussein Suprême, Le Xie, El-Nasser S. Youssef, Arnaud Zinflou, Alexander J. Belvi, Ricardo J. Bessa, Bishnu Prasad Bhattari , et al. (2 additional authors not shown)

    Abstract: Foundation models (FMs) currently dominate news headlines. They employ advanced deep learning architectures to extract structural information autonomously from vast datasets through self-supervision. The resulting rich representations of complex systems and dynamics can be applied to many downstream applications. Therefore, FMs can find uses in electric power grids, challenged by the energy transi… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: Lead contact: H.F.H.; Major equal contributors: H.F.H., T.B., B.G., L.S.A.M., A.P., A.V., J.W.; Significant equal contributors: J.B., A.B.M., S.C., I.F., B.H., R.J., K.K., V.M., F.M., M.D.M., O.R., H.S., L.X., E.S.Y., A.Z.; Other equal contributors: A.J.B., R.J.B., B.P.B., J.S., S.S

  2. arXiv:2406.00125  [pdf

    eess.IV cs.CV cs.LG

    TotalVibeSegmentator: Full Torso Segmentation for the NAKO and UK Biobank in Volumetric Interpolated Breath-hold Examination Body Images

    Authors: Robert Graf, Paul-Sören Platzek, Evamaria Olga Riedel, Constanze Ramschütz, Sophie Starck, Hendrik Kristian Möller, Matan Atad, Henry Völzke, Robin Bülow, Carsten Oliver Schmidt, Julia Rüdebusch, Matthias Jung, Marco Reisert, Jakob Weiss, Maximilian Löffler, Fabian Bamberg, Bene Wiestler, Johannes C. Paetzold, Daniel Rueckert, Jan Stefan Kirschke

    Abstract: Objectives: To present a publicly available torso segmentation network for large epidemiology datasets on volumetric interpolated breath-hold examination (VIBE) images. Materials & Methods: We extracted preliminary segmentations from TotalSegmentator, spine, and body composition networks for VIBE images, then improved them iteratively and retrained a nnUNet network. Using subsets of NAKO (85 subje… ▽ More

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: https://github.com/robert-graf/TotalVibeSegmentator

  3. arXiv:2405.19492  [pdf

    eess.IV cs.CV

    TotalSegmentator MRI: Sequence-Independent Segmentation of 59 Anatomical Structures in MR images

    Authors: Tugba Akinci D'Antonoli, Lucas K. Berger, Ashraya K. Indrakanti, Nathan Vishwanathan, Jakob Weiß, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner, Alexandra Walter, Elmar M. Merkle, Martin Segeroth, Joshy Cyriac, Shan Yang, Jakob Wasserthal

    Abstract: Purpose: To develop an open-source and easy-to-use segmentation model that can automatically and robustly segment most major anatomical structures in MR images independently of the MR sequence. Materials and Methods: In this study we extended the capabilities of TotalSegmentator to MR images. 298 MR scans and 227 CT scans were used to segment 59 anatomical structures (20 organs, 18 bones, 11 mus… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

  4. arXiv:2210.10879  [pdf, other

    cs.LG cs.CL cs.SD eess.AS

    G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

    Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

    Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as… ▽ More

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 6 pages, accepted at SLT 2022. Updated with copyright

  5. arXiv:2205.11096  [pdf, other

    eess.IV cs.CV cs.LG

    FedNorm: Modality-Based Normalization in Federated Learning for Multi-Modal Liver Segmentation

    Authors: Tobias Bernecker, Annette Peters, Christopher L. Schlett, Fabian Bamberg, Fabian Theis, Daniel Rueckert, Jakob Weiß, Shadi Albarqouni

    Abstract: Given the high incidence and effective treatment options for liver diseases, they are of great socioeconomic importance. One of the most common methods for analyzing CT and MRI images for diagnosis and follow-up treatment is liver segmentation. Recent advances in deep learning have demonstrated encouraging results for automatic liver segmentation. Despite this, their success depends primarily on t… ▽ More

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: Under Review

  6. arXiv:2106.09660  [pdf, ps, other

    eess.AS cs.LG cs.SD

    WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

    Abstract: This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contrasts to the original WaveGrad vocoder which conditi… ▽ More

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Proceedings of INTERSPEECH

  7. arXiv:2106.00847  [pdf, other

    eess.AS cs.SD

    Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

    Authors: Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

    Abstract: Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. F… ▽ More

    Submitted 16 October, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure. WASPAA 2021

  8. arXiv:2011.03568  [pdf, other

    cs.CL cs.SD eess.AS

    Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

    Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

    Abstract: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within… ▽ More

    Submitted 5 February, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: 6 pages including supplement, 3 figures. accepted to ICASSP 2021

  9. arXiv:2010.11439  [pdf, other

    cs.SD eess.AS

    Parallel Tacotron: Non-Autoregressive and Controllable TTS

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

    Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called \emph{Parallel Tacotron}, is highly parallelizable during both training and inference, a… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

  10. arXiv:2009.00713  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    WaveGrad: Estimating Gradients for Waveform Generation

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan

    Abstract: This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade infere… ▽ More

    Submitted 9 October, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

  11. arXiv:2006.12701  [pdf, other

    eess.AS cs.LG cs.SD

    Unsupervised Sound Separation Using Mixture Invariant Training

    Authors: Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

    Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon… ▽ More

    Submitted 23 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Accepted for spotlight presentation at NeurIPS 2020

  12. arXiv:2002.03788  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

    Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech,… ▽ More

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  13. arXiv:2002.03785  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

    Abstract: This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with a… ▽ More

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: to appear in ICASSP 2020

  14. arXiv:1911.13213  [pdf

    cs.LG eess.SP stat.ML

    DeStress: Deep Learning for Unsupervised Identification of Mental Stress in Firefighters from Heart-rate Variability (HRV) Data

    Authors: Ali Oskooei, Sophie Mai Chau, Jonas Weiss, Arvind Sridhar, María Rodríguez Martínez, Bruno Michel

    Abstract: In this work we perform a study of various unsupervised methods to identify mental stress in firefighter trainees based on unlabeled heart rate variability data. We collect RR interval time series data from nearly 100 firefighter trainees that participated in a drill. We explore and compare three methods in order to perform unsupervised stress detection: 1) traditional K-Means clustering with engi… ▽ More

    Submitted 18 November, 2019; originally announced November 2019.

  15. arXiv:1907.04448  [pdf, other

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related… ▽ More

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  16. arXiv:1904.06037  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Direct speech-to-speech translation with a sequence-to-sequence model

    Authors: Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

    Abstract: We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice).… ▽ More

    Submitted 25 June, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019

  17. arXiv:1904.04169  [pdf, other

    eess.AS cs.SD

    Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

    Authors: Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia

    Abstract: We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speec… ▽ More

    Submitted 29 October, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  18. arXiv:1904.02882  [pdf, other

    cs.SD eess.AS

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    Authors: Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu

    Abstract: This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than… ▽ More

    Submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted for Interspeech 2019, 7 pages

  19. arXiv:1902.07178  [pdf, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    A spelling correction model for end-to-end speech recognition

    Authors: Jinxi Guo, Tara N. Sainath, Ron J. Weiss

    Abstract: Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is only trained on transcribed audio-text pairs, which leads to performance degradation especially on rare words. While th… ▽ More

    Submitted 19 February, 2019; originally announced February 2019.

    Comments: Accepted to ICASSP 2019

  20. arXiv:1901.08810  [pdf, other

    cs.LG eess.AS stat.ML

    Unsupervised speech representation learning using WaveNet autoencoders

    Authors: Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord

    Abstract: We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g.\ phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or backgroun… ▽ More

    Submitted 11 September, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: Accepted to IEEE TASLP, final version available at http://dx.doi.org/10.1109/TASLP.2019.2938863

  21. arXiv:1811.02050  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

    Authors: Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, Yonghui Wu

    Abstract: End-to-end Speech Translation (ST) models have many potential advantages when compared to the cascade of Automatic Speech Recognition (ASR) and text Machine Translation (MT) models, including lowered inference latency and the avoidance of error compounding. However, the quality of end-to-end ST is often limited by a paucity of training data, since it is difficult to collect large parallel corpora… ▽ More

    Submitted 10 February, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: ICASSP 2019

  22. arXiv:1810.07217  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Hierarchical Generative Modeling for Controllable Speech Synthesis

    Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

    Abstract: This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarch… ▽ More

    Submitted 27 December, 2018; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: 27 pages, accepted to ICLR 2019

  23. arXiv:1810.04826  [pdf, other

    eess.AS cs.LG eess.SP stat.ML

    VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

    Authors: Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

    Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embe… ▽ More

    Submitted 19 June, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

    Comments: To appear in Interspeech 2019

  24. arXiv:1806.08002  [pdf, other

    cs.SD eess.AS

    Synthesizing Diverse, High-Quality Audio Textures

    Authors: Joseph Antognini, Matt Hoffman, Ron J. Weiss

    Abstract: Texture synthesis techniques based on matching the Gram matrix of feature activations in neural networks have achieved spectacular success in the image domain. In this paper we extend these techniques to the audio domain. We demonstrate that synthesizing diverse audio textures is challenging, and argue that this is because audio data is relatively low-dimensional. We therefore introduce two new te… ▽ More

    Submitted 20 June, 2018; originally announced June 2018.

    Comments: 10 pages, submitted to TASLP

  25. arXiv:1806.04558  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

    Authors: Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

    Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers… ▽ More

    Submitted 2 January, 2019; v1 submitted 12 June, 2018; originally announced June 2018.

    Comments: NeurIPS 2018

    Journal ref: Advances in Neural Information Processing Systems 31 (2018), 4485-4495

  26. arXiv:1803.09047  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

    Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synth… ▽ More

    Submitted 23 March, 2018; originally announced March 2018.

  27. arXiv:1712.08363  [pdf, other

    cs.SD eess.AS stat.ML

    On Using Backpropagation for Speech Texture Generation and Voice Conversion

    Authors: Jan Chorowski, Ron J. Weiss, Rif A. Saurous, Samy Bengio

    Abstract: Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and t… ▽ More

    Submitted 8 March, 2018; v1 submitted 22 December, 2017; originally announced December 2017.

    Comments: Accepted to ICASSP 2018

  28. arXiv:1712.01769  [pdf, other

    cs.CL cs.SD eess.AS stat.ML

    State-of-the-art Speech Recognition With Sequence-to-Sequence Models

    Authors: Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

    Abstract: Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-theart ASR systems on dictation tasks, but it was not clear if such archite… ▽ More

    Submitted 23 February, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

    Comments: ICASSP camera-ready version

  29. arXiv:1711.01694  [pdf, other

    eess.AS cs.AI cs.CL

    Multilingual Speech Recognition With A Single End-To-End Model

    Authors: Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, Kanishka Rao

    Abstract: Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we presen… ▽ More

    Submitted 15 February, 2018; v1 submitted 5 November, 2017; originally announced November 2017.

    Comments: Accepted in ICASSP 2018