-
Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024
Authors:
Sai Koneru,
Thai-Binh Nguyen,
Ngoc-Quan Pham,
Danni Liu,
Zhaolin Li,
Alexander Waibel,
Jan Niehues
Abstract:
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we inte…
▽ More
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7B\footnote{mistralai/Mistral-7B-Instruct-v0.1} into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3\%$ in Word Error Rate and $0.65\%$ in COMET for tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Authors:
Tu Anh Dinh,
Carlos Mullov,
Leonard Bärmann,
Zhaolin Li,
Danni Liu,
Simon Reiß,
Jueun Lee,
Nathan Lerzer,
Fabian Ternava,
Jianfeng Gao,
Tobias Röddiger,
Alexander Waibel,
Tamim Asfour,
Michael Beigl,
Rainer Stiefelhagen,
Carsten Dachsbacher,
Klemens Böhm,
Jan Niehues
Abstract:
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx -…
▽ More
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions, to evaluate LLMs ability on solving scientific tasks. SciEx is (1) multilingual, containing both English and German exams, and (2) multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are freeform, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4\% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.
△ Less
Submitted 12 July, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
Authors:
Matthias Sperber,
Ondřej Bojar,
Barry Haddow,
Dávid Javorský,
Xutai Ma,
Matteo Negri,
Jan Niehues,
Peter Polák,
Elizabeth Salesky,
Katsuhito Sudoh,
Marco Turchi
Abstract:
Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation…
▽ More
Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and scores well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET as a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step systems. We release the collected human-annotated data in order to encourage further investigation.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation
Authors:
Tu Anh Dinh,
Tobias Palzer,
Jan Niehues
Abstract:
Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors. Measuring the performance of model-specific QE is n…
▽ More
Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors. Measuring the performance of model-specific QE is not straightforward, since they provide quality scores on their own MT output, thus cannot be evaluated using benchmark QE test sets containing human quality scores on premade MT output. Therefore, we propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones. We are the first to conduct detailed analyses and conclude that this automatic method is sufficient, and the reference-based MetricX-23 is best for the task.
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Language-Independent Representations Improve Zero-Shot Summarization
Authors:
Vladimir Solovyev,
Danni Liu,
Jan Niehues
Abstract:
Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions. In this work, we focus on summarization and tackle the problem through the lens of language-independent representations. After training on monolingual summarization, we perform zero-shot transfer to new languages or language pairs. We first show naively finetuned models are h…
▽ More
Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions. In this work, we focus on summarization and tackle the problem through the lens of language-independent representations. After training on monolingual summarization, we perform zero-shot transfer to new languages or language pairs. We first show naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance. Next, we propose query-key (QK) finetuning to decouple task-specific knowledge from the pretrained language generation abilities. Then, after showing downsides of the standard adversarial language classifier, we propose a balanced variant that more directly enforces language-agnostic representations. Moreover, our qualitative analyses show removing source language identity correlates to zero-shot summarization performance. Our code is openly available.
△ Less
Submitted 8 April, 2024;
originally announced April 2024.
-
Resonant Solitary States in Complex Networks
Authors:
Jakob Niehues,
Serhiy Yanchuk,
Rico Berner,
Jürgen Kurths,
Frank Hellmann,
Mehrnaz Anvari
Abstract:
Partially synchronized solitary states occur frequently when a synchronized system of networked oscillators is perturbed locally. Several asymptotic states of different frequencies can coexist at the same node. Here, we reveal the mechanism behind this multistability: additional solitary frequencies arise from the coupling between network modes and the solitary oscillator's frequency, leading to s…
▽ More
Partially synchronized solitary states occur frequently when a synchronized system of networked oscillators is perturbed locally. Several asymptotic states of different frequencies can coexist at the same node. Here, we reveal the mechanism behind this multistability: additional solitary frequencies arise from the coupling between network modes and the solitary oscillator's frequency, leading to significant energy transfer. This can cause the solitary node's frequency to resonate with a Laplacian eigenvalue. We analyze which network structures enable this resonance and explain longstanding numerical observations. Another solitary state is characterized by the effective decoupling of the synchronized network and the solitary node at the natural frequency. Our framework unifies the description of solitary states near and far from resonance, allowing to predict the behavior of complex networks.
△ Less
Submitted 12 July, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing
Authors:
Sai Koneru,
Miriam Exel,
Matthias Huck,
Jan Niehues
Abstract:
Large Language Models (LLM's) have demonstrated considerable success in various Natural Language Processing tasks, but they have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigat…
▽ More
Large Language Models (LLM's) have demonstrated considerable success in various Natural Language Processing tasks, but they have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigate using LLM's for MT and explore recent parameter-efficient fine-tuning techniques. Surprisingly, our initial experiments find that fine-tuning for translation purposes even led to performance degradation. To overcome this, we propose an alternative approach: adapting LLM's as Automatic Post-Editors (APE) rather than direct translators. Building on the LLM's exceptional ability to process and generate lengthy sequences, we also propose extending our approach to document-level translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can yield significant improvements across both sentence and document-level metrics while generalizing to out-of-domain data. Most notably, we achieve a state-of-the-art accuracy rate of 89\% on the ContraPro test set, which specifically assesses the model's ability to resolve pronoun ambiguities when translating from English to German. Lastly, we investigate a practical scenario involving manual post-editing for document-level translation, where reference context is made available. Here, we demonstrate that leveraging human corrections can significantly reduce the number of edits required for subsequent translations (Interactive Demo for integrating manual feedback can be found here: https://huggingface.co/spaces/skoneru/contextual_refinement_ende).
△ Less
Submitted 18 March, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Audience-specific Explanations for Machine Translation
Authors:
Renhan Lou,
Jan Niehues
Abstract:
In machine translation, a common problem is that the translation of certain words even if translated can cause incomprehension of the target language audience due to different cultural backgrounds. A solution to solve this problem is to add explanations for these words. In a first step, we therefore need to identify these words or phrases. In this work we explore techniques to extract example expl…
▽ More
In machine translation, a common problem is that the translation of certain words even if translated can cause incomprehension of the target language audience due to different cultural backgrounds. A solution to solve this problem is to add explanations for these words. In a first step, we therefore need to identify these words or phrases. In this work we explore techniques to extract example explanations from a parallel corpus. However, the sparsity of sentences containing words that need to be explained makes building the training dataset extremely difficult. In this work, we propose a semi-automatic technique to extract these explanations from a large parallel corpus. Experiments on English->German language pair show that our method is able to extract sentence so that more than 10% of the sentences contain explanation, while only 1.9% of the original sentences contain explanations. In addition, experiments on English->French and English->Chinese language pairs also show similar conclusions. This is therefore an essential first automatic step to create a explanation dataset. Furthermore we show that the technique is robust for all three language pairs.
△ Less
Submitted 22 September, 2023;
originally announced September 2023.
-
How Transferable are Attribute Controllers on Pretrained Multilingual Translation Models?
Authors:
Danni Liu,
Jan Niehues
Abstract:
Customizing machine translation models to comply with desired attributes (e.g., formality or grammatical gender) is a well-studied topic. However, most current approaches rely on (semi-)supervised data with attribute annotations. This data scarcity bottlenecks democratizing such customization possibilities to a wider range of languages, particularly lower-resource ones. This gap is out of sync wit…
▽ More
Customizing machine translation models to comply with desired attributes (e.g., formality or grammatical gender) is a well-studied topic. However, most current approaches rely on (semi-)supervised data with attribute annotations. This data scarcity bottlenecks democratizing such customization possibilities to a wider range of languages, particularly lower-resource ones. This gap is out of sync with recent progress in pretrained massively multilingual translation models. In response, we transfer the attribute controlling capabilities to languages without attribute-annotated data with an NLLB-200 model as a foundation. Inspired by techniques from controllable generation, we employ a gradient-based inference-time controller to steer the pretrained model. The controller transfers well to zero-shot conditions, as it operates on pretrained multilingual representations and is attribute -- rather than language-specific. With a comprehensive comparison to finetuning-based control, we demonstrate that, despite finetuning's clear dominance in supervised settings, the gap to inference-time control closes when moving to zero-shot conditions, especially with new and distant target languages. The latter also shows stronger domain robustness. We further show that our inference-time control complements finetuning. A human evaluation on a real low-resource language, Bengali, confirms our findings. Our code is https://github.com/dannigt/attribute-controller-transfer
△ Less
Submitted 24 January, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models
Authors:
Leonard Bärmann,
Rainer Kartmann,
Fabian Peller-Konrad,
Jan Niehues,
Alex Waibel,
Tamim Asfour
Abstract:
Natural-language dialog is key for intuitive human-robot interaction. It can be used not only to express humans' intents, but also to communicate instructions for improvement if a robot does not understand a command correctly. Of great importance is to endow robots with the ability to learn from such interaction experience in an incremental way to allow them to improve their behaviors or avoid mis…
▽ More
Natural-language dialog is key for intuitive human-robot interaction. It can be used not only to express humans' intents, but also to communicate instructions for improvement if a robot does not understand a command correctly. Of great importance is to endow robots with the ability to learn from such interaction experience in an incremental way to allow them to improve their behaviors or avoid mistakes in the future. In this paper, we propose a system to achieve incremental learning of complex behavior from natural interaction, and demonstrate its implementation on a humanoid robot. Building on recent advances, we present a system that deploys Large Language Models (LLMs) for high-level orchestration of the robot's behavior, based on the idea of enabling the LLM to generate Python statements in an interactive console to invoke both robot perception and action. The interaction loop is closed by feeding back human instructions, environment observations, and execution results to the LLM, thus informing the generation of the next statement. Specifically, we introduce incremental prompt learning, which enables the system to interactively learn from its mistakes. For that purpose, the LLM can call another LLM responsible for code-level improvements of the current interaction based on human feedback. The improved interaction is then saved in the robot's memory, and thus retrieved on similar requests. We integrate the system in the robot cognitive architecture of the humanoid robot ARMAR-6 and evaluate our methods both quantitatively (in simulation) and qualitatively (in simulation and real-world) by demonstrating generalized incrementally-learned knowledge.
△ Less
Submitted 16 May, 2024; v1 submitted 8 September, 2023;
originally announced September 2023.
-
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Authors:
Christian Huber,
Tu Anh Dinh,
Carlos Mullov,
Ngoc Quan Pham,
Thai Binh Nguyen,
Fabian Retkowski,
Stefan Constantin,
Enes Yavuz Ugan,
Danni Liu,
Zhaolin Li,
Sai Koneru,
Jan Niehues,
Alexander Waibel
Abstract:
The challenge of low-latency speech translation has recently draw significant interest in the research community as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches.
In this work…
▽ More
The challenge of low-latency speech translation has recently draw significant interest in the research community as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches.
In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components.
Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows to automatically evaluate the translation quality as well as latency and also provides a web interface to show the low-latency model outputs to the user.
△ Less
Submitted 17 July, 2024; v1 submitted 7 August, 2023;
originally announced August 2023.
-
KIT's Multilingual Speech Translation System for IWSLT 2023
Authors:
Danni Liu,
Thai Binh Nguyen,
Sai Koneru,
Enes Yavuz Ugan,
Ngoc-Quan Pham,
Tuan-Nam Nguyen,
Tu Anh Dinh,
Carlos Mullov,
Alexander Waibel,
Jan Niehues
Abstract:
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and te…
▽ More
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense contents. The task requires translation into 10 languages of varying amounts of resources. In absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that it matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
△ Less
Submitted 12 July, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Gender Lost In Translation: How Bridging The Gap Between Languages Affects Gender Bias in Zero-Shot Multilingual Translation
Authors:
Lena Cabrera,
Jan Niehues
Abstract:
Neural machine translation (NMT) models often suffer from gender biases that harm users and society at large. In this work, we explore how bridging the gap between languages for which parallel data is not available affects gender bias in multilingual NMT, specifically for zero-shot directions. We evaluate translation between grammatical gender languages which requires preserving the inherent gende…
▽ More
Neural machine translation (NMT) models often suffer from gender biases that harm users and society at large. In this work, we explore how bridging the gap between languages for which parallel data is not available affects gender bias in multilingual NMT, specifically for zero-shot directions. We evaluate translation between grammatical gender languages which requires preserving the inherent gender information from the source in the target language. We study the effect of encouraging language-agnostic hidden representations on models' ability to preserve gender and compare pivot-based and zero-shot translation regarding the influence of the bridge language (participating in all language pairs during training) on gender preservation. We find that language-agnostic representations mitigate zero-shot models' masculine bias, and with increased levels of gender inflection in the bridge language, pivoting surpasses zero-shot translation regarding fairer gender preservation for speaker-related gender agreement.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Perturbation-based QE: An Explainable, Unsupervised Word-level Quality Estimation Method for Blackbox Machine Translation
Authors:
Tu Anh Dinh,
Jan Niehues
Abstract:
Quality Estimation (QE) is the task of predicting the quality of Machine Translation (MT) system output, without using any gold-standard translation references. State-of-the-art QE models are supervised: they require human-labeled quality of some MT system output on some datasets for training, making them domain-dependent and MT-system-dependent. There has been research on unsupervised QE, which r…
▽ More
Quality Estimation (QE) is the task of predicting the quality of Machine Translation (MT) system output, without using any gold-standard translation references. State-of-the-art QE models are supervised: they require human-labeled quality of some MT system output on some datasets for training, making them domain-dependent and MT-system-dependent. There has been research on unsupervised QE, which requires glass-box access to the MT systems, or parallel MT data to generate synthetic errors for training QE models. In this paper, we present Perturbation-based QE - a word-level Quality Estimation approach that works simply by analyzing MT system output on perturbed input source sentences. Our approach is unsupervised, explainable, and can evaluate any type of blackbox MT systems, including the currently prominent large language models (LLMs) with opaque internal processes. For language directions with no labeled QE data, our approach has similar or better performance than the zero-shot supervised approach on the WMT21 shared task. Our approach is better at detecting gender bias and word-sense-disambiguation errors in translation than supervised QE, indicating its robustness to out-of-domain usage. The performance gap is larger when detecting errors on a nontraditional translation-prompting LLM, indicating that our approach is more generalizable to different MT systems. We give examples demonstrating our approach's explainability power, where it shows which input source words have influence on a certain MT output word.
△ Less
Submitted 13 July, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages
Authors:
Zhong Zhou,
Jan Niehues,
Alex Waibel
Abstract:
In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, tran…
▽ More
In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, endangered languages may be possible and reduce human translation effort. We attempt to leverage translation resources from many rich resource languages to efficiently produce best possible translation quality for a well known text, which is available in multiple languages, in a new, severely low resource language. We examine two approaches: 1. best selection of seed sentences to jump start translations in a new language in view of best generalization to the remainder of a larger targeted text(s), and 2. we adapt large general multilingual translation engines from many other languages to focus on a specific text in a new, unknown language. We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best. If we also select a best set of seed sentences, we can improve average chrF performance on new test languages from a baseline of 21.9 to 50.7, while reducing the number of seed sentences to only around 1,000 in the new, unknown language.
△ Less
Submitted 5 May, 2023;
originally announced May 2023.
-
Fully transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study
Authors:
Sophia J. Wagner,
Daniel Reisenbüchler,
Nicholas P. West,
Jan Moritz Niehues,
Gregory Patrick Veldhuizen,
Philip Quirke,
Heike I. Grabsch,
Piet A. van den Brandt,
Gordon G. A. Hutchins,
Susan D. Richman,
Tanwei Yuan,
Rupert Langer,
Josien Christina Anna Jenniskens,
Kelly Offermans,
Wolfram Mueller,
Richard Gray,
Stephen B. Gruber,
Joel K. Greenson,
Gad Rennert,
Joseph D. Bonner,
Daniel Schmolze,
Jacqueline A. James,
Maurice B. Loughrey,
Manuel Salto-Tellez,
Hermann Brenner
, et al. (6 additional authors not shown)
Abstract:
Background: Deep learning (DL) can extract predictive and prognostic biomarkers from routine pathology slides in colorectal cancer. For example, a DL test for the diagnosis of microsatellite instability (MSI) in CRC has been approved in 2022. Current approaches rely on convolutional neural networks (CNNs). Transformer networks are outperforming CNNs and are replacing them in many applications, but…
▽ More
Background: Deep learning (DL) can extract predictive and prognostic biomarkers from routine pathology slides in colorectal cancer. For example, a DL test for the diagnosis of microsatellite instability (MSI) in CRC has been approved in 2022. Current approaches rely on convolutional neural networks (CNNs). Transformer networks are outperforming CNNs and are replacing them in many applications, but have not been used for biomarker prediction in cancer at a large scale. In addition, most DL approaches have been trained on small patient cohorts, which limits their clinical utility. Methods: In this study, we developed a new fully transformer-based pipeline for end-to-end biomarker prediction from pathology slides. We combine a pre-trained transformer encoder and a transformer network for patch aggregation, capable of yielding single and multi-target prediction at patient level. We train our pipeline on over 9,000 patients from 10 colorectal cancer cohorts. Results: A fully transformer-based approach massively improves the performance, generalizability, data efficiency, and interpretability as compared with current state-of-the-art algorithms. After training on a large multicenter cohort, we achieve a sensitivity of 0.97 with a negative predictive value of 0.99 for MSI prediction on surgical resection specimens. We demonstrate for the first time that resection specimen-only training reaches clinical-grade performance on endoscopic biopsy tissue, solving a long-standing diagnostic problem. Interpretation: A fully transformer-based end-to-end pipeline trained on thousands of pathology slides yields clinical-grade performance for biomarker prediction on surgical resections and biopsies. Our new methods are freely available under an open source license.
△ Less
Submitted 1 March, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Diffusion Probabilistic Models beat GANs on Medical Images
Authors:
Gustav Müller-Franzes,
Jan Moritz Niehues,
Firas Khader,
Soroosh Tayebi Arasteh,
Christoph Haarburger,
Christiane Kuhl,
Tianci Wang,
Tianyu Han,
Sven Nebelung,
Jakob Nikolas Kather,
Daniel Truhn
Abstract:
The success of Deep Learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrary large datasets, but diversity and fidelity are limited, which has recently been addressed by denoising diffusion probabilistic models (DDPMs) whose superiority has been demonstrated on natural images. In this study, we…
▽ More
The success of Deep Learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrary large datasets, but diversity and fidelity are limited, which has recently been addressed by denoising diffusion probabilistic models (DDPMs) whose superiority has been demonstrated on natural images. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models, which constitute the current state-of-the-art in the medical domain. Medfusion was trained and compared with (i) StyleGan-3 on n=101,442 images from the AIROGS challenge dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on n=191,027 from the CheXpert dataset to generate radiographs with and without cardiomegaly and (iii) wGAN on n=19,557 images from the CRCMS dataset to generate histopathological images with and without microsatellite stability. In the AIROGS, CRMCS, and CheXpert datasets, Medfusion achieved lower (=better) FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus 84.31). Also, fidelity (precision) and diversity (recall) were higher (=better) for Medfusion in all three datasets. Our study shows that DDPM are a superior alternative to GANs for image synthesis in the medical domain.
△ Less
Submitted 14 December, 2022;
originally announced December 2022.
-
Towards continually learning new languages
Authors:
Ngoc-Quan Pham,
Jan Niehues,
Alexander Waibel
Abstract:
Multilingual speech recognition with neural networks is often implemented with batch-learning, when all of the languages are available before training. An ability to add new languages after the prior training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation in…
▽ More
Multilingual speech recognition with neural networks is often implemented with batch-learning, when all of the languages are available before training. An ability to add new languages after the prior training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation in order to counter catastrophic forgetting and facilitate learning new languages quickly. Such combination allowed us to eliminate catastrophic forgetting while still achieving performance for the new languages comparable with having all languages at once, in experiments of learning from an initial 10 languages to achieve 26 languages without catastrophic forgetting and a reasonable performance compared to training all languages from scratch.
△ Less
Submitted 17 July, 2024; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Efficient Speech Translation with Pre-trained Models
Authors:
Zhaolin Li,
Jan Niehues
Abstract:
When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech tr…
▽ More
When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, the application of this technology has a limitation on the need for additional end-to-end training data. In a second step, we proposed an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase the data efficiency and improve the translation quality by 6 BLEU points in scenarios with limited end-to-end training data.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Learning an Artificial Language for Knowledge-Sharing in Multilingual Translation
Authors:
Danni Liu,
Jan Niehues
Abstract:
The cornerstone of multilingual neural translation is shared representations across languages. Given the theoretically infinite representation power of neural networks, semantically identical sentences are likely represented differently. While representing sentences in the continuous latent space ensures expressiveness, it introduces the risk of capturing of irrelevant features which hinders the l…
▽ More
The cornerstone of multilingual neural translation is shared representations across languages. Given the theoretically infinite representation power of neural networks, semantically identical sentences are likely represented differently. While representing sentences in the continuous latent space ensures expressiveness, it introduces the risk of capturing of irrelevant features which hinders the learning of a common representation. In this work, we discretize the encoder output latent space of multilingual models by assigning encoder states to entries in a codebook, which in effect represents source sentences in a new artificial language. This discretization process not only offers a new way to interpret the otherwise black-box model representations, but, more importantly, gives potential for increasing robustness in unseen testing conditions. We validate our approach on large-scale experiments with realistic data volumes and domains. When tested in zero-shot conditions, our approach is competitive with two strong alternatives from the literature. We also use the learned artificial language to analyze model behavior, and discover that using a similar bridge language increases knowledge-sharing among the remaining languages.
△ Less
Submitted 18 November, 2022; v1 submitted 2 November, 2022;
originally announced November 2022.
-
Adaptive multilingual speech recognition with pretrained models
Authors:
Ngoc-Quan Pham,
Alex Waibel,
Jan Niehues
Abstract:
Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from unsupervised multilingual models to facilitate recognition, especially in many languages with limited data. Our work investigated the effectiveness of using two pretra…
▽ More
Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from unsupervised multilingual models to facilitate recognition, especially in many languages with limited data. Our work investigated the effectiveness of using two pretrained models for two modalities: wav2vec 2.0 for audio and MBART50 for text, together with the adaptive weight techniques to massively improve the recognition quality on the public datasets containing CommonVoice and Europarl. Overall, we noticed an 44% improvement over purely supervised learning, and more importantly, each technique provides a different reinforcement in different languages. We also explore other possibilities to potentially obtain the best model by slightly adding either depth or relative attention to the architecture.
△ Less
Submitted 24 May, 2022;
originally announced May 2022.
-
LibriS2S: A German-English Speech-to-Speech Translation Corpus
Authors:
Pedro Jeuris,
Jan Niehues
Abstract:
Recently, we have seen an increasing interest in the area of speech-to-text translation. This has led to astonishing improvements in this area. In contrast, the activities in the area of speech-to-speech translation is still limited, although it is essential to overcome the language barrier. We believe that one of the limiting factors is the availability of appropriate training data. We address th…
▽ More
Recently, we have seen an increasing interest in the area of speech-to-text translation. This has led to astonishing improvements in this area. In contrast, the activities in the area of speech-to-speech translation is still limited, although it is essential to overcome the language barrier. We believe that one of the limiting factors is the availability of appropriate training data. We address this issue by creating LibriS2S, to our knowledge the first publicly available speech-to-speech training corpus between German and English. For this corpus, we used independently created audio for German and English leading to an unbiased pronunciation of the text in both languages. This allows the creation of a new text-to-speech and speech-to-speech translation model that directly learns to generate the speech signal based on the pronunciation of the source language. Using this created corpus, we propose Text-to-Speech models based on the example of the recently proposed FastSpeech 2 model that integrates source language information. We do this by adapting the model to take information such as the pitch, energy or transcript from the source speech as additional input.
△ Less
Submitted 22 April, 2022;
originally announced April 2022.
-
CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022
Authors:
Peter Polák,
Ngoc-Quan Ngoc,
Tuan-Nam Nguyen,
Danni Liu,
Carlos Mullov,
Jan Niehues,
Ondřej Bojar,
Alexander Waibel
Abstract:
In this paper, we describe our submission to the Simultaneous Speech Translation at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being $3\times$ faster than offline in terms of latency on the test set.…
▽ More
In this paper, we describe our submission to the Simultaneous Speech Translation at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being $3\times$ faster than offline in terms of latency on the test set. We also show that the onlinized offline model outperforms the best IWSLT2021 simultaneous system in medium and high latency regimes and is almost on par in the low latency regime. We make our system publicly available.
△ Less
Submitted 11 May, 2022; v1 submitted 12 April, 2022;
originally announced April 2022.
-
Multilingual Simultaneous Speech Translation
Authors:
Shashank Subramanya,
Jan Niehues
Abstract:
Applications designed for simultaneous speech translation during events such as conferences or meetings need to balance quality and lag while displaying translated text to deliver a good user experience. One common approach to building online spoken language translation systems is by leveraging models built for offline speech translation. Based on a technique to adapt end-to-end monolingual models…
▽ More
Applications designed for simultaneous speech translation during events such as conferences or meetings need to balance quality and lag while displaying translated text to deliver a good user experience. One common approach to building online spoken language translation systems is by leveraging models built for offline speech translation. Based on a technique to adapt end-to-end monolingual models, we investigate multilingual models and different architectures (end-to-end and cascade) on the ability to perform online speech translation. On the multilingual TEDx corpus, we show that the approach generalizes to different architectures. We see similar gains in latency reduction (40% relative) across languages and architectures. However, the end-to-end architecture leads to smaller translation quality losses after adapting to the online model. Furthermore, the approach even scales to zero-shot directions.
△ Less
Submitted 29 March, 2022; v1 submitted 28 March, 2022;
originally announced March 2022.
-
Event Generators for High-Energy Physics Experiments
Authors:
J. M. Campbell,
M. Diefenthaler,
T. J. Hobbs,
S. Höche,
J. Isaacson,
F. Kling,
S. Mrenna,
J. Reuter,
S. Alioli,
J. R. Andersen,
C. Andreopoulos,
A. M. Ankowski,
E. C. Aschenauer,
A. Ashkenazi,
M. D. Baker,
J. L. Barrow,
M. van Beekveld,
G. Bewick,
S. Bhattacharya,
C. Bierlich,
E. Bothmann,
P. Bredt,
A. Broggio,
A. Buckley,
A. Butter
, et al. (186 additional authors not shown)
Abstract:
We provide an overview of the status of Monte-Carlo event generators for high-energy particle physics. Guided by the experimental needs and requirements, we highlight areas of active development, and opportunities for future improvements. Particular emphasis is given to physics models and algorithms that are employed across a variety of experiments. These common themes in event generator developme…
▽ More
We provide an overview of the status of Monte-Carlo event generators for high-energy particle physics. Guided by the experimental needs and requirements, we highlight areas of active development, and opportunities for future improvements. Particular emphasis is given to physics models and algorithms that are employed across a variety of experiments. These common themes in event generator development lead to a more comprehensive understanding of physics at the highest energies and intensities, and allow models to be tested against a wealth of data that have been accumulated over the past decades. A cohesive approach to event generator development will allow these models to be further improved and systematic uncertainties to be reduced, directly contributing to future experimental success. Event generators are part of a much larger ecosystem of computational tools. They typically involve a number of unknown model parameters that must be tuned to experimental data, while maintaining the integrity of the underlying physics models. Making both these data, and the analyses with which they have been obtained accessible to future users is an essential aspect of open science and data preservation. It ensures the consistency of physics models across a variety of experiments.
△ Less
Submitted 23 January, 2024; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques
Authors:
Tu Anh Dinh,
Danni Liu,
Jan Niehues
Abstract:
Recently, end-to-end speech translation (ST) has gained significant attention as it avoids error propagation. However, the approach suffers from data scarcity. It heavily depends on direct ST data and is less efficient in making use of speech transcription and text translation data, which is often more easily available. In the related field of multilingual text translation, several techniques have…
▽ More
Recently, end-to-end speech translation (ST) has gained significant attention as it avoids error propagation. However, the approach suffers from data scarcity. It heavily depends on direct ST data and is less efficient in making use of speech transcription and text translation data, which is often more easily available. In the related field of multilingual text translation, several techniques have been proposed for zero-shot translation. A main idea is to increase the similarity of semantically similar sentences in different languages. We investigate whether these ideas can be applied to speech translation, by building ST models trained on speech transcription and text translation data. We investigate the effects of data augmentation and auxiliary loss function. The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points compared to direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from ASR model.
△ Less
Submitted 26 January, 2022;
originally announced January 2022.
-
Cost-Effective Training in Low-Resource Neural Machine Translation
Authors:
Sai Koneru,
Danni Liu,
Jan Niehues
Abstract:
While Active Learning (AL) techniques are explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets where a limited number of sentences can get translated. Such situations are especially challenging and can occur for endangered languages with few human annotators or having cost constraints to label large amounts of data. Although AL is shown to be help…
▽ More
While Active Learning (AL) techniques are explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets where a limited number of sentences can get translated. Such situations are especially challenging and can occur for endangered languages with few human annotators or having cost constraints to label large amounts of data. Although AL is shown to be helpful with large budgets, it is not enough to build high-quality translation systems in these low-resource conditions. In this work, we propose a cost-effective training procedure to increase the performance of NMT models utilizing a small number of annotated sentences and dictionary entries. Our method leverages monolingual data with self-supervised objectives and a small-scale, inexpensive dictionary for additional supervision to initialize the NMT model before applying AL. We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions. We also present a novel AL strategy inspired by domain adaptation for NMT and show that it is effective for low budgets. We propose a new hybrid data-driven approach, which samples sentences that are diverse from the labelled data and also most similar to unlabelled data. Finally, we show that initializing the NMT model and further using our AL strategy can achieve gains of up to $13$ BLEU compared to conventional AL methods.
△ Less
Submitted 14 January, 2022;
originally announced January 2022.
-
Impact of jet-production data on the next-to-next-to-leading-order determination of HERAPDF2.0 parton distributions
Authors:
H1,
ZEUS Collaborations,
:,
I. Abt,
R. Aggarwal,
V. Andreev,
M. Arratia,
V. Aushev,
A. Baghdasaryan,
A. Baty,
K. Begzsuren,
O. Behnke,
A. Belousov,
A. Bertolin,
I. Bloch,
V. Boudry,
G. Brandt,
I. Brock,
N. H. Brook,
R. Brugnera,
A. Bruni,
A. Buniatyan,
P. J. Bussey,
L. Bystritskaya,
A. Caldwell
, et al. (212 additional authors not shown)
Abstract:
The HERAPDF2.0 ensemble of parton distribution functions (PDFs) was introduced in 2015. The final stage is presented, a next-to-next-to-leading-order (NNLO) analysis of the HERA data on inclusive deep inelastic $ep$ scattering together with jet data as published by the H1 and ZEUS collaborations. A perturbative QCD fit, simultaneously of $α_s(M_Z^2)$ and and the PDFs, was performed with the result…
▽ More
The HERAPDF2.0 ensemble of parton distribution functions (PDFs) was introduced in 2015. The final stage is presented, a next-to-next-to-leading-order (NNLO) analysis of the HERA data on inclusive deep inelastic $ep$ scattering together with jet data as published by the H1 and ZEUS collaborations. A perturbative QCD fit, simultaneously of $α_s(M_Z^2)$ and and the PDFs, was performed with the result $α_s(M_Z^2) = 0.1156 \pm 0.0011~{\rm (exp)}~ ^{+0.0001}_{-0.0002}~ {\rm (model}$ ${\rm +~parameterisation)}~ \pm 0.0029~{\rm (scale)}$. The PDF sets of HERAPDF2.0Jets NNLO were determined with separate fits using two fixed values of $α_s(M_Z^2)$, $α_s(M_Z^2)=0.1155$ and $0.118$, since the latter value was already chosen for the published HERAPDF2.0 NNLO analysis based on HERA inclusive DIS data only. The different sets of PDFs are presented, evaluated and compared. The consistency of the PDFs determined with and without the jet data demonstrates the consistency of HERA inclusive and jet-production cross-section data. The inclusion of the jet data reduced the uncertainty on the gluon PDF. Predictions based on the PDFs of HERAPDF2.0Jets NNLO give an excellent description of the jet-production data used as input.
△ Less
Submitted 2 December, 2021;
originally announced December 2021.
-
Self-organized quantization and oscillations on continuous fixed-energy sandpiles
Authors:
Jakob Niehues,
Gorm Gruner Jensen,
Jan O. Haerter
Abstract:
Atmospheric self-organization and activator-inhibitor dynamics in biology provide examples of checkerboard-like spatio-temporal organization. We study a simple model for local activation-inhibition processes. Our model, first introduced in the context of atmospheric moisture dynamics, is a continuous-energy and non-Abelian version of the fixed-energy sandpile model. Each lattice site is populated…
▽ More
Atmospheric self-organization and activator-inhibitor dynamics in biology provide examples of checkerboard-like spatio-temporal organization. We study a simple model for local activation-inhibition processes. Our model, first introduced in the context of atmospheric moisture dynamics, is a continuous-energy and non-Abelian version of the fixed-energy sandpile model. Each lattice site is populated by a non-negative real number, its energy. Upon each timestep all sites with energy exceeding a unit threshold re-distribute their energy at equal parts to their nearest neighbors. The limit cycle dynamics gives rise to a complex phase diagram in dependence on the mean energy $μ$: For low $μ$, all dynamics ceases after few re-distribution events. For large $μ$, the dynamics is well-described as a diffusion process, where the order parameter, spatial variance $σ$, is removed. States at intermediate $μ$ are dominated by checkerboard-like period-two phases which are however interspersed by much more complex phases of far longer periods. Phases are separated by discontinuous jumps in $σ$ or $\partial_μσ$ - akin to first and higher-order phase transitions. Overall, the energy landscape is dominated by few energy levels which occur as sharp spikes in the single-site density of states and are robust to noise.
△ Less
Submitted 4 November, 2021;
originally announced November 2021.
-
Unsupervised Machine Translation On Dravidian Languages
Authors:
Sai Koneru,
Danni Liu,
Jan Niehues
Abstract:
Unsupervised neural machine translation (UNMT) is beneficial especially for low resource languages such as those from the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low resource languages. Recent works propose to utilize auxiliary parallel data and have achieved state-of-the-art results. In this work, we focus on unsupervised translation between En…
▽ More
Unsupervised neural machine translation (UNMT) is beneficial especially for low resource languages such as those from the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low resource languages. Recent works propose to utilize auxiliary parallel data and have achieved state-of-the-art results. In this work, we focus on unsupervised translation between English and Kannada, a low resource Dravidian language. We additionally utilize a limited amount of auxiliary data between English and other related Dravidian languages. We show that unifying the writing systems is essential in unsupervised translation between the Dravidian languages. We explore several model architectures that use the auxiliary data in order to maximize knowledge sharing and enable UNMT for distant language pairs. Our experiments demonstrate that it is crucial to include auxiliary languages that are similar to our focal language, Kannada. Furthermore, we propose a metric to measure language similarity and show that it serves as a good indicator for selecting the auxiliary languages.
△ Less
Submitted 29 March, 2021;
originally announced March 2021.
-
Continuous Learning in Neural Machine Translation using Bilingual Dictionaries
Authors:
Jan Niehues
Abstract:
While recent advances in deep learning led to significant improvements in machine translation, neural machine translation is often still not able to continuously adapt to the environment. For humans, as well as for machine translation, bilingual dictionaries are a promising knowledge source to continuously integrate new knowledge. However, their exploitation poses several challenges: The system ne…
▽ More
While recent advances in deep learning led to significant improvements in machine translation, neural machine translation is often still not able to continuously adapt to the environment. For humans, as well as for machine translation, bilingual dictionaries are a promising knowledge source to continuously integrate new knowledge. However, their exploitation poses several challenges: The system needs to be able to perform one-shot learning as well as model the morphology of source and target language.
In this work, we proposed an evaluation framework to assess the ability of neural machine translation to continuously learn new phrases. We integrate one-shot learning methods for neural machine translation with different word representations and show that it is important to address both in order to successfully make use of bilingual dictionaries. By addressing both challenges we are able to improve the ability to translate new, rare words and phrases from 30% to up to 70%. The correct lemma is even generated by more than 90%.
△ Less
Submitted 12 February, 2021;
originally announced February 2021.
-
Improving Zero-Shot Translation by Disentangling Positional Information
Authors:
Danni Liu,
Jan Niehues,
James Cross,
Francisco Guzmán,
Xian Li
Abstract:
Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. W…
▽ More
Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations.
△ Less
Submitted 30 June, 2021; v1 submitted 30 December, 2020;
originally announced December 2020.
-
The Large Hadron-Electron Collider at the HL-LHC
Authors:
P. Agostini,
H. Aksakal,
S. Alekhin,
P. P. Allport,
N. Andari,
K. D. J. Andre,
D. Angal-Kalinin,
S. Antusch,
L. Aperio Bella,
L. Apolinario,
R. Apsimon,
A. Apyan,
G. Arduini,
V. Ari,
A. Armbruster,
N. Armesto,
B. Auchmann,
K. Aulenbacher,
G. Azuelos,
S. Backovic,
I. Bailey,
S. Bailey,
F. Balli,
S. Behera,
O. Behnke
, et al. (312 additional authors not shown)
Abstract:
The Large Hadron electron Collider (LHeC) is designed to move the field of deep inelastic scattering (DIS) to the energy and intensity frontier of particle physics. Exploiting energy recovery technology, it collides a novel, intense electron beam with a proton or ion beam from the High Luminosity--Large Hadron Collider (HL-LHC). The accelerator and interaction region are designed for concurrent el…
▽ More
The Large Hadron electron Collider (LHeC) is designed to move the field of deep inelastic scattering (DIS) to the energy and intensity frontier of particle physics. Exploiting energy recovery technology, it collides a novel, intense electron beam with a proton or ion beam from the High Luminosity--Large Hadron Collider (HL-LHC). The accelerator and interaction region are designed for concurrent electron-proton and proton-proton operation. This report represents an update of the Conceptual Design Report (CDR) of the LHeC, published in 2012. It comprises new results on parton structure of the proton and heavier nuclei, QCD dynamics, electroweak and top-quark physics. It is shown how the LHeC will open a new chapter of nuclear particle physics in extending the accessible kinematic range in lepton-nucleus scattering by several orders of magnitude. Due to enhanced luminosity, large energy and the cleanliness of the hadronic final states, the LHeC has a strong Higgs physics programme and its own discovery potential for new physics. Building on the 2012 CDR, the report represents a detailed updated design of the energy recovery electron linac (ERL) including new lattice, magnet, superconducting radio frequency technology and further components. Challenges of energy recovery are described and the lower energy, high current, 3-turn ERL facility, PERLE at Orsay, is presented which uses the LHeC characteristics serving as a development facility for the design and operation of the LHeC. An updated detector design is presented corresponding to the acceptance, resolution and calibration goals which arise from the Higgs and parton density function physics programmes. The paper also presents novel results on the Future Circular Collider in electron-hadron mode, FCC-eh, which utilises the same ERL technology to further extend the reach of DIS to even higher centre-of-mass energies.
△ Less
Submitted 12 April, 2021; v1 submitted 28 July, 2020;
originally announced July 2020.
-
Adapting End-to-End Speech Recognition for Readable Subtitles
Authors:
Danni Liu,
Jan Niehues,
Gerasimos Spanakis
Abstract:
Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. Therefore, this work focuses on ASR with output compression, a task challenging for supervised approaches due to the scarcity of training data. We first investi…
▽ More
Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. Therefore, this work focuses on ASR with output compression, a task challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.
△ Less
Submitted 25 May, 2020;
originally announced May 2020.
-
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Authors:
Danni Liu,
Gerasimos Spanakis,
Jan Niehues
Abstract:
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rates (WER) and BLEU, latency is also a crucial factor in many practical use-cases. We propose three latency reduction techniques for chunk-based incremental inference and evaluate their efficie…
▽ More
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rates (WER) and BLEU, latency is also a crucial factor in many practical use-cases. We propose three latency reduction techniques for chunk-based incremental inference and evaluate their efficiency in terms of accuracy-latency trade-off. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 second by sacrificing 1% WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the hypothesis selection strategies are applicable to other encoder-decoder models. To avoid expensive re-computation, we use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on-par with the original model. We further show that our approach is also applicable to low-latency speech translation. On How2 English-Portuguese speech translation, we reduce latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.
△ Less
Submitted 13 October, 2020; v1 submitted 22 May, 2020;
originally announced May 2020.
-
Relative Positional Encoding for Speech Recognition and Direct Translation
Authors:
Ngoc-Quan Pham,
Thanh-Le Ha,
Tuan-Nam Nguyen,
Thai-Son Nguyen,
Elizabeth Salesky,
Sebastian Stueker,
Jan Niehues,
Alexander Waibel
Abstract:
Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition…
▽ More
Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result in the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.
△ Less
Submitted 20 May, 2020;
originally announced May 2020.
-
Machine Translation with Unsupervised Length-Constraints
Authors:
Jan Niehues
Abstract:
We have seen significant improvements in machine translation due to the usage of deep learning. While the improvements in translation quality are impressive, the encoder-decoder architecture enables many more possibilities. In this paper, we explore one of these, the generation of constraint translation. We focus on length constraints, which are essential if the translation should be displayed in…
▽ More
We have seen significant improvements in machine translation due to the usage of deep learning. While the improvements in translation quality are impressive, the encoder-decoder architecture enables many more possibilities. In this paper, we explore one of these, the generation of constraint translation. We focus on length constraints, which are essential if the translation should be displayed in a given format. In this work, we propose an end-to-end approach for this task. Compared to a traditional method that first translates and then performs sentence compression, the text compression is learned completely unsupervised. By combining the idea with zero-shot multilingual machine translation, we are also able to perform unsupervised monolingual sentence compression. In order to fulfill the length constraints, we investigated several methods to integrate the constraints into the model. Using the presented technique, we are able to significantly improve the translation quality under constraints. Furthermore, we are able to perform unsupervised monolingual sentence compression.
△ Less
Submitted 7 April, 2020;
originally announced April 2020.
-
Low Latency ASR for Simultaneous Speech Translation
Authors:
Thai Son Nguyen,
Jan Niehues,
Eunah Cho,
Thanh-Le Ha,
Kevin Kilgour,
Markus Muller,
Matthias Sperber,
Sebastian Stueker,
Alex Waibel
Abstract:
User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We therefore have worked on several techniques for reducing the latency for both components, the automatic speech recognition and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we…
▽ More
User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We therefore have worked on several techniques for reducing the latency for both components, the automatic speech recognition and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we focused on word latency. We used it to analyze the performance of our current system and to identify opportunities for improvements. In order to minimize the latency we combined run-on decoding with a technique for identifying stable partial hypotheses when stream decoding and a protocol for dynamic output update that allows to revise the most recent parts of the transcription. This combination reduces the latency at word level, where the words are final and will never be updated again in the future, from 18.1s to 1.1s without sacrificing performance in terms of word error rate.
△ Less
Submitted 22 March, 2020;
originally announced March 2020.
-
Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation
Authors:
Thai-Son Nguyen,
Sebastian Stueker,
Jan Niehues,
Alex Waibel
Abstract:
Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhib…
▽ More
Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhibited by the training data with the help of data augmentation. In this paper we examine the influence of three data augmentation methods on the performance of two S2S model architectures. One of the data augmentation method comes from literature, while two other methods are our own development - a time perturbation in the frequency domain and sub-sequence sampling. Our experiments on Switchboard and Fisher data show state-of-the-art performance for S2S models that are trained solely on the speech training data and do not use additional text data.
△ Less
Submitted 3 February, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
-
Modeling Confidence in Sequence-to-Sequence Models
Authors:
Jan Niehues,
Ngoc-Quan Pham
Abstract:
Recently, significant improvements have been achieved in various natural language processing tasks using neural sequence-to-sequence models. While aiming for the best generation quality is important, ultimately it is also necessary to develop models that can assess the quality of their output.
In this work, we propose to use the similarity between training and test conditions as a measure for mo…
▽ More
Recently, significant improvements have been achieved in various natural language processing tasks using neural sequence-to-sequence models. While aiming for the best generation quality is important, ultimately it is also necessary to develop models that can assess the quality of their output.
In this work, we propose to use the similarity between training and test conditions as a measure for models' confidence. We investigate methods solely using the similarity as well as methods combining it with the posterior probability. While traditionally only target tokens are annotated with confidence measures, we also investigate methods to annotate source tokens with confidence. By learning an internal alignment model, we can significantly improve confidence projection over using state-of-the-art external alignment tools. We evaluate the proposed methods on downstream confidence estimation for machine translation (MT). We show improvements on segment-level confidence estimation as well as on confidence estimation for source tokens. In addition, we show that the same methods can also be applied to other tasks using sequence-to-sequence models. On the automatic speech recognition (ASR) task, we are able to find 60% of the errors by looking at 20% of the data.
△ Less
Submitted 4 October, 2019;
originally announced October 2019.
-
Incremental processing of noisy user utterances in the spoken language understanding task
Authors:
Stefan Constantin,
Jan Niehues,
Alex Waibel
Abstract:
The state-of-the-art neural network architectures make it possible to create spoken language understanding systems with high quality and fast processing time. One major challenge for real-world applications is the high latency of these systems caused by triggered actions with high executions times. If an action can be separated into subactions, the reaction time of the systems can be improved thro…
▽ More
The state-of-the-art neural network architectures make it possible to create spoken language understanding systems with high quality and fast processing time. One major challenge for real-world applications is the high latency of these systems caused by triggered actions with high executions times. If an action can be separated into subactions, the reaction time of the systems can be improved through incremental processing of the user utterance and starting subactions while the utterance is still being uttered. In this work, we present a model-agnostic method to achieve high quality in processing incrementally produced partial utterances. Based on clean and noisy versions of the ATIS dataset, we show how to create datasets with our method to create low-latency natural language understanding components. We get improvements of up to 47.91 absolute percentage points in the metric F1-score.
△ Less
Submitted 30 September, 2019;
originally announced September 2019.
-
Second-order QCD corrections to event shape distributions in deep inelastic scattering
Authors:
T. Gehrmann,
A. Huss,
J. Mo,
J. Niehues
Abstract:
We compute the next-to-next-to-leading order (NNLO) QCD corrections to event shape distributions and their mean values in deep inelastic lepton-nucleon scattering. The magnitude and shape of the corrections varies considerably between different variables. The corrections reduce the renormalization and factorization scale uncertainty of the predictions. Using a dispersive model to describe non-pert…
▽ More
We compute the next-to-next-to-leading order (NNLO) QCD corrections to event shape distributions and their mean values in deep inelastic lepton-nucleon scattering. The magnitude and shape of the corrections varies considerably between different variables. The corrections reduce the renormalization and factorization scale uncertainty of the predictions. Using a dispersive model to describe non-perturbative power corrections, we compare the NNLO QCD predictions with data from the H1 and ZEUS experiments. The newly derived corrections improve the theory description of the distributions and of their mean values.
△ Less
Submitted 6 September, 2019;
originally announced September 2019.
-
Improving Zero-shot Translation with Language-Independent Constraints
Authors:
Ngoc-Quan Pham,
Jan Niehues,
Thanh-Le Ha,
Alex Waibel
Abstract:
An important concern in training multilingual neural machine translation (NMT) is to translate between language pairs unseen during training, i.e zero-shot translation. Improving this ability kills two birds with one stone by providing an alternative to pivot translation which also allows us to better understand how the model captures information between languages.
In this work, we carried out a…
▽ More
An important concern in training multilingual neural machine translation (NMT) is to translate between language pairs unseen during training, i.e zero-shot translation. Improving this ability kills two birds with one stone by providing an alternative to pivot translation which also allows us to better understand how the model captures information between languages.
In this work, we carried out an investigation on this capability of the multilingual NMT models. First, we intentionally create an encoder architecture which is independent with respect to the source language. Such experiments shed light on the ability of NMT encoders to learn multilingual representations, in general. Based on such proof of concept, we were able to design regularization methods into the standard Transformer model, so that the whole architecture becomes more robust in zero-shot conditions. We investigated the behaviour of such models on the standard IWSLT 2017 multilingual dataset. We achieved an average improvement of 2.23 BLEU points across 12 language pairs compared to the zero-shot performance of a state-of-the-art multilingual system. Additionally, we carry out further experiments in which the effect is confirmed even for language pairs with multiple intermediate pivots.
△ Less
Submitted 20 June, 2019;
originally announced June 2019.
-
Calculations for deep inelastic scattering using fast interpolation grid techniques at NNLO in QCD and the extraction of $α_s$ from HERA data
Authors:
D. Britzger,
J. Currie,
A. Gehrmann-De Ridder,
T. Gehrmann,
E. W. N. Glover,
C. Gwenlan,
A. Huss,
T. Morgan,
J. Niehues,
J. Pires,
K. Rabbertz,
M. R. Sutton
Abstract:
The extension of interpolation-grid frameworks for perturbative QCD calculations at next-to-next-to-leading order (NNLO) is presented for deep inelastic scattering (DIS) processes. A fast and flexible evaluation of higher-order predictions for any a posteriori choice of parton distribution functions (PDFs) or value of the strong coupling constant is essential in iterative fitting procedures to ext…
▽ More
The extension of interpolation-grid frameworks for perturbative QCD calculations at next-to-next-to-leading order (NNLO) is presented for deep inelastic scattering (DIS) processes. A fast and flexible evaluation of higher-order predictions for any a posteriori choice of parton distribution functions (PDFs) or value of the strong coupling constant is essential in iterative fitting procedures to extract PDFs and Standard Model parameters as well as for a detailed study of the scale dependence. The APPLfast project, described here, provides a generic interface between the parton-level Monte Carlo program NNLOJET and both the APPLgrid and fastNLO libraries for the production of interpolation grids at NNLO accuracy. Details of the interface for DIS processes are presented together with the required interpolation grids at NNLO, which are made available. They cover numerous inclusive jet measurements by the H1 and ZEUS experiments at HERA. An extraction of the strong coupling constant is performed as an application of the use of such grids and a best-fit value of $α_s(M_Z) = 0.1178\,(15)_\text{exp}\,(21)_\text{th}$ is obtained using the HERA inclusive jet cross section data.
△ Less
Submitted 27 August, 2021; v1 submitted 12 June, 2019;
originally announced June 2019.
-
Very Deep Self-Attention Networks for End-to-End Speech Recognition
Authors:
Ngoc-Quan Pham,
Thai-Son Nguyen,
Jan Niehues,
Markus Müller,
Sebastian Stüker,
Alexander Waibel
Abstract:
Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transfor…
▽ More
Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed performance from previous end-to-end approaches and even match the conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both encoder and decoders combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieve 9.9% and 17.7% WER on Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.
△ Less
Submitted 3 May, 2019; v1 submitted 30 April, 2019;
originally announced April 2019.
-
Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation
Authors:
Matthias Sperber,
Graham Neubig,
Jan Niehues,
Alex Waibel
Abstract:
Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. Howev…
▽ More
Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. However, experiments are inconclusive on whether the cascade or the direct model is stronger, and have only been conducted under the unrealistic assumption that both are trained on equal amounts of data, ignoring other available speech recognition and machine translation corpora.
In this paper, we demonstrate that direct speech translation models require more data to perform well than cascaded models, and while they allow including auxiliary data through multi-task training, they are poor at exploiting such data, putting them at a severe disadvantage. As a remedy, we propose the use of end-to-end trainable models with two attention mechanisms, the first establishing source speech to source text alignments, the second modeling source to target text alignment. We show that such models naturally decompose into multi-task-trainable recognition and translation tasks and propose an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages. Our proposed model outperforms all examined baselines and is able to exploit auxiliary training data much more effectively than direct attentional models.
△ Less
Submitted 15 April, 2019;
originally announced April 2019.
-
Multi-task learning to improve natural language understanding
Authors:
Stefan Constantin,
Jan Niehues,
Alex Waibel
Abstract:
Recently advancements in sequence-to-sequence neural network architectures have led to an improved natural language understanding. When building a neural network-based Natural Language Understanding component, one main challenge is to collect enough training data. The generation of a synthetic dataset is an inexpensive and quick way to collect data. Since this data often has less variety than real…
▽ More
Recently advancements in sequence-to-sequence neural network architectures have led to an improved natural language understanding. When building a neural network-based Natural Language Understanding component, one main challenge is to collect enough training data. The generation of a synthetic dataset is an inexpensive and quick way to collect data. Since this data often has less variety than real natural language, neural networks often have problems to generalize to unseen utterances during testing. In this work, we address this challenge by using multi-task learning. We train out-of-domain real data alongside in-domain synthetic data to improve natural language understanding. We evaluate this approach in the domain of airline travel information with two synthetic datasets. As out-of-domain real data, we test two datasets based on the subtitles of movies and series. By using an attention-based encoder-decoder model, we were able to improve the F1-score over strong baselines from 80.76 % to 84.98 % in the smaller synthetic dataset.
△ Less
Submitted 15 February, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
Jet production in charged-current deep-inelastic scattering to third order in QCD
Authors:
T. Gehrmann,
J. Niehues,
A. Huss,
A. Vogt,
D. M. Walker
Abstract:
The production of jets in charged-current deep-inelastic scattering (CC DIS) probes simultaneously the strong and the electroweak sectors of the Standard Model; its measurement provides important information on the quark flavour structure of the proton. We compute third-order (N3LO) perturbative QCD corrections to this process, fully differential in the jet and lepton kinematics. We observe a subs…
▽ More
The production of jets in charged-current deep-inelastic scattering (CC DIS) probes simultaneously the strong and the electroweak sectors of the Standard Model; its measurement provides important information on the quark flavour structure of the proton. We compute third-order (N3LO) perturbative QCD corrections to this process, fully differential in the jet and lepton kinematics. We observe a substantial reduction in the theory uncertainty, to sub-percent level throughout the relevant kinematical range, thus enabling precision phenomenology with jet observables.
△ Less
Submitted 19 March, 2019; v1 submitted 14 December, 2018;
originally announced December 2018.
-
Towards Fluent Translations from Disfluent Speech
Authors:
Elizabeth Salesky,
Susanne Burger,
Jan Niehues,
Alex Waibel
Abstract:
When translating from speech, special consideration for conversational speech phenomena such as disfluencies is necessary. Most machine translation training data consists of well-formed written texts, causing issues when translating spontaneous speech. Previous work has introduced an intermediate step between speech recognition (ASR) and machine translation (MT) to remove disfluencies, making the…
▽ More
When translating from speech, special consideration for conversational speech phenomena such as disfluencies is necessary. Most machine translation training data consists of well-formed written texts, causing issues when translating spontaneous speech. Previous work has introduced an intermediate step between speech recognition (ASR) and machine translation (MT) to remove disfluencies, making the data better-matched to typical translation text and significantly improving performance. However, with the rise of end-to-end speech translation systems, this intermediate step must be incorporated into the sequence-to-sequence architecture. Further, though translated speech datasets exist, they are typically news or rehearsed speech without many disfluencies (e.g. TED), or the disfluencies are translated into the references (e.g. Fisher). To generate clean translations from disfluent speech, cleaned references are necessary for evaluation. We introduce a corpus of cleaned target data for the Fisher Spanish-English dataset for this task. We compare how different architectures handle disfluencies and provide a baseline for removing disfluencies in end-to-end translation.
△ Less
Submitted 7 November, 2018;
originally announced November 2018.
-
Optimizing Segmentation Granularity for Neural Machine Translation
Authors:
Elizabeth Salesky,
Andrew Runge,
Alex Coda,
Jan Niehues,
Graham Neubig
Abstract:
In neural machine translation (NMT), it is has become standard to translate using subword units to allow for an open vocabulary and improve accuracy on infrequent words. Byte-pair encoding (BPE) and its variants are the predominant approach to generating these subwords, as they are unsupervised, resource-free, and empirically effective. However, the granularity of these subword units is a hyperpar…
▽ More
In neural machine translation (NMT), it is has become standard to translate using subword units to allow for an open vocabulary and improve accuracy on infrequent words. Byte-pair encoding (BPE) and its variants are the predominant approach to generating these subwords, as they are unsupervised, resource-free, and empirically effective. However, the granularity of these subword units is a hyperparameter to be tuned for each language and task, using methods such as grid search. Tuning may be done inexhaustively or skipped entirely due to resource constraints, leading to sub-optimal performance. In this paper, we propose a method to automatically tune this parameter using only one training pass. We incrementally introduce new vocabulary online based on the held-out validation loss, beginning with smaller, general subwords and adding larger, more specific units over the course of training. Our method matches the results found with grid search, optimizing segmentation granularity without any additional training time. We also show benefits in training efficiency and performance improvements for rare words due to the way embeddings for larger units are incrementally constructed by combining those from smaller units.
△ Less
Submitted 19 October, 2018;
originally announced October 2018.