Search | arXiv e-print repository

Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation

Authors: Manh Luong, Khai Nguyen, Nhat Ho, Reza Haf, Dinh Phung, Lizhen Qu

Abstract: The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs in training data. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap across audio and text embedding, which surpasses both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in training data on the AudioCaps dataset.
Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2403.13494 [pdf, other]

Novel imprint of a dark photon from the 3-3-1-1 model

Authors: Doan Minh Luong, Phung Van Dong, Nguyen Huy Thao

Abstract: We investigate a dark photon that arises from the UV model based upon $SU(3)_C\otimes SU(3)_L\otimes U(1)_X \otimes U(1)_G$ (3-3-1-1) gauge symmetry, where the last three factors enlarge the electroweak symmetry encompassing electric charge $Q=T_3 - 1/ \sqrt{3}T_8 +X$ and dark charge $D = -2/\sqrt{3} T_8 +G$. It is well-established that this model addresses the questions of family number, neutrino mass, and dark matter. It is shown in this work that if the 3-3-1-1 breaking scale is much bigger than the dark charge breaking scale, the relevant dark gauge boson $Z'$ is uniquely imprinted at TeV, avoiding dangerous FCNC processes, obeying precision electroweak measurements, as well as contributing to collider phenomena, even if no kinetic mixing is presented. The dark matter observables are perhaps governed by the dark charge breaking Higgs field instead of the dark photon.

Submitted 23 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: 16 pages, 3 figures, 2 tables; A scalar triplet relabelled for clarity

arXiv:2309.07370 [pdf]

doi 10.1021/acsenergylett.3c02296

Highly-Sensitive Resonance-Enhanced Organic Photodetectors for Shortwave Infrared Sensing

Authors: Hoang Mai Luong, Chokchai Kaiyasuan, Ahra Yi, Sangmin Chae, Brian Minki Kim, Patchareepond Panoy, Hyo Jung Kim, Vinich Promarak, Yasuo Miyata, Hidenori Nakayama, Thuc-Quyen Nguyen

Abstract: Shortwave infrared (SWIR) has various applications, including night vision, remote sensing, and medical imaging. SWIR organic photodetectors (OPDs) offer advantages such as flexibility, cost-effectiveness, and tunable properties, however, lower sensitivity and limited spectral coverage compared to inorganic counterparts are major drawbacks. Here, we propose a simple yet effective and widely applicable strategy to extend the wavelength detection range of OPD to a longer wavelength, using resonant optical microcavity. We demonstrate a proof-of-concept in PTB7-Th:COTIC-4F blend system, achieving external quantum efficiency (EQE) > 50 % over a broad spectrum 450 - 1100 nm with a peak specific detectivity (D*) of 1.1E13 Jones at 1100 nm, while cut-off bandwidth, speed, and linearity are preserved. By employing a novel small-molecule acceptor IR6, a record high EQE = 35 % and D* = 4.1E12 Jones are obtained at 1150 nm. This research emphasizes the importance of optical design in optoelectronic devices, presenting a considerably simpler method to expand the photodetection range compared to a traditional approach that involves developing absorbers with narrow optical gaps.

Submitted 13 September, 2023; originally announced September 2023.

arXiv:2309.01076 [pdf, other]

Federated Few-shot Learning for Cough Classification with Edge Devices

Authors: Ngan Dao Hoang, Dat Tran-Anh, Manh Luong, Cong Tran, Cuong Pham

Abstract: Automatically classifying cough sounds is one of the most critical tasks for the diagnosis and treatment of respiratory diseases. However, collecting a huge amount of labeled cough dataset is challenging mainly due to high laborious expenses, data scarcity, and privacy concerns. In this work, our aim is to develop a framework that can effectively perform cough classification even in situations when enormous cough data is not available, while also addressing privacy concerns. Specifically, we formulate a new problem to tackle these challenges and adopt few-shot learning and federated learning to design a novel framework, termed F2LCough, for solving the newly formulated problem. We illustrate the superiority of our method compared with other approaches on COVID-19 Thermal Face & Cough dataset, in which F2LCough achieves an average F1-Score of 86%. Our results show the feasibility of few-shot learning combined with federated learning to build a classification model of cough sounds. This new methodology is able to classify cough sounds in data-scarce situations and maintain privacy properties. The outcomes of this work can be a fundamental framework for building support systems for the detection and diagnosis of cough-related diseases.

Submitted 3 September, 2023; originally announced September 2023.

Comments: 21 pages, 5 figures

arXiv:2210.05610 [pdf, other]

MTet: Multi-domain Translation for English and Vietnamese

Authors: Chinh Ngo, Trieu H. Trinh, Long Phan, Hieu Tran, Tai Dang, Hieu Nguyen, Minh Nguyen, Minh-Thang Luong

Abstract: We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combining with previous works on English-Vietnamese translation, we grow the existing parallel dataset to 6.2M sentence pairs. We also release the first pretrained model EnViT5 for English and Vietnamese languages. Combining both resources, our model significantly outperforms previous state-of-the-art results by up to 2 points in translation BLEU score, while being 1.6 times smaller.

Submitted 19 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

arXiv:2210.05598 [pdf, other]

Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation

Authors: Long Phan, Tai Dang, Hieu Tran, Trieu H. Trinh, Vy Phan, Lam D. Chau, Minh-Thang Luong

Abstract: Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English such as Vietnamese. In this paper, we make use of a state-of-the-art translation model in English-Vietnamese to translate and produce both pretrained as well as supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubMedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI - a new NLP task in Vietnamese translated from MedNLI using the recently public En-vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.

Submitted 29 January, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

arXiv:2208.04243 [pdf, other]

A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation

Authors: Linh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc Nguyen

Abstract: In this paper, we introduce a high-quality and large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours, consisting of 331K triplets of (sentence-lengthed audio, English source transcript sentence, Vietnamese target subtitle sentence). We also conduct empirical experiments using strong baselines and find that the traditional "Cascaded" approach still outperforms the modern "End-to-End" approach. To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study. We hope both our publicly available dataset and study can serve as a starting point for future research and applications on English-Vietnamese speech translation. Our dataset is available at https://github.com/VinAIResearch/PhoST

Submitted 8 August, 2022; originally announced August 2022.

Comments: In Proceedings of INTERSPEECH 2022, to appear. The first three authors contributed equally to this work

arXiv:2201.10954 [pdf]

Enhanced Simulation of the Indian Summer Monsoon Rainfall Using Regional Climate Modeling and Continuous Data Assimilation

Authors: Srinivas Desamsetti, Hari Prasad Dasari, Sabique Langodan, Yesubabu Viswanadhapalli, Raju Attada, Thang M. Luong, Omar Knio, Edriss S. Titi, Ibrahim Hoteit

Abstract: This study assesses a Continuous Data Assimilation (CDA) dynamical-downscaling algorithm for enhancing the simulation of the Indian summer monsoon (ISM) system. CDA is a mathematically rigorous technique that has been recently introduced to constrain the large-scale features of high-resolution atmospheric models with coarse spatial scale data. It is similar to spectral nudging but does not require any spectral decomposition for scales separation. This is expected to be particularly relevant for ISM, which involves various interactions between large-scale circulations and regional physical processes. Along with a control simulation, several downscaling simulations were conducted with the Weather Research and Forecasting (WRF) model using CDA, spectral (retaining different wavenumbers) and grid nudging for three ISM seasons: normal (2016), excess (2013), and drought (2009). The simulations are nested within the NCEP Final Analysis and the model outputs are evaluated against the observations. Compared to grid and spectral nudging, the simulations using CDA produce enhanced ISM features over the Indian subcontinent including the low-level jet, tropical easterly jet, easterly wind shear, and rainfall distributions for all investigated ISM seasons. The major ISM processes, in particular the monsoon inversion over the Arabian Sea, tropospheric temperature gradients and moist static energy over central India, and zonal wind shear over the monsoon region, are all better simulated with CDA.
Spectral nudging outputs are found to be sensitive to the choice of the wavenumber, requiring careful tuning to provide robust simulations of the ISM system. In contrast, control and grid nudging generally fail to well reproduce some of the main ISM features.

Submitted 26 January, 2022; originally announced January 2022.

Comments: Research Article

arXiv:2111.10050 [pdf, other]

Combined Scaling for Zero-shot Transfer Learning

Authors: Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le

Abstract: We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses best published similar models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN, and 16x larger than CLIP. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536 which is 2x more than CLIP and 4x more than ALIGN. We encountered two main challenges with the scaling rules of BASIC. First, the main challenge with implementing the combined scaling rules of BASIC is the limited memory of accelerators, such as GPUs and TPUs. To overcome the memory limit, we propose two simple methods which make use of gradient checkpointing and model parallelism.
Second, while increasing the dataset size and the model size has been the defacto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastive-trained image-text models is not well-understood. To shed light on the benefits of large contrastive batch sizes, we develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as BASIC.

Submitted 12 April, 2023; v1 submitted 19 November, 2021; originally announced November 2021.

arXiv:2110.03742 [pdf, other]

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

Authors: Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, Orhan Firat

Abstract: Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6x.

Submitted 24 September, 2021; originally announced October 2021.

Comments: EMNLP Findings 2021

arXiv:2109.13675 [pdf, other]

FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis

Authors: Manh Luong, Viet Anh Tran

Abstract: Recently, autoregressive neural vocoders have provided remarkable performance in generating high-fidelity speech and have been able to produce synthetic speech in real-time. However, autoregressive neural vocoders such as WaveFlow are capable of modeling waveform signals from mel-spectrogram, its number of parameters is significant to deploy on edge devices. Though NanoFlow, which has a small number of parameters, is a state-of-the-art autoregressive neural vocoder, the performance of NanoFlow is marginally lower than WaveFlow. Therefore, we propose a new type of autoregressive neural vocoder called FlowVocoder, which has a small memory footprint and is capable of generating high-fidelity audio in real-time. Our proposed model improves the density estimation of flow blocks by utilizing a mixture of Cumulative Distribution Functions (CDF) for bipartite transformation. Hence, the proposed model is capable of modeling waveform signals, while its memory footprint is much smaller than WaveFlow. As shown in experiments, FlowVocoder achieves competitive results with baseline methods in terms of both subjective and objective evaluation, also, it is more suitable for real-time text-to-speech applications.

Submitted 25 March, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

arXiv:2109.06270 [pdf, other]

STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

Authors: Tu Vu, Minh-Thang Luong, Quoc V. Le, Grady Simon, Mohit Iyyer

Abstract: Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.

Submitted 12 April, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: Accepted as a main conference paper at EMNLP 2021, 17 pages, 3 figures, 11 tables

arXiv:2107.06642 [pdf, other]

Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder

Authors: Manh Luong, Viet Anh Tran

Abstract: Voice conversion is a challenging task which transforms the voice characteristics of a source speaker to a target speaker without changing linguistic content. Recently, there have been many works on many-to-many Voice Conversion (VC) based on Variational Autoencoder (VAEs) achieving good results, however, these methods lack the ability to disentangle speaker identity and linguistic content to achieve good performance on unseen speaker scenarios. In this paper, we propose a new method based on feature disentanglement to tackle many to many voice conversion. The method has the capability to disentangle speaker identity and linguistic content from utterances, it can convert from many source speakers to many target speakers with a single autoencoder network. Moreover, it naturally deals with the unseen target speaker scenarios. We perform both objective and subjective evaluations to show the competitive performance of our proposed method compared with other state-of-the-art models in terms of naturalness and target speaker similarity.

Submitted 11 July, 2021; originally announced July 2021.

Journal ref: INTERSPEECH 2021

arXiv:2103.01011 [pdf]

doi 10.1038/s41467-021-22697-w

Sub-second and ppm-level Optical Sensing of Hydrogen Using Templated Control of Nano-hydride Geometry and Composition

Authors: Hoang Mai Luong, Minh Thien Pham, Tyler Guin, Richa Pokharel Madhogaria, Manh-Huong Phan, George K. Larsen, Tho Duc Nguyen

Abstract: The use of hydrogen as a clean and renewable alternative to fossil fuels requires a suite of flammability mitigating technologies, particularly robust sensors for hydrogen leak detection and concentration monitoring. To this end, we have developed a class of lightweight optical hydrogen sensors based on a metasurface of Pd nano-patchy particle arrays, which fulfills the increasing requirements of a safe hydrogen fuel sensing system with no risk of sparking. The structure of the optical sensor is readily nano-engineered to yield extraordinarily rapid response to hydrogen gas (<3 s at 1 mbar H$_{2}$) with a high degree of accuracy (<5%). By incorporating 20% Ag, Au or Co, the sensing performances of the Pd-alloy sensor are significantly enhanced, especially for the Pd$_{80}$Co$_{20}$ sensor whose optical response time at 1 mbar of H$_{2}$ is just ~0.85 s, while preserving the excellent accuracy (<2.5%), limit of detection (2.5 ppm), and robustness against aging, temperature, and interfering gases. The superior performance of our sensor places it among the fastest and most sensitive optical hydrogen sensors.

Submitted 1 March, 2021; originally announced March 2021.

arXiv:2012.08561 [pdf, other]

Pre-Training Transformers as Energy-Based Cloze Models

Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Abstract: We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. We train Electric using an algorithm based on noise-contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre-training method. Electric performs well when transferred to downstream tasks and is particularly effective at producing likelihood scores for text: it re-ranks speech recognition n-best lists better than language models and much faster than masked language models. Furthermore, it offers a clearer and more principled view of what ELECTRA learns during pre-training.

Submitted 15 December, 2020; originally announced December 2020.

Comments: EMNLP 2020

arXiv:2011.04419 [pdf, other]

Towards Domain-Agnostic Contrastive Learning

Authors: Vikas Verma, Minh-Thang Luong, Kenji Kawaguchi, Hieu Pham, Quoc V. Le

Abstract: Despite recent success, most contrastive self-supervised learning methods are domain-specific, relying heavily on data augmentation techniques that require knowledge about a particular domain, such as image cropping and rotation. To overcome such limitation, we propose a novel domain-agnostic approach to contrastive learning, named DACL, that is applicable to domains where invariances, and thus, data augmentation techniques, are not readily available. Key to our approach is the use of Mixup noise to create similar and dissimilar examples by mixing data samples differently either at the input or hidden-state levels. To demonstrate the effectiveness of DACL, we conduct experiments across various domains such as tabular data, images, and graphs. Our results show that DACL not only outperforms other domain-agnostic noising methods, such as Gaussian-noise, but also combines well with domain-specific methods, such as SimCLR, to improve self-supervised visual representation learning. Finally, we theoretically analyze our method and show advantages over the Gaussian-noise based contrastive learning approach.

Submitted 19 July, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: Published in ICML 2021
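
Note: the DACL abstract above describes creating similar examples by mixing data samples. A minimal sketch of input-level Mixup, assuming plain feature vectors, is given below. The function names, the toy batch, and the mixing range (`lam_low`, `lam_high`) are illustrative assumptions and are not taken from the paper.

```python
import random


def mixup(anchor, other, lam):
    """Return lam * anchor + (1 - lam) * other, element-wise."""
    if len(anchor) != len(other):
        raise ValueError("samples must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(anchor, other)]


def make_positive(batch, index, rng, lam_low=0.6, lam_high=0.9):
    """Build a 'similar' view of batch[index] by mixing in another sample.

    lam_low and lam_high are assumed values; keeping lam above 0.5 means
    the anchor dominates the mixture.
    """
    if len(batch) < 2:
        raise ValueError("need at least two samples to mix")
    partner = rng.choice([i for i in range(len(batch)) if i != index])
    lam = rng.uniform(lam_low, lam_high)
    return mixup(batch[index], batch[partner], lam)


# Toy batch of three one-hot feature vectors (hypothetical data).
batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rng = random.Random(0)
view_a = make_positive(batch, 0, rng)  # first view of sample 0
view_b = make_positive(batch, 0, rng)  # second view of sample 0
```

In a contrastive setup, `view_a` and `view_b` would be treated as a similar pair, and views built from other anchors as dissimilar ones. Mixing at the hidden-state level, which the abstract also mentions, would apply the same interpolation to intermediate activations instead of inputs.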

arXiv:2010.00198 [pdf, other]

Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models

Authors: Thai Binh Nguyen, Quang Minh Nguyen, Thi Thu Hien Nguyen, Quoc Truong Do, Chi Mai Luong

Abstract: Studies on the Named Entity Recognition (NER) task have shown outstanding results that reach human parity on input texts with correct text formattings, such as with proper punctuation and capitalization. However, such conditions are not available in applications where the input is speech, because the text is generated from a speech recognition system (ASR), and that the system does not consider the text formatting. In this paper, we (1) presented the first Vietnamese speech dataset for NER task, and (2) the first pre-trained public large-scale monolingual language model for Vietnamese that achieved the new state-of-the-art for the Vietnamese NER task by 1.3% absolute F1 score comparing to the latest study. And finally, (3) we proposed a new pipeline for NER task from speech that overcomes the text formatting problem by introducing a text capitalization and punctuation recovery model (CaPu) into the pipeline. The model takes input text from an ASR system and performs two tasks at the same time, producing proper text formatting that helps to improve NER performance. Experimental results indicated that the CaPu model helps to improve by nearly 4% of F1-score.

Submitted 1 October, 2020; originally announced October 2020.

Comments: Accepted in Interspeech 2020

arXiv:2008.11938 [pdf]

Highly transparent contacts to the 1D hole gas in ultra-scaled Ge/Si core/shell nanowires

Authors: Masiar Sistani, Jovian Delaforce, Roman Kramer, Nicolas Roch, Minh Anh Luong, M. den Hertog, Eric Robin, Jürgen Smoliner, Jun Yao, Charles Lieber, Cécile Naud, Alois Lugstein, Olivier Buisson

Abstract: Semiconductor-superconductor hybrid systems have outstanding potential for emerging high-performance nanoelectronics and quantum devices. However, critical to their successful application is the fabrication of high-quality and reproducible semiconductor-superconductor interfaces. Here, we realize and measure axial Al-Ge-Al nanowire heterostructures with atomically precise interfaces, enwrapped by an ultrathin epitaxial Si layer further denoted as Al-Ge/Si-Al nanowire heterostructures. The heterostructures were synthesized by a thermally induced exchange reaction of single-crystalline Ge/Si core/shell nanowires and lithographically defined Al contact pads. Applying this heterostructure formation scheme enables self-aligned quasi one-dimensional crystalline Al leads contacting ultrascaled Ge/Si segments with contact transparencies greater than 96%. Integration into back-gated field-effect devices and continuous scaling beyond lithographic limitations allows us to exploit the full potential of the highly transparent contacts to the 1D hole gas at the Ge-Si interface. This leads to the observation of ballistic transport as well as quantum confinement effects up to temperatures of 150 K. Low-temperature measurements reveal proximity-induced superconductivity in the Ge/Si core/shell nanowires. The realization of a Josephson field-effect transistor allows us to study the subgap structure caused by multiple Andreev reflections. Most importantly, the absence of a quantum dot regime indicates a hard superconducting gap originating from the highly transparent contacts to the 1D hole gas, which is potentially interesting for the study of Majorana zero modes. Moreover, underlining the importance of the proposed thermally induced Al-Ge/Si-Al heterostructure formation technique, our system could contribute to the development of key components of quantum computing such as gatemon or transmon qubits.

Submitted 27 August, 2020; originally announced August 2020.

Journal ref: ACS Nano, American Chemical Society, 2019, 13 (12), pp.14145-14151

arXiv:2006.08385 [pdf]

doi 10.1021/acsphotonics.0c00557

Plasmon-Driven Hot Electron Transfer at Atomically Sharp Metal-Semiconductor Nanojunctions

Authors: Masiar Sistani, Maximilian G. Bartmann, Nicholas A. Güsken, Rupert F. Oulton, Hamid Keshmiri, Minh Anh Luong, Zahra Sadre-Momtaz, Martien I. den Hertog, Alois Lugstein

Abstract: Recent advances in guiding and localizing light at the nanoscale exposed the enormous potential of ultra-scaled plasmonic devices. In this context, the decay of surface plasmons to hot carriers triggers a variety of applications in boosting the efficiency of energy-harvesting, photo-catalysis and photo-detection. However, a detailed understanding of plasmonic hot carrier generation and particularly the transfer at metal-semiconductor interfaces is still elusive. In this paper, we introduce a monolithic metal-semiconductor (Al-Ge) heterostructure device, providing a platform to examine surface plasmon decay and hot electron transfer at an atomically sharp Schottky nanojunction. The gated metal-semiconductor heterojunction device features electrostatic control of the Schottky barrier height at the Al-Ge interface, enabling hot electron filtering. The ability of momentum matching and to control the energy distribution of plasmon-driven hot electron injection is demonstrated by controlling the interband electron transfer in Ge leading to negative differential resistance.

Submitted 15 June, 2020; originally announced June 2020.

arXiv:2003.10580 [pdf, other]

Meta Pseudo Labels

Authors: Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, Quoc V. Le

Abstract: We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher in Meta Pseudo Labels is constantly adapted by the feedback of the student's performance on the labeled dataset. As a result, the teacher generates better pseudo labels to teach the student. Our code will be available at https://github.com/google-research/google-research/tree/master/meta_pseudo_labels.

Submitted 1 March, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

Comments: Preprint

arXiv:2003.10555 [pdf, other]

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

Submitted 23 March, 2020; originally announced March 2020.

Comments: ICLR 2020

arXiv:2003.02597

AI outperformed every dermatologist: Improved dermoscopic melanoma diagnosis through customizing batch logic and loss function in an optimized Deep CNN architecture

Authors: Cong Tri Pham, Mai Chi Luong, Dung Van Hoang, Antoine Doucet

Abstract: Melanoma, one of the most dangerous types of skin cancer, results in a very high mortality rate. Early detection and resection are two key points for a successful cure. Recent research has used artificial intelligence to classify melanoma and nevus and to compare the assessment of these algorithms to that of dermatologists. However, an imbalance of sensitivity and specificity measures affected the performance of existing models. This study proposes a method using deep convolutional neural networks aiming to detect melanoma as a binary classification problem. It involves 3 key features, namely customized batch logic, customized loss function and reformed fully connected layers. The training dataset is kept up to date including 17,302 images of melanoma and nevus; this is the largest dataset by far. The model performance is compared to that of 157 dermatologists from 12 university hospitals in Germany based on the MClass-D dataset. The model outperformed all 157 dermatologists and achieved state-of-the-art performance with AUC at 94.4% with sensitivity of 85.0% and specificity of 95.0% using a prediction threshold of 0.5 on the MClass-D dataset of 100 dermoscopic images. Moreover, a threshold of 0.40858 showed the most balanced measure compared to other studies, and is promisingly applicable to medical diagnosis, with sensitivity of 90.0% and specificity of 93.8%.

Submitted 28 August, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: We are submitting the article in the journal and waiting for the review result, so we want to temporarily delete the article. When the article is officially accepted, it will be resubmitted

arXiv:2002.02373 [pdf]

Reversible Al Propagation in Si$_x$Ge$_{1-x}$ Nanowires

Authors: Minh Anh Luong, Robin Eric, Pauc Nicolas, Gentile Pascal, Baron Thierry, Salem Bassem, Sistani Masiar, Lugstein Alois, Spies Maria, Fernandez Bruno, M. den Hertog

Abstract: While reversibility is a fundamental concept in thermodynamics, most reactions are not readily reversible, especially in solid state physics. For example, thermal diffusion is a widely known concept, used among others to inject dopant atoms into the substitutional positions in the matrix and improve the device properties. Typically, such a diffusion process will create a concentration gradient extending over increasingly large regions, without possibility to reverse this effect. On the other hand, while the bottom up growth of semiconducting nanowires is interesting, it can still be difficult to fabricate axial heterostructures with high control. In this paper, we report a reversible thermal diffusion process occurring in the solid-state exchange reaction between an Al metal pad and a Si$_x$Ge$_{1-x}$ alloy nanowire observed by in-situ transmission electron microscopy. The thermally assisted reaction results in the creation of a Si-rich region sandwiched between the reacted Al and unreacted Si$_x$Ge$_{1-x}$ part, forming an axial Al/Si/Si$_x$Ge$_{1-x}$ heterostructure. Upon heating or (slow) cooling, the Al metal can repeatably move in and out of the Si$_x$Ge$_{1-x}$ alloy nanowire while maintaining the rod-like geometry and crystallinity, allowing to fabricate and contact nanowire heterostructures in a reversible way in a single process step, compatible with current Si based technology. This interesting system is promising for various applications, such as phase change memories in an all crystalline system with integrated contacts, as well as Si/Si$_x$Ge$_{1-x}$/Si heterostructures for near-infrared sensing applications.

Submitted 6 February, 2020; originally announced February 2020.

arXiv:2001.09977 [pdf, other]

Towards a Human-like Open-Domain Chatbot

Authors: Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le

Abstract: We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.

Submitted 27 February, 2020; v1 submitted 27 January, 2020; originally announced January 2020.

Comments: 38 pages, 12 figures

arXiv:2001.09179 [pdf, other]

doi 10.1088/1361-6528/ab99f0

Correlated and in-situ electrical transmission electron microscopy studies and related membrane fabrication

Authors: Maria Spies, Zahra Sadre-Momtaz, Jonas Lähnemann, Minh Anh Luong, Bruno Fernandez, Thierry Fournier, Eva Monroy, Martien I. den Hertog

Abstract: Understanding the interplay between the structure, composition and opto-electronic properties of semiconductor nano-objects requires combining transmission electron microscopy (TEM) based techniques with electrical and optical measurements on the very same specimen. Recent developments in TEM technologies allow not only the identification and in-situ electrical characterization of a particular object, but also the direct visualization of its modification in-situ by techniques such as Joule heating. Over the past years, we have carried out a number of studies in these fields that are reviewed in this contribution. In particular, we discuss here i) correlated studies where the same unique object is characterized electro-optically and by TEM, ii) in-situ Joule heating studies where a solid-state metal-semiconductor reaction is monitored in the TEM, and iii) in-situ biasing studies to better understand the electrical properties of contacted single nanowires. In addition, we provide detailed fabrication steps for the silicon nitride membranes crucial to these correlated and in-situ measurements.

Submitted 2 December, 2021; v1 submitted 24 January, 2020; originally announced January 2020.

Comments: This is an author-created, un-copyedited version of a topical review published in Nanotechnology. IOP Publishing Ltd. is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. The Version of Record is available online at https://doi.org/10.1088/1361-6528/ab99f0

Journal ref: Nanotechnology 31, 472001 (2020)

arXiv:2001.09026 [pdf]

In-situ high resolution TEM observation of Aluminum solid-state diffusion in Germanium nanowires: fabricating sub-10 nm Ge quantum dots

Authors: M. Luong, E. Robin, N. Pauc, P. Gentile, M. Sistani, A. Lugstein, M Spies, B Fernandez, M. den Hertog

Abstract: The thermally activated solid-state reaction in aluminum-germanium nanowires (NWs) is a promising system, as very sharp and well-defined one-dimensional contacts can be created between a metal and a semiconductor, which can become a quantum dot if the size becomes sufficiently small. In the search for high performance devices without variability, it is of high interest to allow deterministic fabrication of nanowire quantum dots, avoiding sample variability and obtaining atomic scale precision on the fabricated dot size. In this paper, we present a reliable fabrication process to produce sub-10 nm Ge quantum dots (QDs), using a combination of ex-situ thermal annealing via rapid thermal annealing (RTA) and in-situ Joule heating technique in a transmission electron microscope (TEM). First we present in-situ direct Joule heating experiments showing how the heating electrode could be damaged due to the formation of Al crystals and voids in the vicinity of the metal/NW contact, likely related to electro-migration phenomena. We show that the contact quality can be preserved by including an additional ex-situ RTA step prior to the in-situ heating. The in-situ observations also show in real-time how the exchange reaction initiates simultaneously from several locations underneath the Al contact pad, and the Al crystal grows gradually inside the initial Ge NW with the growth interface along a Ge(111) lattice plane. Once the reaction front moves out from underneath the contact metal, two factors jeopardize an atomically accurate control of the Al/Ge reaction interface. We observed a local acceleration of the reaction interface due to the electron beam irradiation in the transmission electron microscope as well as the appearance of large jumps of the interface in unpassivated Ge wires while a smooth advancement of the reaction interface was observed in wires with an Al2O3 protecting shell on the surface. Carefully controlling all aspects of the exchange reaction, we demonstrate a fabrication process combining ex-situ and in-situ heating techniques to precisely control and produce axial Al/Ge/Al NW heterostructures with an ultra-short Ge segment down to 8 nanometers. Practically, the scaling down of Ge segment length is only limited by the microscope resolution.

Submitted 24 January, 2020; originally announced January 2020.

arXiv:1911.08117 [pdf, ps, other]

A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

Authors: Minh-Thang Luong, Preslav Nakov, Min-Yen Kan

Abstract: We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.

Submitted 19 November, 2019; originally announced November 2019.

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: EMNLP-2010

arXiv:1911.04252 [pdf, other]

Self-training with Noisy Student improves ImageNet classification

Authors: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le

Abstract: We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. Code is available at https://github.com/google-research/noisystudent.

Submitted 19 June, 2020; v1 submitted 11 November, 2019; originally announced November 2019.

Comments: CVPR 2020

arXiv:1910.13299 [pdf, other]

Findings of the Third Workshop on Neural Generation and Translation

Authors: Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, Katsuhito Sudoh

Abstract: This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document-level generation and translation (DGT) where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language.

Submitted 29 October, 2019; v1 submitted 29 October, 2019; originally announced October 2019.

Comments: Fixed the metadata (author list)

arXiv:1907.04829 [pdf, other]

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

Authors: Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, Quoc V. Le

Abstract: It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.

Submitted 10 July, 2019; originally announced July 2019.

Comments: ACL 2019

arXiv:1906.02940 [pdf, other]

Selfie: Self-supervised Pretraining for Image Embedding

Authors: Trieu H. Trinh, Minh-Thang Luong, Quoc V. Le

Abstract: We introduce a pretraining technique called Selfie, which stands for SELFie supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This classification objective sidesteps the need for predicting exact pixel values of the target patches. The pretraining architecture of Selfie includes a network of convolutional blocks to process patches followed by an attention pooling network to summarize the content of unmasked patches before predicting masked ones. During finetuning, we reuse the convolutional weights found by pretraining. We evaluate Selfie on three benchmarks (CIFAR-10, ImageNet 32 x 32, and ImageNet 224 x 224) with varying amounts of labeled data, from 5% to 100% of the training sets. Our pretraining method provides consistent improvements to ResNet-50 across all settings compared to the standard supervised training of the same network. Notably, on ImageNet 224 x 224 with 60 examples per class (5%), our method improves the mean accuracy of ResNet-50 from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy. Our pretraining method also improves ResNet-50 training stability, especially on low data regime, by significantly lowering the standard deviation of test accuracies across different runs.

Submitted 27 July, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

arXiv:1904.12848 [pdf, other]

Unsupervised Data Augmentation for Consistency Training

Authors: Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le

Abstract: Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in high-data regime, such as ImageNet, whether when there is only 10% labeled data or when a full labeled set with 1.3M extra unlabeled examples is used. Code is available at https://github.com/google-research/uda.

Submitted 5 November, 2020; v1 submitted 29 April, 2019; originally announced April 2019.

Comments: NeurIPS 2020

arXiv:1902.06625 [pdf]

Magnetically Tunable Organic Semiconductors with Superparamagnetic Nanoparticles

Authors: Rugang Geng, Hoang Mai Luong, Raja Das, Kristen Stojak, Minh Thien Pham, Joshua Robles-Garcia, Tuan Anh Duong, Huy Thanh Pham, Thi Huong Au, Ngoc Diep Lai, George K. Larsen, Manh-Huong Phan, Tho Duc Nguyen

Abstract: Magnetic nanoparticles (MNPs) exhibiting superparamagnetic properties might generate large magnetic dipole-dipole interaction with electron spins in organic semiconductors (OSECs). This concept could be considered analogous to the effect of hyperfine interaction (HFI). In order to investigate this model, Fe3O4 MNPs are used as a dopant for generating random hyperfine-like magnetic fields in a HFI-dominant π-conjugated polymer host, poly(2-methoxy-5-(2-ethylhexyloxy)-1,4-phenylenevinylene) (MeH-PPV). The magnetoconductance (MC) response in organic light emitting diodes made by MeH-PPV/MNP blends is used to estimate the effective hyperfine field in the blends. Firstly, we find that the shape of the MC response essentially remains the same regardless of the MNP concentration, which is attributed to the similar functionality between the nuclear spins and the MNPs. Secondly, the width of MC increases with increasing MNP concentration. Magneto-optical Kerr effect experiments and micromagnetic simulation indicate that the additional increase of the MC width is associated with the strength of the magnetization of the blend. Finally, the MC broadening has the same temperature dependent trend as the magnetization of the MNPs where the unique effect of the MNPs in their superparamagnetic and ferromagnetic regimes on the MC response is observed. Magneto-photoinduced absorption (MPA) spectroscopy confirms that the MC broadening is not due to defects introduced by the MNPs, but is a result of unique superparamagnetic behavior. Our study yields a new pathway for tuning OSECs' magnetic functionality, which is essential to organic optoelectronic devices and magnetic sensor applications.

Submitted 11 June, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

arXiv:1809.08370 [pdf, other]

Semi-Supervised Sequence Modeling with Cross-View Training

Authors: Kevin Clark, Minh-Thang Luong, Christopher D. Manning, Quoc V. Le

Abstract: Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.

Submitted 21 September, 2018; originally announced September 2018.

Comments: EMNLP 2018

arXiv:1809.07070 [pdf, other]

Latent Topic Conversational Models

Authors: Tsung-Hsien Wen, Minh-Thang Luong

Abstract: Latent variable models have been a preferred choice in conversational modeling compared to sequence-to-sequence (seq2seq) models, which tend to generate generic and repetitive responses. Despite this, training latent variable models remains difficult. In this paper, we propose the Latent Topic Conversational Model (LTCM), which augments seq2seq with a neural latent topic component to better guide response generation and make training easier. The neural topic component encodes information from the source sentence to build a global "topic" distribution over words, which is then consulted by the seq2seq model at each generation step. We study in detail how the latent representation is learnt in both the vanilla model and LTCM. Our extensive experiments contribute to better understanding and training of conditional latent models for languages. Our results show that by sampling from the learnt latent representations, LTCM can generate diverse and interesting responses. In a subjective human evaluation, the judges also confirm that LTCM is the overall preferred option.

Submitted 19 September, 2018; originally announced September 2018.

arXiv:1806.02940 [pdf, other]

Findings of the Second Workshop on Neural Machine Translation and Generation

Authors: Alexandra Birch, Andrew Finch, Minh-Thang Luong, Graham Neubig, Yusuke Oda

Abstract: This document describes the findings of the Second Workshop on Neural Machine Translation and Generation, held in concert with the annual conference of the Association for Computational Linguistics (ACL 2018). First, we summarize the research trends of papers presented in the proceedings, and note that there is particular interest in linguistic structure, domain adaptation, data augmentation, handling inadequate resources, and analysis of models. Second, we describe the results of the workshop's shared task on efficient neural machine translation, where participants were tasked with creating MT systems that are both accurate and efficient.

Submitted 18 June, 2018; v1 submitted 7 June, 2018; originally announced June 2018.

Comments: WNMT 2018

arXiv:1804.09541 [pdf, other]

QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

Authors: Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, Quoc V. Le

Abstract: Current end-to-end machine reading and question answering (Q&A) models are primarily based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often slow for both training and inference due to the sequential nature of RNNs. We propose a new Q&A architecture called QANet, which does not require recurrent networks: Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models. The speed-up gain allows us to train the model with much more data. We hence combine our model with data generated by backtranslation from a neural machine translation model. On the SQuAD dataset, our single model, trained with augmented data, achieves 84.6 F1 score on the test set, which is significantly better than the best published F1 score of 81.8.

Submitted 23 April, 2018; originally announced April 2018.

Comments: Published as full paper in ICLR 2018

arXiv:1803.00144 [pdf, ps, other]

Learning Longer-term Dependencies in RNNs with Auxiliary Losses

Authors: Trieu H. Trinh, Andrew M. Dai, Minh-Thang Luong, Quoc V. Le

Abstract: Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long-term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16,000, and a real document classification benchmark. Our results highlight good performance and resource efficiency of this approach over competitive baselines, including other recurrent models and a comparable sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation.

Submitted 13 June, 2018; v1 submitted 28 February, 2018; originally announced March 2018.

Comments: ICML 2018

arXiv:1711.11373 [pdf, ps, other]

Demonstration of a 2x2 programmable phase plate for electrons

Authors: Jo Verbeeck, Armand Béché, Knut Müller-Caspary, Giulio Guzzinati, Minh Anh Luong, Martien Den Hertog

Abstract: First results on the experimental realisation of a 2x2 programmable phase plate for electrons are presented. The design consists of an array of electrostatic einzel lenses that influence the phase of electron waves passing through 4 separately controllable aperture holes. This functionality is demonstrated in a conventional transmission electron microscope operating at 300 kV and results are in very close agreement with theoretical predictions. The dynamic creation of a set of electron probes with different phase symmetry is demonstrated, thereby bringing adaptive optics in TEM one step closer to reality. The limitations of the current design and how to overcome these in the future are discussed. Simulations show how further evolved versions of the current proof of concept might open new and exciting application prospects for beam shaping and aberration correction.

Submitted 30 November, 2017; originally announced November 2017.

arXiv:1710.02076 [pdf, other]

On the Effective Use of Pretraining for Natural Language Inference

Authors: Ignacio Cases, Minh-Thang Luong, Christopher Potts

Abstract: Neural networks have excelled at many NLP tasks, but there remain open questions about the performance of pretrained distributed word representations and their interaction with weight initialization and other hyperparameters. We address these questions empirically using attention-based sequence-to-sequence models for natural language inference (NLI). Specifically, we compare three types of embeddings: random, pretrained (GloVe, word2vec), and retrofitted (pretrained plus WordNet information). We show that pretrained embeddings outperform both random and retrofitted ones in a large NLI corpus. Further experiments on more controlled data sets shed light on the contexts for which retrofitted embeddings can be useful. We also explore two principled approaches to initializing the rest of the model parameters, Gaussian and orthogonal, showing that the latter yields gains of up to 2.9% in the NLI task.

Submitted 5 October, 2017; originally announced October 2017.

Comments: This manuscript dates from late Winter 2016

arXiv:1707.00110 [pdf, other]

Efficient Attention using a Fixed-Size Memory Representation

Authors: Denny Britz, Melody Y. Guan, Minh-Thang Luong

Abstract: The standard content-based attention mechanism typically used in sequence-to-sequence models is computationally expensive as it requires the comparison of large encoder and decoder states at each time step. In this work, we propose an alternative attention mechanism based on a fixed size memory representation that is more efficient. Our technique predicts a compact set of K attention contexts during encoding and lets the decoder compute an efficient lookup that does not need to consult the memory. We show that our approach performs on-par with the standard attention mechanism while yielding inference speedups of 20% for real-world translation tasks and more for tasks with longer sequences. By visualizing attention scores we demonstrate that our models learn distinct, meaningful alignments.

Submitted 1 July, 2017; originally announced July 2017.

Comments: EMNLP 2017

arXiv:1704.00784 [pdf, other]

Online and Linear-Time Attention by Enforcing Monotonic Alignments

Authors: Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck

Abstract: Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.

Submitted 29 June, 2017; v1 submitted 3 April, 2017; originally announced April 2017.

Comments: ICML camera-ready version; 10 pages + 9 page appendix

arXiv:1703.03906 [pdf, other]

Massive Exploration of Neural Machine Translation Architectures

Authors: Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le

Abstract: Neural Machine Translation (NMT) has shown remarkable progress over the past few years with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically requiring days to weeks of GPU time to converge. This makes exhaustive hyperparameter search, as is commonly done with other neural network architectures, prohibitively expensive. In this work, we present the first large-scale analysis of NMT architecture hyperparameters. We report empirical results and variance numbers for several hundred experimental runs, corresponding to over 250,000 GPU hours on the standard WMT English to German translation task. Our experiments lead to novel insights and practical advice for building and extending NMT architectures. As part of this contribution, we release an open-source NMT framework that enables researchers to easily experiment with novel techniques and reproduce state of the art results.

Submitted 21 March, 2017; v1 submitted 10 March, 2017; originally announced March 2017.

Comments: 9 pages, 2 figures, 8 tables, submitted to ACL 2017, open source code at https://github.com/google/seq2seq/

arXiv:1606.09274 [pdf, other]

Compression of Neural Machine Translation Models via Pruning

Authors: Abigail See, Minh-Thang Luong, Christopher D. Manning

Abstract: Neural Machine Translation (NMT), like many other deep learning domains, typically suffers from over-parameterization, resulting in large storage sizes. This paper examines three simple magnitude-based pruning schemes to compress NMT models, namely class-blind, class-uniform, and class-distribution, which differ in terms of how pruning thresholds are computed for the different classes of weights in the NMT architecture. We demonstrate the efficacy of weight pruning as a compression technique for a state-of-the-art NMT system. We show that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task. This sheds light on the distribution of redundancy in the NMT architecture. Our main result is that with retraining, we can recover and even surpass the original performance with an 80%-pruned model.

Submitted 29 June, 2016; originally announced June 2016.

Comments: Accepted to CoNLL 2016. 9 pages plus references

arXiv:1604.00788 [pdf, ps, other]

Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models

Authors: Minh-Thang Luong, Christopher D. Manning

Abstract: Nearly all previous work on neural machine translation (NMT) has used quite restricted vocabularies, perhaps with a subsequent method to patch in unknown words. This paper presents a novel word-character solution to achieving open vocabulary NMT. We build hybrid systems that translate mostly at the word level and consult the character components for rare words. Our character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The twofold advantage of such a hybrid approach is that it is much faster and easier to train than character-based ones; at the same time, it never produces unknown words as in the case of word-based models. On the WMT'15 English to Czech translation task, this hybrid approach offers an additional boost of +2.1-11.4 BLEU points over models that already handle unknown words. Our best system achieves a new state-of-the-art result with 20.7 BLEU score. We demonstrate that our character models can successfully learn to not only generate well-formed words for Czech, a highly-inflected language with a very complex vocabulary, but also build correct representations for English source words.

Submitted 22 June, 2016; v1 submitted 4 April, 2016; originally announced April 2016.

Comments: 11 pages, 4 figures. ACL 2016 camera-ready version. SOTA WMT'15 English-Czech 20.7 BLEU (+2.1-11.4 points)

arXiv:1511.06114 [pdf, ps, other]

Multi-task Sequence to Sequence Learning

Authors: Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser

Abstract: Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the one-to-many setting - where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting - useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting - where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation. Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.

Submitted 1 March, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

Comments: 10 pages, 4 figures, ICLR 2016 camera-ready, added parsing SOTA results

arXiv:1508.04025 [pdf, ps, other]

Effective Approaches to Attention-based Neural Machine Translation

Authors: Minh-Thang Luong, Hieu Pham, Christopher D. Manning

Abstract: An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.

Submitted 20 September, 2015; v1 submitted 17 August, 2015; originally announced August 2015.

Comments: 11 pages, 7 figures, EMNLP 2015 camera-ready version, more training details

arXiv:1507.01398 [pdf]

doi 10.1063/1.4929783

Nuclear spin noise in NMR revisited

Authors: Guillaume Ferrand, Gaspard Huber, Michel Luong, Hervé Desvaux

Abstract: The theoretical shapes of nuclear spin-noise spectra in NMR are derived by considering a receiver circuit with finite preamplifier input impedance and a transmission line between the preamplifier and the probe. Using this model, it becomes possible to reproduce all observed experimental features: variation of the NMR resonance linewidth as a function of the transmission line phase, nuclear spin-noise signals appearing as a "bump" or as a "dip" superimposed on the average electronic noise level even for a spin system and probe at the same temperature, and pure in-phase Lorentzian spin-noise signals exhibiting non-vanishing frequency shifts. Extensive comparison to experimental measurements validates the model predictions and defines the conditions for obtaining pure in-phase Lorentzian-shape nuclear spin noise with a vanishing frequency shift, in other words, the conditions for simultaneously obtaining the Spin-Noise and Frequency-Shift Tuning Optima.

Submitted 22 August, 2015; v1 submitted 6 July, 2015; originally announced July 2015.

arXiv:1506.01057 [pdf, other]

A Hierarchical Neural Autoencoder for Paragraphs and Documents

Authors: Jiwei Li, Minh-Thang Luong, Dan Jurafsky

Abstract: Natural language generation of coherent long texts like paragraphs or longer documents is a challenging problem for recurrent network models. In this paper, we explore an important step toward this generation task: training an LSTM (long short-term memory) auto-encoder to preserve and reconstruct multi-sentence paragraphs. We introduce an LSTM model that hierarchically builds an embedding for a paragraph from embeddings for sentences and words, then decodes this embedding to reconstruct the original paragraph. We evaluate the reconstructed paragraph using standard metrics like ROUGE and Entity Grid, showing that neural models are able to encode texts in a way that preserves syntactic, semantic, and discourse coherence. While only a first step toward generating coherent text units from neural models, our work has the potential to significantly impact natural language generation and summarization. Code for the three models described in this paper can be found at www.stanford.edu/~jiweil/.

Submitted 5 June, 2015; v1 submitted 2 June, 2015; originally announced June 2015.

arXiv:1503.00185 [pdf, other]

When Are Tree Structures Necessary for Deep Learning of Representations?

Authors: Jiwei Li, Minh-Thang Luong, Dan Jurafsky, Eduard Hovy

Abstract: Recursive neural models, which use syntactic parse trees to recursively generate representations bottom-up, are a popular architecture. But there have not been rigorous evaluations showing for exactly which tasks this syntax-based method is appropriate. In this paper we benchmark recursive neural models against sequential recurrent neural models (simple recurrent and LSTM models), enforcing apples-to-apples comparison as much as possible. We investigate 4 tasks: (1) sentiment classification at the sentence level and phrase level; (2) matching questions to answer-phrases; (3) discourse parsing; (4) semantic relation extraction (e.g., component-whole between nouns). Our goal is to understand better when, and why, recursive models can outperform simpler models. We find that recursive models help mainly on tasks (like semantic relation extraction) that require associating headwords across a long distance, particularly on very long sequences. We then introduce a method for allowing recurrent models to achieve similar performance: breaking long sentences into clause-like units at punctuation and processing them separately before combining. Our results thus help understand the limitations of both classes of models, and suggest directions for improving recurrent models.

Submitted 18 August, 2015; v1 submitted 28 February, 2015; originally announced March 2015.

Showing 1–50 of 56 results for author: Luong, M