-
Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition
Authors:
Hao Yen,
Pin-Jui Ku,
Sabato Marco Siniscalchi,
Chin-Hui Lee
Abstract:
We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR) that leverages (i) a self-supervised pre-trained model and (ii) a set of universal speech attributes (manner and place of articulation). Specifically, Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences. A non-trainable pronunciation model then maps sequences of attributes into spoken keywords in a multilingual setting. Experiments on the Multilingual Spoken Words Corpus show performance comparable to character- and phoneme-based SKR in seen languages. The inclusion of domain adversarial training (DAT) improves the proposed framework, outperforming both character- and phoneme-based SKR approaches with 13.73% and 17.22% relative word error rate (WER) reduction in seen languages, and achieves 32.14% and 19.92% WER reduction for unseen languages in zero-shot settings.
Submitted 4 June, 2024;
originally announced June 2024.
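The abstract above reports results as relative WER reductions. As a minimal sketch of that arithmetic, the helper below computes a relative reduction from a baseline and an improved WER. The function name and the WER values are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical numbers, not taken from the paper.
reduction = relative_wer_reduction(baseline_wer=20.0, new_wer=17.0)
print(f"relative WER reduction: {reduction:.2f}%")
```

A relative reduction is measured against the baseline, so the same absolute gain counts for more when the baseline WER is already low.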
-
Self-Labeling in Multivariate Causality and Quantification for Adaptive Machine Learning
Authors:
Yutian Ren,
Aaron Haohua Yen,
G. P. Li
Abstract:
Adaptive machine learning (ML) aims to allow ML models to adapt to ever-changing environments with potential concept drift after model deployment. Traditionally, adaptive ML requires a new dataset to be manually labeled to tailor deployed models to altered data distributions. Recently, an interactive causality based self-labeling method was proposed to autonomously associate causally related data streams for domain adaptation, showing promising results compared to traditional feature similarity-based semi-supervised learning. Several unanswered research questions remain, including self-labeling's compatibility with multivariate causality and the quantitative analysis of the auxiliary models used in the self-labeling. The auxiliary models, the interaction time model (ITM) and the effect state detector (ESD), are vital to the success of self-labeling. This paper further develops the self-labeling framework and its theoretical foundations to address these research questions. A framework for the application of self-labeling to multivariate causal graphs is proposed using four basic causal relationships, and the impact of non-ideal ITM and ESD performance is analyzed. A simulated experiment is conducted based on a multivariate causal graph, validating the proposed theory.
Submitted 8 April, 2024;
originally announced April 2024.
-
Long-Context Language Modeling with Parallel Context Encoding
Authors:
Howard Yen,
Tianyu Gao,
Danqi Chen
Abstract:
Extending large language models (LLMs) to process longer inputs is crucial for a wide range of applications. However, the substantial computational cost of transformers and limited generalization of positional encoding restrict the size of their context window. We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLMs to extend their context window. CEPE employs a small encoder to process long inputs chunk by chunk, enabling the frozen decoder to utilize additional contexts via cross-attention. CEPE is efficient, generalizable, and versatile: trained with 8K-token documents, it extends the context window of LLAMA-2 to 128K tokens, offering 10x the throughput with only 1/6 of the memory. CEPE yields strong performance on language modeling and in-context learning. CEPE also excels in retrieval-augmented applications, while existing long-context models degenerate with retrieved contexts. We further introduce a CEPE variant that can extend the context window of instruction-tuned models using only unlabeled data, and showcase its effectiveness on LLAMA-2-CHAT, leading to a strong instruction-following model that can leverage very long contexts on downstream tasks.
Submitted 11 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Improving Interpersonal Communication by Simulating Audiences with Language Models
Authors:
Ryan Liu,
Howard Yen,
Raja Marjieh,
Thomas L. Griffiths,
Ranjay Krishna
Abstract:
How do we communicate with others to achieve our goals? We use our prior experience or advice from others, or construct a candidate utterance by predicting how it will be received. However, our experiences are limited and biased, and reasoning about potential outcomes can be difficult and cognitively challenging. In this paper, we explore how we can leverage Large Language Model (LLM) simulations to help us communicate better. We propose the Explore-Generate-Simulate (EGS) framework, which takes as input any scenario where an individual is communicating to an audience with a goal they want to achieve. EGS (1) explores the solution space by producing a diverse set of advice relevant to the scenario, (2) generates communication candidates conditioned on subsets of the advice, and (3) simulates the reactions from various audiences to determine both the best candidate and advice to use. We evaluate the framework on eight scenarios spanning the ten fundamental processes of interpersonal communication. For each scenario, we collect a dataset of human evaluations across candidates and baselines, and showcase that our framework's chosen candidate is preferred over popular generation mechanisms including Chain-of-Thought. We also find that audience simulations achieve reasonably high agreement with human raters across 5 of the 8 scenarios. Finally, we demonstrate the generality of our framework by applying it to real-world scenarios described by users on web forums. Through evaluations and demonstrations, we show that EGS enhances the effectiveness and outcomes of goal-oriented communication across a variety of situations, thus opening up new possibilities for the application of large language models in revolutionizing communication and decision-making processes.
Submitted 3 November, 2023; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints
Authors:
Hao Yen,
Sabato Marco Siniscalchi,
Chin-Hui Lee
Abstract:
We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally" across all spoken languages, referred to as speech attributes, namely manner and place of articulation. Specifically, several deterministic attribute-to-phoneme mapping matrices are constructed based on a predefined universal attribute inventory; these project the knowledge-rich articulatory attribute logits into output phoneme logits. The mapping imposes knowledge-based constraints that limit inconsistency with acoustic-phonetic evidence in the integrated prediction. Combined with phoneme recognition, our phone recognizer is able to infer from both attribute and phoneme information. The proposed joint multilingual model is evaluated through phoneme recognition. In multilingual experiments over 6 languages on the benchmark datasets LibriSpeech and CommonVoice, we find that our proposed solution outperforms conventional multilingual approaches with a relative improvement of 6.85% on average, and it also performs much better than a monolingual model. Further analysis conclusively demonstrates that the proposed solution eliminates phoneme predictions that are inconsistent with attributes.
Submitted 15 September, 2023;
originally announced September 2023.
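The abstract above describes deterministic attribute-to-phoneme mapping matrices that project attribute logits into phoneme logits. The sketch below illustrates that idea under stated assumptions: the tiny attribute and phoneme inventory is hypothetical, the matrices are binary, and the manner and place projections are simply summed. The paper's actual inventory and combination rule may differ.

```python
# Hypothetical toy inventory: each phoneme has one manner and one place.
MANNERS = ["stop", "nasal", "fricative"]
PLACES = ["bilabial", "alveolar"]
PHONEMES = {
    "p": ("stop", "bilabial"),
    "t": ("stop", "alveolar"),
    "m": ("nasal", "bilabial"),
    "n": ("nasal", "alveolar"),
    "s": ("fricative", "alveolar"),
}


def build_mapping(attributes, column):
    """Binary matrix (attributes x phonemes): 1.0 where the phoneme has the attribute."""
    return [
        [1.0 if feats[column] == attr else 0.0 for feats in PHONEMES.values()]
        for attr in attributes
    ]


def project(logits, mapping):
    """Vector-matrix product: attribute logits -> per-phoneme scores."""
    n_phonemes = len(mapping[0])
    return [
        sum(logits[i] * mapping[i][j] for i in range(len(logits)))
        for j in range(n_phonemes)
    ]


def phoneme_logits(manner_logits, place_logits):
    """Combine both projections by summation (an assumption of this sketch)."""
    manner_part = project(manner_logits, build_mapping(MANNERS, 0))
    place_part = project(place_logits, build_mapping(PLACES, 1))
    return {ph: manner_part[j] + place_part[j] for j, ph in enumerate(PHONEMES)}


# Attribute logits favouring "nasal" and "bilabial" should favour /m/.
scores = phoneme_logits(manner_logits=[0.2, 2.0, -1.0], place_logits=[1.5, -0.5])
best = max(scores, key=scores.get)
print(best, scores)
```

Because the matrices are fixed rather than learned, a phoneme can score highly only when its own manner and place attributes do, which is the kind of consistency constraint the abstract refers to.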
-
Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction
Authors:
Su-Kai Chen,
Hung-Lin Yen,
Yu-Lun Liu,
Min-Hung Chen,
Hou-Ning Hu,
Wen-Hsiao Peng,
Yen-Yu Lin
Abstract:
Deep learning is commonly used to reconstruct HDR images from LDR images. LDR stack-based methods are used for single-image HDR reconstruction, generating an HDR image from a deep learning-generated LDR stack. However, current methods generate the stack with predetermined exposure values (EVs), which may limit the quality of HDR reconstruction. To address this, we propose the continuous exposure value representation (CEVR), which uses an implicit function to generate LDR images with arbitrary EVs, including those unseen during training. Our approach generates a continuous stack with more images containing diverse EVs, significantly improving HDR reconstruction. We use a cycle training strategy to supervise the model in generating continuous EV LDR images without corresponding ground truths. Our CEVR model outperforms existing methods, as demonstrated by experimental results.
Submitted 7 September, 2023;
originally announced September 2023.
-
Enabling Large Language Models to Generate Text with Citations
Authors:
Tianyu Gao,
Howard Yen,
Jiatong Yu,
Danqi Chen
Abstract:
Large language models (LLMs) have emerged as a widely used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions (fluency, correctness, and citation quality) and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement: for example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.
Submitted 31 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Wizundry: A Cooperative Wizard of Oz Platform for Simulating Future Speech-based Interfaces with Multiple Wizards
Authors:
Siying Hu,
Hen Chen Yen,
Ziwei Yu,
Mingjian Zhao,
Katie Seaborn,
Can Liu
Abstract:
Wizard of Oz (WoZ) as a prototyping method has been used to simulate intelligent user interfaces, particularly for speech-based systems. However, as society's expectations of artificial intelligence (AI) grow, optimistic visions of 'what AI can do' place demands on WoZ platforms to simulate smarter systems and more complex interactions. This raises the question of whether the typical approach of employing a single Wizard is sufficient. Moreover, while existing work has employed multiple Wizards in WoZ studies, a multi-Wizard approach has not been systematically studied in terms of feasibility, effectiveness, and challenges. We offer Wizundry, a real-time, web-based WoZ platform that allows multiple Wizards to collaboratively operate a speech-to-text based system remotely. We outline the design and technical specifications of our open-source platform, which we iterated over two design phases. We report on two studies in which participant-Wizards were tasked with negotiating how to cooperatively simulate an interface that can handle natural speech for dictation and text editing as well as other intelligent text processing tasks. We offer qualitative findings on the multi-Wizard experience for Dyads and Triads of Wizards. Our findings reveal the promises and challenges of the multi-Wizard approach and open up new research questions.
Submitted 17 April, 2023;
originally announced April 2023.
-
Cold Diffusion for Speech Enhancement
Authors:
Hao Yen,
François G. Germain,
Gordon Wichern,
Jonathan Le Roux
Abstract:
Diffusion models have recently shown promising results for difficult enhancement tasks such as the conditional and unconditional restoration of natural images and audio signals. In this work, we explore the possibility of leveraging a recently proposed advanced iterative diffusion model, namely cold diffusion, to recover clean speech signals from noisy signals. The unique mathematical properties of the sampling process from cold diffusion could be utilized to restore high-quality samples from arbitrary degradations. Based on these properties, we propose an improved training algorithm and objective to help the model generalize better during the sampling process. We verify our proposed framework by investigating two model architectures. Experimental results on benchmark speech enhancement dataset VoiceBank-DEMAND demonstrate the strong performance of the proposed approach compared to representative discriminative models and diffusion-based enhancement models.
Submitted 23 May, 2023; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Improvements to Embedding-Matching Acoustic-to-Word ASR Using Multiple-Hypothesis Pronunciation-Based Embeddings
Authors:
Hao Yen,
Woojay Jeon
Abstract:
In embedding-matching acoustic-to-word (A2W) ASR, every word in the vocabulary is represented by a fixed-dimension embedding vector that can be added or removed independently of the rest of the system. The approach is potentially an elegant solution for the dynamic out-of-vocabulary (OOV) words problem, where speaker- and context-dependent named entities like contact names must be incorporated into the ASR on-the-fly for every speech utterance at testing time. Challenges still remain, however, in improving the overall accuracy of embedding-matching A2W. In this paper, we contribute two methods that improve the accuracy of embedding-matching A2W. First, we propose internally producing multiple embeddings, instead of a single embedding, at each instance in time, which allows the A2W model to propose a richer set of hypotheses over multiple time segments in the audio. Second, we propose using word pronunciation embeddings rather than word orthography embeddings to reduce ambiguities introduced by words that have more than one sound. We show that the above ideas give significant accuracy improvement, with the same training data and nearly identical model size, in scenarios where dynamic OOV words play a crucial role. On a dataset of queries to a speech-based digital assistant that include many user-dependent contact names, we observe up to 18% decrease in word error rate using the proposed improvements.
Submitted 19 February, 2023; v1 submitted 29 October, 2022;
originally announced October 2022.
-
A Summary of the ALQAC 2021 Competition
Authors:
Nguyen Ha Thanh,
Bui Minh Quan,
Chau Nguyen,
Tung Le,
Nguyen Minh Phuong,
Dang Tran Binh,
Vuong Thi Hai Yen,
Teeradaj Racharak,
Nguyen Le Minh,
Tran Duc Vu,
Phan Viet Anh,
Nguyen Truong Son,
Huy Tien Nguyen,
Bhumindr Butr-indr,
Peerapon Vateekul,
Prachya Boonkwan
Abstract:
We summarize the evaluation of the first Automated Legal Question Answering Competition (ALQAC 2021). This year's competition comprises three tasks aimed at processing statute law documents: Legal Text Information Retrieval (Task 1), Legal Text Entailment Prediction (Task 2), and Legal Text Question Answering (Task 3). The final goal of these tasks is to build a system that can automatically determine whether a particular statement is lawful. There is no limit on the approaches participating teams may use. This year, 5 teams participated in Task 1, 6 teams in Task 2, and 5 teams in Task 3, submitting 36 runs in total to the organizers. In this paper, we summarize each team's approach and official results, and discuss the competition. Only results of teams that successfully submitted a paper describing their approach are reported.
Submitted 24 April, 2022; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC Challenge
Authors:
Chia-Yen Lee,
Hsiang-Chin Chien,
Ching-Ping Wang,
Hong Yen,
Kai-Wen Zhen,
Hong-Kun Lin
Abstract:
Colorectal cancer is one of the most common cancers worldwide, so early pathological examination is very important. However, identifying the number and type of cells in H&E images is time-consuming and labor-intensive in clinical practice. The CoNIC Challenge 2022 therefore proposes the tasks of automatically segmenting and classifying cells and counting the cellular composition of H&E images from pathological sections. We propose a multi-scale Swin transformer with HTC for this challenge and apply known normalization methods to generate additional augmented data. Our results show that the multi-scale design plays a crucial role in identifying features at different scales and that the augmentation improves the model's recognition.
Submitted 16 April, 2024; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
Authors:
Shaoyi Huang,
Dongkuan Xu,
Ian E. H. Yen,
Yijue Wang,
Sung-en Chang,
Bingbing Li,
Shiyang Chen,
Mimi Xie,
Sanguthevar Rajasekaran,
Hang Liu,
Caiwen Ding
Abstract:
Conventional wisdom in pruning Transformer-based language models is that pruning reduces the model expressiveness and thus is more likely to underfit rather than overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis, that is: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.
Submitted 16 January, 2023; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition
Authors:
Hao Yen,
Pin-Jui Ku,
Chao-Han Huck Yang,
Hu Hu,
Sabato Marco Siniscalchi,
Pin-Yu Chen,
Yu Tsao
Abstract:
In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR), and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatches between source and target domains, and further improve the stability of AR, we propose a novel similarity-based label mapping technique to align classes. In addition, the transfer learning (TL) technique is combined with the original AR process to improve the model adaptation capability. We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech. Experimental results show that with a pretrained acoustic model (AM) trained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on Arabic and Lithuanian speech commands datasets, with only a limited amount of training data.
Submitted 30 October, 2023; v1 submitted 8 October, 2021;
originally announced October 2021.
-
A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification
Authors:
Hao Yen,
Chao-Han Huck Yang,
Hu Hu,
Sabato Marco Siniscalchi,
Qing Wang,
Yuyang Wang,
Xianjun Xia,
Yuanjun Zhao,
Yuzhong Wu,
Yannan Wang,
Jun Du,
Chin-Hui Lee
Abstract:
We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely the Lottery Ticket Hypothesis (LTH), to find a sub-network neural model associated with a small number of non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery can compress an ASC model up to $1/10^{4}$ and attain superior performance (validation accuracy of 79.4% and log loss of 0.64) compared to its uncompressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address the "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" task in the DCASE 2021 Challenge Task 1a.
Submitted 1 May, 2022; v1 submitted 3 July, 2021;
originally announced July 2021.
-
Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm
Authors:
Dongkuan Xu,
Ian E. H. Yen,
Jinxi Zhao,
Zhibin Xiao
Abstract:
Transformer-based pre-trained language models have significantly improved the performance of various natural language processing (NLP) tasks in recent years. While effective and prevalent, these models are usually prohibitively large for resource-limited deployment scenarios. A thread of research has thus been working on applying network pruning techniques under the pretrain-then-finetune paradigm widely adopted in NLP. However, the existing pruning results on benchmark transformers, such as BERT, are not as remarkable as the pruning results in the literature on convolutional neural networks (CNNs). In particular, common wisdom in pruning CNNs states that sparse pruning compresses a model more than reducing the number of channels and layers does (Elsen et al., 2020; Zhu and Gupta, 2017), while existing works on sparse pruning of BERT yield results inferior to those of its small-dense counterparts such as TinyBERT (Jiao et al., 2020). In this work, we aim to fill this gap by studying how knowledge is transferred and lost during the pre-train, fine-tune, and pruning process, and proposing a knowledge-aware sparse pruning process that achieves results significantly superior to the existing literature. We show for the first time that sparse pruning compresses a BERT model significantly more than reducing its number of channels and layers. Experiments on multiple datasets of the GLUE benchmark show that our method outperforms the leading competitors with a 20-times weight/FLOPs compression and negligible loss in prediction accuracy.
Submitted 16 January, 2022; v1 submitted 17 April, 2021;
originally announced April 2021.
-
Minimizing FLOPs to Learn Efficient Sparse Representations
Authors:
Biswajit Paria,
Chih-Kuan Yeh,
Ian E. H. Yen,
Ning Xu,
Pradeep Ravikumar,
Barnabás Póczos
Abstract:
Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is, however, computationally challenging. Approximate methods based on learning compact representations, such as locality sensitive hashing, product quantization, and PCA, have been widely explored for this problem. In this work, in contrast to learning compact representations, we propose to learn high-dimensional and sparse representations that have similar representational capacity to dense embeddings while being more efficient, because sparse matrix multiplication can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.
Submitted 12 April, 2020;
originally announced April 2020.
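The abstract above rests on the claim that retrieval cost falls quadratically with sparsity when non-zero entries are spread uniformly across dimensions. The sketch below works through that arithmetic under a simplifying assumption not stated in the abstract: dimension j is non-zero independently with probability p[j] in both query and document, so a sparse dot product costs sum(p[j]**2) multiplications in expectation. The `flops_surrogate` function is an assumed form of a continuous relaxation (squared mean absolute activation per dimension), shown for illustration rather than as the paper's exact regularizer.

```python
def expected_flops(p):
    """Expected multiplications for one sparse query-document dot product.

    p[j] is the probability that dimension j is non-zero (assumed independent
    and identical for queries and documents).
    """
    return sum(pj * pj for pj in p)


def flops_surrogate(batch):
    """Assumed continuous relaxation: sum over dims of (mean |activation|)^2."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(abs(row[j]) for row in batch) / n) ** 2 for j in range(dims))


d = 1000
uniform = [0.05] * d               # 5% non-zeros, spread evenly
skewed = [0.5] * 100 + [0.0] * 900  # same 5% average, concentrated in 100 dims
halved = [0.025] * d               # half the density, spread evenly

uniform_cost = expected_flops(uniform)
skewed_cost = expected_flops(skewed)
halved_cost = expected_flops(halved)
print(uniform_cost, skewed_cost, halved_cost)

# Activations concentrated in one dimension are penalized more than spread ones.
concentrated = flops_surrogate([[1.0, 0.0], [1.0, 0.0]])
spread = flops_surrogate([[1.0, 0.0], [0.0, 1.0]])
print(concentrated, spread)
```

With the same average density, the uniform layout is ten times cheaper than the skewed one in this toy setting, and halving a uniform density cuts the expected cost by a factor of four.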
-
Representer Point Selection for Explaining Deep Neural Networks
Authors:
Chih-Kuan Yeh,
Joon Sik Kim,
Ian E. H. Yen,
Pradeep Ravikumar
Abstract:
We propose to explain the predictions of a deep neural network by pointing, for a given test point prediction, to the set of what we call representer points in the training set. Specifically, we show that we can decompose the pre-activation prediction of a neural network into a linear combination of activations of training points, with the weights corresponding to what we call representer values, which thus capture the importance of each training point on the learned parameters of the network. This decomposition provides a deeper understanding of the network than training point influence alone: positive representer values correspond to excitatory training points and negative values to inhibitory points, which, as we show, provides considerably more insight. Our method is also much more scalable, allowing for real-time feedback in a manner not feasible with influence functions.
Submitted 23 November, 2018;
originally announced November 2018.
-
Word Mover's Embedding: From Word2Vec to Document Embedding
Authors:
Lingfei Wu,
Ian E. H. Yen,
Kun Xu,
Fangli Xu,
Avinash Balakrishnan,
Pin-Yu Chen,
Pradeep Ravikumar,
Michael J. Witbrock
Abstract:
While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending it to generate unsupervised sentence or document embeddings. Recent work has demonstrated that Word Mover's Distance (WMD), a distance measure between documents that aligns semantically similar words, yields unprecedented KNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a KNN classifier. In this paper, we propose the Word Mover's Embedding (WME), a novel approach to building an unsupervised document (sentence) embedding from pre-trained word embeddings. In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques, with significantly higher accuracy on problems of short length.
Submitted 30 October, 2018;
originally announced November 2018.
-
Efficient Tensor Decomposition with Boolean Factors
Authors:
Sung-En Chang,
Xun Zheng,
Ian E. H. Yen,
Pradeep Ravikumar,
Rose Yu
Abstract:
Tensor decomposition has been extensively used as a tool for exploratory analysis. Motivated by neuroscience applications, we study tensor decomposition with Boolean factors. The resulting optimization problem is challenging due to the non-convex objective and the combinatorial constraints. We propose Binary Matching Pursuit (BMP), a novel generalization of the matching pursuit strategy to decompose the tensor efficiently. BMP iteratively searches for atoms in a greedy fashion. The greedy atom search step is solved efficiently via a MAXCUT-like boolean quadratic program. We prove that BMP is guaranteed to converge sublinearly to the optimal solution and recover the factors under mild identifiability conditions. Experiments demonstrate the superior performance of our method over baselines on synthetic and real datasets. We also showcase the application of BMP in quantifying neural interactions underlying high-resolution spatiotemporal ECoG recordings.
Submitted 11 November, 2020; v1 submitted 10 October, 2018;
originally announced October 2018.
-
Revisiting Random Binning Features: Fast Convergence and Strong Parallelizability
Authors:
Lingfei Wu,
Ian E. H. Yen,
Jie Chen,
Rui Yan
Abstract:
Kernel methods have been developed as one of the standard approaches for nonlinear learning; however, they do not scale to large data sets due to their quadratic complexity in the number of samples. A number of kernel approximation methods have thus been proposed in recent years, among which the random features method has gained much popularity due to its simplicity and its direct reduction of a nonlinear problem to a linear one. The Random Binning (RB) feature, proposed in the first random-feature paper (Rahimi and Recht, 2007), has drawn much less attention than the Random Fourier (RF) feature. In this work, we observe that the RB features, with the right choice of optimization solver, could be orders of magnitude more efficient than other random features and kernel approximation methods under the same accuracy requirement. We thus propose the first analysis of RB from the perspective of optimization, which, by interpreting RB as Randomized Block Coordinate Descent in the infinite-dimensional space, gives a faster convergence rate than that of other random features. In particular, we show that by drawing $R$ random grids with at least $κ$ non-empty bins per grid in expectation, the RB method achieves a convergence rate of $O(1/(κR))$, which not only sharpens its $O(1/\sqrt{R})$ rate from Monte Carlo analysis, but also shows a $κ$-fold speedup over other random features under the same analysis framework. In addition, we demonstrate another advantage of RB in the L1-regularized setting, where, unlike other random features, an RB-based Coordinate Descent solver can be parallelized with guaranteed speedup proportional to $κ$. Our extensive experiments demonstrate the superior performance of the RB features over other random features and kernel approximation methods. Our code and data are available at https://github.com/teddylfwu/RB_GEN.
Submitted 18 September, 2018; v1 submitted 14 September, 2018;
originally announced September 2018.
-
Complexity Analysis of Balloon Drawing for Rooted Trees
Authors:
Chun-Cheng Lin,
Hsu-Chun Yen,
Sheung-Hung Poon,
Jia-Hao Fan
Abstract:
In a balloon drawing of a tree, all the children under the same parent are placed on the circumference of the circle centered at their parent, and the radius of the circle centered at each node along any path from the root reflects the number of descendants associated with the node. Among various styles of tree drawings reported in the literature, the balloon drawing enjoys a desirable feature of displaying tree structures in a rather balanced fashion. For each internal node in a balloon drawing, the ray from the node to each of its children divides the wedge accommodating the subtree rooted at the child into two sub-wedges. Depending on whether the two sub-wedge angles are required to be identical or not, a balloon drawing can further be divided into two types: even sub-wedge and uneven sub-wedge types. In the most general case, for any internal node in the tree there are two dimensions of freedom that affect the quality of a balloon drawing: (1) altering the order in which the children of the node appear in the drawing, and (2) for the subtree rooted at each child of the node, flipping the two sub-wedges of the subtree. In this paper, we give a comprehensive complexity analysis for optimizing balloon drawings of rooted trees with respect to angular resolution, aspect ratio and standard deviation of angles under various drawing cases depending on whether the tree is of even or uneven sub-wedge type and whether (1) and (2) above are allowed. It turns out that some are NP-complete while others can be solved in polynomial time. We also derive approximation algorithms for those that are intractable in general.
Submitted 14 April, 2010;
originally announced April 2010.
-
On the Ramsey Numbers for Bipartite Multigraphs
Authors:
Ming-Yang Chen,
Hsueh-I. Lu,
Hsu-Chun Yen
Abstract:
A coloring of a complete bipartite graph is shuffle-preserved if assigning a color $c$ to edges $(u, v)$ and $(u', v')$ enforces the same color assignment for edges $(u, v')$ and $(u', v)$. (In words, the induced subgraph with respect to color $c$ is complete.) In this paper, we investigate a variant of the Ramsey problem for the class of complete bipartite multigraphs. (By a multigraph we mean a graph in which multiple edges, but no loops, are allowed.) Unlike the conventional $m$-coloring scheme in Ramsey theory, which imposes a constraint (i.e., $m$) on the total number of colors allowed in a graph, we introduce a relaxed version called $m$-local coloring, which only requires that, for every vertex $v$, the number of colors associated with $v$'s incident edges is bounded by $m$. Note that the number of colors found in a graph under $m$-local coloring may exceed $m$. We prove that given any $n \times n$ complete bipartite multigraph $G$, every shuffle-preserved $m$-local coloring displays a monochromatic copy of $K_{p,p}$ provided that $2(p-1)(m-1) < n$. Moreover, the above bound is tight when (i) $m=2$, or (ii) $n=2^k$ and $m=3\cdot 2^{k-2}$ for every integer $k\geq 2$. Conversely, we show that the existence of a monochromatic $K_{p,p}$ is not guaranteed if $p > \lceil \frac{n}{m} \rceil$. Finally, we give a generalization for $k$-partite graphs and a method applicable to general graphs. Many conclusions obtained for $m$-local coloring carry over to similar results for $m$-coloring.
Submitted 12 May, 2003;
originally announced May 2003.
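The sufficient condition quoted in the abstract above, $2(p-1)(m-1) < n$, is simple arithmetic, and a small helper makes it easy to see which $p$ it guarantees for given $n$ and $m$. This is a sketch for illustration only; the function names are hypothetical and the code only evaluates the stated inequalities, not the underlying proof.

```python
def guarantees_monochromatic(n, m, p):
    """True when the sufficient condition 2(p-1)(m-1) < n from the abstract holds."""
    return 2 * (p - 1) * (m - 1) < n

def largest_guaranteed_p(n, m):
    """Largest integer p with 2(p-1)(m-1) < n, for n >= 1 and m >= 2."""
    if n < 1 or m < 2:
        raise ValueError("expects n >= 1 and m >= 2")
    # 2(p-1)(m-1) <= n-1  is equivalent to  p-1 <= (n-1) // (2(m-1))
    return (n - 1) // (2 * (m - 1)) + 1

def no_guarantee_threshold(n, m):
    """ceil(n/m): the abstract states no guarantee exists for p above this value."""
    return -(-n // m)

p_best = largest_guaranteed_p(8, 2)
threshold = no_guarantee_threshold(8, 2)
```

For $n = 8$ and $m = 2$ both helpers return 4, which is consistent with the abstract's statement that the bound is tight when $m = 2$.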
-
Compact Floor-Planning via Orderly Spanning Trees
Authors:
Chien-Chih Liao,
Hsueh-I Lu,
Hsu-Chun Yen
Abstract:
Floor-planning is a fundamental step in VLSI chip design. Based upon the concept of orderly spanning trees, we present a simple O(n)-time algorithm to construct a floor-plan for any n-node plane triangulation. In comparison with previous floor-planning algorithms in the literature, our solution is not only simpler in the algorithm itself, but also produces floor-plans which require fewer module types. An equally important aspect of our new algorithm lies in its ability to fit the floor-plan area in a rectangle of size (n-1)x(2n+1)/3. Lower bounds on the worst-case area for floor-planning any plane triangulation are also provided in the paper.
Submitted 4 May, 2003; v1 submitted 17 October, 2002;
originally announced October 2002.