Search | arXiv e-print repository

Showing 1–50 of 71 results for author: Yang, C H

Searching in archive cs.
  1. arXiv:2406.13912  [pdf, other]

    cs.CV

    From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment

    Authors: Yusuke Hirota, Ryo Hachiuma, Chao-Han Huck Yang, Yuta Nakashima

    Abstract: Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-form…

    Submitted 19 June, 2024; originally announced June 2024.

  2. arXiv:2405.14161  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

    Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 23 pages, Preprint

  3. arXiv:2405.06573  [pdf, other]

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric…

    Submitted 10 May, 2024; originally announced May 2024.

  4. arXiv:2404.14716  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS

    Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayes…

    Submitted 16 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 17 pages, 6 figures

  5. arXiv:2402.06894  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

    Abstract: Recent advances in large language models (LLMs) have advanced the development of multilingual speech and machine translation through reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the divers…

    Submitted 16 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

  6. arXiv:2402.05457  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

    Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

    Abstract: Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introd…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

  7. arXiv:2401.10447  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke

    Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dat…

    Submitted 18 January, 2024; originally announced January 2024.

  8. arXiv:2401.10446  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

    Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by e…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

  9. arXiv:2312.15316  [pdf, other]

    cs.CL eess.AS

    Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

    Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

    Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore pro…

    Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Camera-ready version

  10. arXiv:2312.14378  [pdf, other]

    cs.LG cs.SD eess.AS

    Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

    Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

    Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowle…

    Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

  11. arXiv:2311.12159  [pdf, other]

    cs.CV cs.AI cs.IR cs.LG cs.MM

    Conditional Modeling Based Automatic Video Summarization

    Authors: Jia-Hong Huang, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hung Chen, Marcel Worring

    Abstract: The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story. Video summarization methods mainly rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video. There are other non-visual factors, such as interestingness, representativeness,…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: This work has been submitted to the IEEE for possible publication. arXiv admin note: substantial text overlap with arXiv:2305.00455

  12. arXiv:2310.13013  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Generative error correction for code-switching speech recognition using large language models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

    Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task owing to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP2024

  13. arXiv:2310.06434  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

    Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

    Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the exis…

    Submitted 16 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 as main paper. 10 pages. Revised math notations. GitHub: https://github.com/Srijith-rkr/Whispering-LLaMA

  14. arXiv:2309.15701  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

    Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit…

    Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

  15. arXiv:2309.15649  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

    Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke

    Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines caus…

    Submitted 10 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  16. arXiv:2309.15223  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastow, Ivan Bulyko

    Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limits their practical use in rescoring. Here we p…

    Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  17. arXiv:2309.07081  [pdf, other]

    eess.AS cs.CL cs.SD

    Can Whisper perform speech-based in-context learning?

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chi…

    Submitted 19 March, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  18. arXiv:2307.01947  [pdf, other]

    cs.CV cs.AI cs.IR

    Causal Video Summarizer for Video Exploration

    Authors: Jia-Hong Huang, Chao-Han Huck Yang, Pin-Yu Chen, Andrew Brown, Marcel Worring

    Abstract: Recently, video summarization has been proposed as a method to help video exploration. However, traditional video summarization models only generate a fixed video summary which is usually independent of user-specific needs and hence limits the effectiveness of video exploration. Multi-modal video summarization is one of the approaches utilized to address this issue. Multi-modal video summarization…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: This paper is accepted by IEEE International Conference on Multimedia and Expo (ICME), 2022

  19. arXiv:2306.03741   

    quant-ph cs.LG

    Classical-to-Quantum Transfer Learning Facilitates Machine Learning with Variational Quantum Circuit

    Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh, Hector Zenil, Jesper Tegner

    Abstract: While Quantum Machine Learning (QML) is an exciting emerging area, the accuracy of the loss function still needs to be improved by the number of available qubits. Here, we reformulate the QML problem such that the approximation error (representation power) does not depend on the number of qubits. We prove that a classical-to-quantum transfer learning architecture using a Variational Quantum Circui…

    Submitted 18 June, 2024; v1 submitted 17 May, 2023; originally announced June 2023.

    Comments: The paper needs a major revision before it could be submitted to a new journal, and the authors agree that the latest version could not be open to public at the moment

  20. arXiv:2306.01015  [pdf, other]

    cs.CL cs.NE cs.SD eess.AS

    How to Estimate Model Transferability of Pre-Trained Speech Models?

    Authors: Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath

    Abstract: In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks. We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates using the extracted representations. Our framework efficiently computes transferability…

    Submitted 5 February, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech. Code is available at: https://github.com/virginiakm1988/LogME-CTC. Fixed a typo

  21. arXiv:2306.00331  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP eess.SY

    A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

    Authors: Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF)…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023. Code will be released at https://github.com/Kuray107/S4ND-U-Net_speech_enhancement

  22. arXiv:2305.16932  [pdf, other]

    cs.SD cs.CL eess.AS

    A Neural State-Space Model Approach to Efficient Speech Separation

    Authors: Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng

    Abstract: In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach can efficiently model input signals into a format of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique into speech separation tasks, we firs…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by InterSpeech 2023

  23. arXiv:2305.11360  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Differentially Private Adapters for Parameter Efficient Acoustic Modeling

    Authors: Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi

    Abstract: In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-tra…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023. Code will be available at: https://github.com/Chun-wei-Ho/Private-Speech-Adapter. The authors would like to express their gratitude to Prof. Chin-Hui Lee from Georgia Tech for providing helpful insights and suggestions

  24. arXiv:2305.11320  [pdf, other]

    cs.SD cs.AI cs.NE eess.AS eess.SP

    Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

    Authors: Li-Jen Yang, Chao-Han Huck Yang, Jen-Tzung Chien

    Abstract: This paper presents a parameter-efficient learning (PEL) approach to develop a low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2% to 0.8% of original trainable parameters to achieve competitive performance in voice synthesis. Motivated by a theoretical foundation of optimal transport (OT), this study…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  25. arXiv:2305.11244  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE eess.AS

    A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model

    Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

    Abstract: In this work, we explore Parameter-Efficient-Learning (PEL) techniques to repurpose a General-Purpose-Speech (GSM) model for Arabic dialect identification (ADI). Specifically, we investigate different setups to incorporate trainable features into a multi-layer encoder-decoder GSM formulation under frozen pre-trained settings. Our architecture includes residual adapter and model reprogramming (inpu…

    Submitted 3 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023, 5 pages. Code is available at: https://github.com/Srijith-rkr/KAUST-Whisper-Adapter under MIT license

  26. arXiv:2305.00455  [pdf, other]

    cs.CV cs.AI

    Causalainer: Causal Explainer for Automatic Video Summarization

    Authors: Jia-Hong Huang, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hung Chen, Marcel Worring

    Abstract: The goal of video summarization is to automatically shorten videos such that they convey the overall story without losing relevant information. In many application scenarios, improper video summarization can have a large impact. For example in forensics, the quality of the generated video summary will affect an investigator's judgment while in journalism it might yield undesired bias. Because of th…

    Submitted 30 April, 2023; originally announced May 2023.

    Comments: The paper has been accepted by the CVPR Workshop on New Frontiers in Visual Language Reasoning: Compositionality, Prompts, and Causality, 2023

  27. arXiv:2301.07851  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

    Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Submitted to ICASSP 2023. The project was initiated in May 2022 during a research internship at Google Research

  28. arXiv:2211.01317  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming

    Authors: Yun-Ning Hung, Chao-Han Huck Yang, Pin-Yu Chen, Alexander Lerch

    Abstract: Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neur…

    Submitted 3 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023. The implementation is available at https://github.com/biboamy/music-repro

  29. arXiv:2211.01263  [pdf, other]

    cs.SD cs.LG eess.AS quant-ph

    A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

    Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  30. arXiv:2211.01189  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Inference and Denoise: Causal Inference-based Neural Speech Enhancement

    Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao

    Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement module…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  31. arXiv:2211.00887  [pdf, other]

    quant-ph cs.LG cs.NE eess.SP

    Certified Robustness of Quantum Classifiers against Adversarial Examples through Quantum Noise

    Authors: Jhih-Cing Huang, Yu-Lin Tsai, Chao-Han Huck Yang, Cheng-Fang Su, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo

    Abstract: Recently, quantum classifiers have been found to be vulnerable to adversarial attacks, in which quantum classifiers are deceived by imperceptible noises, leading to misclassification. In this paper, we propose the first theoretical study demonstrating that adding quantum random rotation noise can improve robustness in quantum classifiers against adversarial attacks. We link the definition of diffe…

    Submitted 28 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023

  32. arXiv:2210.06382  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition

    Authors: Chao-Han Huck Yang, Jun Qi, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose an ensemble learning framework with Poisson sub-sampling to effectively train a collection of teacher models to issue some differential privacy (DP) guarantee for training data. Through boosting under DP, a student model derived from the training data suffers little model degradation from the models trained with no privacy protection. Our proposed solution leverages two mechanisms,…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted to ISCA, ISCSLP 2022, Singapore. 5 Pages

  33. arXiv:2210.05614  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS

    An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, I-Fan Chen, Andreas Stolcke, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilit…

    Submitted 13 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 5 pages. Accepted to IEEE SLT 2022. A first version draft was finished in Aug 2021

  34. arXiv:2206.04804  [pdf]

    quant-ph cs.AI cs.LG cs.NE

    Theoretical Error Performance Analysis for Variational Quantum Circuit Based Functional Regression

    Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh

    Abstract: The noisy intermediate-scale quantum (NISQ) devices enable the implementation of the variational quantum circuit (VQC) for quantum neural networks (QNN). Although the VQC-based QNN has succeeded in many machine learning tasks, the representation and generalization powers of VQC still require further investigation, particularly when the dimensionality of classical inputs is concerned. In this work,…

    Submitted 26 October, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Preprint version. 16 pages

  35. arXiv:2203.15529  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Treatment Learning Causal Transformer for Noisy Image Classification

    Authors: Chao-Han Huck Yang, I-Te Danny Hung, Yi-Chieh Liu, Pin-Yu Chen

    Abstract: Current top-notch deep learning (DL) based vision models are primarily based on exploring and exploiting the inherent correlations between training data samples and their associated labels. However, a known practical challenge is their degraded performance against "noisy" data, induced by different circumstances such as spurious correlations, irrelevant contexts, domain shift, and adversarial atta…

    Submitted 30 October, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted to IEEE WACV 2023. The first version was finished in May 2018

  36. arXiv:2203.06031  [pdf, other]

    cs.LG cs.AI cs.SD eess.AS

    Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing

    Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Javier Tejedor

    Abstract: This work focuses on designing low complexity hybrid tensor networks by considering trade-offs between the model complexity and practical performance. Firstly, we exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN. Secondly, a hybrid model combining LR-TT-DNN with a convolutional neural network (CNN), which is denoted as CNN…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: 10 pages, 10 Figures

  37. arXiv:2203.04114  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

    Authors: Qing Wang, Jun Du, Siyuan Zheng, Yunqing Li, Yajian Wang, Yuzhong Wu, Hu Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee

    Abstract: In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performances for audio-visual scene classification (AVSC). We employ pre-trained networks trained only on image data sets to extract video embedding; whereas for audio embedding models, we decide to train them from scratch. We explore different neural network architectures for joint modeling to…

    Submitted 31 August, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: 5 pages, 1 figure

  38. arXiv:2203.03550  [pdf, other]

    cs.CL cs.AI cs.DC cs.NE eess.AS

    When BERT Meets Quantum Temporal Convolution Learning for Text Classification in Heterogeneous Computing

    Authors: Chao-Han Huck Yang, Jun Qi, Samuel Yen-Chi Chen, Yu Tsao, Pin-Yu Chen

    Abstract: The rapid development of quantum computing has demonstrated many unique characteristics of quantum advantages, such as richer feature representation and more secured protection on model parameters. This work proposes a vertical federated learning architecture based on variational quantum circuits to demonstrate the competitive performance of a quantum-enhanced pre-trained BERT model for text class…

    Submitted 17 February, 2022; originally announced March 2022.

    Comments: Accepted to ICASSP 2022

  39. arXiv:2202.08532  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition

    Authors: Chao-Han Huck Yang, Zeeshan Ahmed, Yile Gu, Joseph Szurley, Roger Ren, Linda Liu, Andreas Stolcke, Ivan Bulyko

    Abstract: In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without ac…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

  40. arXiv:2202.08509  [pdf, other]

    cs.SD cs.AI cs.CV cs.LG eess.AS

    A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

    Authors: Hengshun Zhou, Jun Du, Chao-Han Huck Yang, Shifu Xiong, Chin-Hui Lee

    Abstract: Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate on designing a compact audio-visual WWS system by utilizing visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet an…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. H. Zhou et al.

  41. arXiv:2111.14346  [pdf, other]

    cs.LG cs.AI cs.CE cs.NE eess.SY

    Pessimistic Model Selection for Offline Deep Reinforcement Learning

    Authors: Chao-Han Huck Yang, Zhengling Qi, Yifan Cui, Pin-Yu Chen

    Abstract: Deep Reinforcement Learning (DRL) has demonstrated great potentials in solving sequential decision making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: Preprint. A non-archival and preliminary venue was presented at NeurIPS 2021 Offline Reinforcement Learning Workshop

  42. arXiv:2110.08598  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer

    Authors: Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee

    Abstract: We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge num…

    Submitted 20 February, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022. Code is available at https://github.com/MihawkHu/ASC_Knowledge_Transfer

  43. arXiv:2110.03894  [pdf, other]

    eess.AS cs.AI cs.LG cs.NE cs.SD

    Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition

    Authors: Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao

    Abstract: In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR), and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatches between source and target domains, and further improve the stability of AR, w…

    Submitted 30 October, 2023; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Accepted to Interspeech 2023. Code is available at: https://github.com/dodohow1011/SpeechAdvReprogram. Selected as Best Student Paper Candidate

  44. arXiv:2110.03861  [pdf, other]

    quant-ph cs.AI cs.CL cs.CV cs.LG cs.NE

    QTN-VQC: An End-to-End Learning framework for Quantum Neural Networks

    Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen

    Abstract: The advent of noisy intermediate-scale quantum (NISQ) computers raises a crucial challenge to design quantum neural networks for fully quantum learning tasks. To bridge the gap, this work proposes an end-to-end learning framework named QTN-VQC, by introducing a trainable quantum tensor network (QTN) for quantum embedding on a variational quantum circuit (VQC). The architecture of QTN is composed o…

    Submitted 22 November, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Preprint. A Non-archival and preliminary venue was presented in NeurIPS 2021, Quantum Tensor Networks in Machine Learning Workshop

    Journal ref: Quantum Tensor Networks in Machine Learning Workshop, NeurIPS 2021

  45. arXiv:2107.01461  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

    Authors: Hao Yen, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Qing Wang, Yuyang Wang, Xianjun Xia, Yuanjun Zhao, Yuzhong Wu, Yannan Wang, Jun Du, Chin-Hui Lee

    Abstract: We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely Lottery Ticket Hypothesis (LTH), to find a sub-network neural model a…

    Submitted 1 May, 2022; v1 submitted 3 July, 2021; originally announced July 2021.

    Comments: 5 figures. DCASE 2021. The project started in November 2020. Revised version

  46. arXiv:2106.09296  [pdf, other]

    cs.LG cs.AI cs.NE cs.SD eess.AS

    Voice2Series: Reprogramming Acoustic Models for Time Series Classification

    Authors: Chao-Han Huck Yang, Yun-Yun Tsai, Pin-Yu Chen

    Abstract: Learning to classify time series with limited data is a practical yet challenging problem. Current methods are primarily based on hand-designed feature extraction rules or domain-specific data augmentation. Motivated by the advances in deep speech processing models and the fact that voice data are univariate temporal signals, in this paper, we propose Voice2Series (V2S), a novel end-to-end approac…

    Submitted 14 January, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Updated version with a correction. The full draft was submitted in Jan 2021. The Voice2Series project initially was launched in Sep 2020. Accepted to ICML 2021, 16 Pages

    Report number: PMLR 139:11808-11819

    Journal ref: Proceedings of the 38th International Conference on Machine Learning 2021

  47. arXiv:2105.14538  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Longer Version for "Deep Context-Encoding Network for Retinal Image Captioning"

    Authors: Jia-Hong Huang, Ting-Wei Wu, Chao-Han Huck Yang, Marcel Worring

    Abstract: Automatically generating medical reports for retinal images is one of the promising ways to help ophthalmologists reduce their workload and improve work efficiency. In this work, we propose a new context-driven encoding network to automatically generate medical reports for retinal images. The proposed model is mainly composed of a multi-modal input encoder and a fused-feature decoder. Our experime…

    Submitted 30 May, 2021; originally announced May 2021.

    Comments: This paper is a longer version of "Deep Context-Encoding Network for Retinal Image Captioning" which is accepted by IEEE International Conference on Image Processing (ICIP), 2021

  48. arXiv:2104.01271  [pdf, other]

    cs.SD cs.AI cs.LG cs.NE eess.AS

    PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification

    Authors: Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

    Abstract: We propose using an adversarial autoencoder (AAE) to replace generative adversarial network (GAN) in the private aggregation of teacher ensembles (PATE), a solution for ensuring differential privacy in speech applications. The AAE architecture allows us to obtain good synthetic speech leveraging upon a discriminative training of latent vectors. Such synthetic speech is used to build a privacy-pres…

    Submitted 15 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021

  49. arXiv:2102.09677  [pdf, other]

    cs.LG cs.AI cs.NE cs.RO eess.SY

    Training a Resilient Q-Network against Observational Interference

    Authors: Chao-Han Huck Yang, I-Te Danny Hung, Yi Ouyang, Pin-Yu Chen

    Abstract: Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observation by abrupt interferences such as black-out, frozen-screen, and adversarial perturbation. How to design a resilient DRL algorithm against these rare but mission-critical and safety-crucial scenarios is an…

    Submitted 25 January, 2022; v1 submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted to AAAI 2022. 9 pages

  50. arXiv:2012.06209  [pdf, other]

    cs.IR

    KOSMOS: Knowledge-graph Oriented Social media and Mainstream media Overview System

    Authors: Chua Hao Yang, Yong Shan Jie, Boon Kok Chin, Lander Chin, Lynnette Hui Xian Ng

    Abstract: We introduce KOSMOS, a knowledge retrieval system based on the constructed knowledge graph of social media and mainstream media documents. The system first identifies key events from the documents at each time frame through clustering, extracting a document to represent each cluster, then describing the document in terms of 5W1H (Who, What, When, Where, Why, How). The event centric knowledge graph…

    Submitted 17 December, 2020; v1 submitted 11 December, 2020; originally announced December 2020.