Search | arXiv e-print repository

Showing 1–26 of 26 results for author: Wei, K

Searching in archive eess.
  1. arXiv:2405.03152  [pdf, other]

    eess.AS cs.SD

    MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

    Authors: Bingshen Mu, Yangze Li, Qijie Shao, Kun Wei, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

    Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLM), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. H…

    Submitted 6 May, 2024; originally announced May 2024.

  2. arXiv:2405.02132  [pdf, other]

    cs.SD cs.CL eess.AS

    Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

    Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

    Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu…

    Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  3. arXiv:2310.14278  [pdf, other]

    cs.SD cs.CL eess.AS

    Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

    Authors: Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

    Abstract: Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the C…

    Submitted 27 April, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

    Comments: TASLP

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  4. arXiv:2307.04630  [pdf, other]

    cs.SD eess.AS

    The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

    Authors: Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

    Abstract: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Spec…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign

  5. arXiv:2305.17732  [pdf, other]

    cs.SD eess.AS

    StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

    Authors: Kun Song, Yi Ren, Yi Lei, Chunfeng Wang, Kun Wei, Lei Xie, Xiang Yin, Zejun Ma

    Abstract: Direct speech-to-speech translation (S2ST) has gradually become popular as it has many advantages compared with cascade S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores the speech style transfer from a source language to a target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in more p…

    Submitted 25 July, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  6. arXiv:2305.02937  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders

    Authors: Jixuan Wang, Martin Radfar, Kai Wei, Clement Chung

    Abstract: It is challenging to extract semantic meanings directly from audio signals in spoken language understanding (SLU), due to the lack of textual information. Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input to infer semantics, which, however, require computationally expensive auto-regressive decoding. In…

    Submitted 2 June, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: ICASSP 2023

  7. arXiv:2303.17799  [pdf, other]

    cs.CL cs.SD eess.AS

    Dialog act guided contextual adapter for personalized speech recognition

    Authors: Feng-Ju Chang, Thejaswi Muniyappa, Kanthashree Mysore Sathyendra, Kai Wei, Grant P. Strimel, Ross McGowan

    Abstract: Personalization in multi-turn dialogs has been a long standing challenge for end-to-end automatic speech recognition (E2E ASR) models. Recent work on contextual adapters has tackled rare word recognition using user catalogs. This adaptation, however, does not incorporate an important cue, the dialog act, which is available in a multi-turn dialog scenario. In this work, we propose a dialog act guid…

    Submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted at ICASSP 2023

  8. arXiv:2211.05983  [pdf, other]

    cs.SD eess.AS

    Acoustic Pornography Recognition Using Convolutional Neural Networks and Bag of Refinements

    Authors: Lifeng Zhou, Kaifeng Wei, Yuke Li, Yiya Hao, Weiqiang Yang, Haoqi Zhu

    Abstract: A large number of pornographic audios publicly available on the Internet seriously threaten the mental and physical health of children, but these audios are rarely detected and filtered. In this paper, we firstly propose a convolutional neural networks (CNN) based model for acoustic pornography recognition. Then, we research a collection of refinements and verify their effectiveness through ablati…

    Submitted 10 November, 2022; originally announced November 2022.

  9. arXiv:2210.17027  [pdf, other]

    cs.SD cs.CL eess.AS

    Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

    Authors: Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei

    Abstract: Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speec…

    Submitted 30 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  10. Deep Plug-and-Play Prior for Hyperspectral Image Restoration

    Authors: Zeqiang Lai, Kaixuan Wei, Ying Fu

    Abstract: Deep-learning-based hyperspectral image (HSI) restoration methods have gained great popularity for their remarkable performance but often demand expensive network retraining whenever the specifics of task changes. In this paper, we propose to restore HSIs in a unified approach with an effective plug-and-play method, which can jointly retain the flexibility of optimization-based methods and utilize…

    Submitted 17 September, 2022; originally announced September 2022.

    Comments: code at https://github.com/Zeqiang-Lai/DPHSIR

    Journal ref: Neurocomputing 481 (2022) 281-293

  11. arXiv:2207.10670  [pdf, other]

    cs.LG cs.AI eess.SP

    ME-GAN: Learning Panoptic Electrocardio Representations for Multi-view ECG Synthesis Conditioned on Heart Diseases

    Authors: Jintai Chen, Kuanlun Liao, Kun Wei, Haochao Ying, Danny Z. Chen, Jian Wu

    Abstract: Electrocardiogram (ECG) is a widely used non-invasive diagnostic tool for heart diseases. Many studies have devised ECG analysis models (e.g., classifiers) to assist diagnosis. As an upstream task, researches have built generative models to synthesize ECG data, which are beneficial to providing training samples, privacy protection, and annotation reduction. However, previous generative methods for…

    Submitted 29 May, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

    Journal ref: In International Conference on Machine Learning, 3360-3370, (2022), PMLR

  12. arXiv:2207.01039  [pdf, other]

    eess.AS cs.CL cs.SD

    Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR

    Authors: Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma

    Abstract: Leveraging context information is an intuitive idea to improve performance on conversational automatic speech recognition (ASR). Previous works usually adopt recognized hypotheses of historical utterances as preceding context, which may bias the current recognized hypothesis due to the inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representa…

    Submitted 3 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  13. arXiv:2207.00883  [pdf, other]

    cs.SD cs.CL eess.AS

    Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

    Authors: Kun Wei, Pengcheng Guo, Ning Jiang

    Abstract: Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect con…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  14. arXiv:2205.05590  [pdf, other]

    cs.CL cs.SD eess.AS

    A neural prosody encoder for end-to-end dialogue act classification

    Authors: Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Muller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo

    Abstract: Dialogue act classification (DAC) is a critical task for spoken language understanding in dialogue systems. Prosodic features such as energy and pitch have been shown to be useful for DAC. Despite their importance, little research has explored neural approaches to integrate prosodic features into end-to-end (E2E) DAC models which infer dialogue acts directly from audio signals. In this work, we pr…

    Submitted 11 May, 2022; originally announced May 2022.

  15. Intelligent Reflection Enabling Technologies for Integrated and Green Internet-of-Everything Beyond 5G: Communication, Sensing, and Security

    Authors: Wei Shi, Wei Xu, Xiaohu You, Chunming Zhao, Kejun Wei

    Abstract: Internet-of-Everything (IoE) has gradually been recognized as an integral part of future wireless networks. In IoE, there can be an ultra-massive number of smart devices of various types to be served, imposing multi-dimensional requirements on wireless communication, sensing, and security. In this article, we provide a tutorial overview of the promising intelligent reflection communication (IRC) t…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted by IEEE Wireless Communications

  16. arXiv:2204.08171  [pdf, other]

    cs.IT eess.SP

    Distributed Neural Precoding for Hybrid mmWave MIMO Communications with Limited Feedback

    Authors: Kai Wei, Jindan Xu, Wei Xu, Ning Wang, Dong Chen

    Abstract: Hybrid precoding is a cost-efficient technique for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) communications. This paper proposes a deep learning approach by using a distributed neural network for hybrid analog-and-digital precoding design with limited feedback. The proposed distributed neural precoding network, called DNet, is committed to achieving two objectives. Fir…

    Submitted 18 April, 2022; originally announced April 2022.

    Comments: 13 pages, 4 figures

  17. arXiv:2204.00558  [pdf, other]

    cs.CL cs.SD eess.AS

    Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding

    Authors: Xuandi Fu, Feng-Ju Chang, Martin Radfar, Kai Wei, Jing Liu, Grant P. Strimel, Kanthashree Mysore Sathyendra

    Abstract: End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency when compared to traditionally cascaded pipelines. Existing E2E SLU models usually follow a two-stage configuration where an Automatic Speech Recognition (ASR) network first predicts a transcript which is then passed to a Natural Language Understanding (N…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: Accepted at ICASSP 2022

  18. arXiv:2202.07855  [pdf, other]

    cs.SD cs.CL eess.AS

    Conversational Speech Recognition By Learning Conversation-level Characteristics

    Authors: Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma

    Abstract: Conversational automatic speech recognition (ASR) is a task to recognize conversational speech including multiple speakers. Unlike sentence-level ASR, conversational ASR can naturally take advantages from specific characteristics of conversation, such as role preference and topical coherence. This paper proposes a conversational ASR model which explicitly learns conversation-level characteristics…

    Submitted 17 February, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: ICASSP 2022

  19. arXiv:2108.02158  [pdf, other]

    eess.IV cs.CV

    Physics-based Noise Modeling for Extreme Low-light Photography

    Authors: Kaixuan Wei, Ying Fu, Yinqiang Zheng, Jiaolong Yang

    Abstract: Enhancing the visibility in extreme low-light environments is a challenging task. Under nearly lightless condition, existing image denoising methods could easily break down due to significantly low SNR. In this paper, we systematically study the noise statistics in the imaging pipeline of CMOS photosensors, and formulate a comprehensive noise model that can accurately characterize the real noise s…

    Submitted 4 August, 2021; originally announced August 2021.

    Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI); code is available at https://github.com/Vandermode/ELD. arXiv admin note: substantial text overlap with arXiv:2003.12751

  20. arXiv:2107.11007  [pdf, other]

    eess.IV cs.CV

    Dynamic Proximal Unrolling Network for Compressive Imaging

    Authors: Yixiao Yang, Ran Tao, Kaixuan Wei, Ying Fu

    Abstract: Compressive imaging aims to recover a latent image from under-sampled measurements, suffering from a serious ill-posed inverse problem. Recently, deep neural networks have been applied to this problem with superior results, owing to the learned advanced image priors. These approaches, however, require training separate models for different imaging modalities and sampling ratios, leading to overfit…

    Submitted 25 October, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

  21. arXiv:2012.05703  [pdf, other]

    cs.CV eess.IV

    TFPnP: Tuning-free Plug-and-Play Proximal Algorithm with Applications to Inverse Imaging Problems

    Authors: Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Hua Huang, Carola-Bibiane Schönlieb

    Abstract: Plug-and-Play (PnP) is a non-convex optimization framework that combines proximal algorithms, for example, the alternating direction method of multipliers (ADMM), with advanced denoising priors. Over the past few years, great empirical success has been obtained by PnP algorithms, especially for the ones that integrate deep learning-based denoisers. However, a key challenge of PnP approaches is the…

    Submitted 18 September, 2021; v1 submitted 18 November, 2020; originally announced December 2020.

    Comments: The Journal version (47 pages) of arXiv:2002.09611 (ICML'20 Award Paper); Code is released at https://github.com/Vandermode/TFPnP

  22. arXiv:2011.09301  [pdf, other]

    cs.SD eess.AS

    Context-aware RNNLM Rescoring for Conversational Speech Recognition

    Authors: Kun Wei, Pengcheng Guo, Hang Lv, Zhen Tu, Lei Xie

    Abstract: Conversational speech recognition is regarded as a challenging task due to its free-style speaking and long-term contextual dependencies. Prior work has explored the modeling of long-range context through RNNLM rescoring with improved performance. To further take advantage of the persisted nature during a conversation, such as topics or speaker turn, we extend the rescoring procedure to a new cont…

    Submitted 18 November, 2020; originally announced November 2020.

  23. arXiv:2010.13956  [pdf, other]

    eess.AS cs.SD

    Recent Developments on ESPnet Toolkit Boosted by Conformer

    Authors: Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

    Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translations (ST), speech separation (SS) and text-to-…

    Submitted 29 October, 2020; v1 submitted 26 October, 2020; originally announced October 2020.

  24. arXiv:2003.12751  [pdf, other]

    eess.IV cs.CV

    A Physics-based Noise Formation Model for Extreme Low-light Raw Denoising

    Authors: Kaixuan Wei, Ying Fu, Jiaolong Yang, Hua Huang

    Abstract: Lacking rich and realistic data, learned single image denoising algorithms generalize poorly to real raw images that do not resemble the data used for training. Although the problem can be alleviated by the heteroscedastic Gaussian model for noise synthesis, the noise sources caused by digital camera electronics are still largely overlooked, despite their significant effect on raw measurement, esp…

    Submitted 9 April, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

    Comments: Accepted to CVPR 2020 (oral); code is available at https://github.com/Vandermode/NoiseModel

  25. arXiv:2002.09611  [pdf, other]

    eess.IV cs.CV

    Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems

    Authors: Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Carola-Bibiane Schönlieb, Hua Huang

    Abstract: Plug-and-play (PnP) is a non-convex framework that combines ADMM or other proximal algorithms with advanced denoiser priors. Recently, PnP has achieved great empirical success, especially with the integration of deep learning-based denoisers. However, a key problem of PnP based approaches is that they require manual parameter tweaking. It is necessary to obtain high-quality results across the high…

    Submitted 18 November, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

  26. arXiv:1911.03150  [pdf, other]

    math.NA eess.IV

    Data Driven Tight Frame for Compressed Sensing MRI Reconstruction via Off-the-Grid Regularization

    Authors: Jian-Feng Cai, Jae Kyu Choi, Ke Wei

    Abstract: Recently, the finite-rate-of-innovation (FRI) based continuous domain regularization is emerging as an alternative to the conventional on-the-grid sparse regularization for the compressed sensing (CS) due to its ability to alleviate the basis mismatch between the true support of the shape in the continuous domain and the discrete grid. In this paper, we propose a new off-the-grid regularization fo…

    Submitted 7 April, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

    MSC Class: 42B05; 65K15; 65R32; 68U10; 90C90; 92C55; 94A12; 94A20