Search | arXiv e-print repository

Showing 1–50 of 67 results for author: Yan, B

Searching in archive eess. Search in all archives.
  1. arXiv:2406.02950  [pdf, other]

    eess.AS cs.CL cs.SD

    4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on appl…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  2. arXiv:2406.02859   

    eess.AS cs.SD

    ConPCO: Preserving Phoneme Characteristics for Automatic Pronunciation Assessment Leveraging Contrastive Ordinal Regularization

    Authors: Bi-Cheng Yan, Wei-Cheng Chao, Jiun-Ting Li, Yi-Cheng Wang, Hsin-Wei Wang, Meng-Shin Lin, Berlin Chen

    Abstract: Automatic pronunciation assessment (APA) manages to evaluate the pronunciation proficiency of a second language (L2) learner in a target language. Existing efforts typically draw on regression models for proficiency score prediction, where the models are trained to estimate target values without explicitly accounting for phoneme-awareness in the feature space. In this paper, we propose a contrasti…

    Submitted 8 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: This paper has been withdrawn because the authors aim to achieve better organization in writing and more detailed experimental analysis

  3. arXiv:2404.09149  [pdf, other]

    eess.SY cs.NE math.NA

    Heuristic Solution to Joint Deployment and Beamforming Design for STAR-RIS Aided Networks

    Authors: Bai Yan, Qi Zhao, Jin Zhang, J. Andrew Zhang

    Abstract: This paper tackles the deployment challenges of Simultaneous Transmitting and Reflecting Reconfigurable Intelligent Surface (STAR-RIS) in communication systems. Unlike existing works that use fixed deployment setups or solely optimize the location, this paper emphasizes the joint optimization of the location and orientation of STAR-RIS. This enables searching across all user grouping possibilities…

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: 30 pages

  4. arXiv:2403.12695  [pdf, other]

    eess.IV cs.CV cs.LG

    Federated Semi-supervised Learning for Medical Image Segmentation with intra-client and inter-client Consistency

    Authors: Yubin Zheng, Peng Tang, Tianjie Ju, Weidong Qiu, Bo Yan

    Abstract: Medical image segmentation plays a vital role in clinic disease diagnosis and medical image analysis. However, labeling medical images for segmentation task is tough due to the indispensable domain expertise of radiologists. Furthermore, considering the privacy and sensitivity of medical images, it is impractical to build a centralized segmentation dataset from different medical institutions. Fede…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Work in progress

  5. arXiv:2401.16658  [pdf, ps, other]

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite…

    Submitted 16 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  6. arXiv:2311.06079  [pdf]

    cs.CV eess.IV

    Enhancing Rock Image Segmentation in Digital Rock Physics: A Fusion of Generative AI and State-of-the-Art Neural Networks

    Authors: Zhaoyang Ma, Xupeng He, Hyung Kwak, Jun Gao, Shuyu Sun, Bicheng Yan

    Abstract: In digital rock physics, analysing microstructures from CT and SEM scans is crucial for estimating properties like porosity and pore connectivity. Traditional segmentation methods like thresholding and CNNs often fall short in accurately detailing rock microstructures and are prone to noise. U-Net improved segmentation accuracy but required many expert-annotated samples, a laborious and error-pron…

    Submitted 10 November, 2023; originally announced November 2023.

  7. arXiv:2310.01839  [pdf]

    eess.AS cs.CL cs.SD

    Preserving Phonemic Distinctions for Ordinal Regression: A Novel Loss Function for Automatic Pronunciation Assessment

    Authors: Bi-Cheng Yan, Hsin-Wei Wang, Yi-Cheng Wang, Jiun-Ting Li, Chi-Han Lin, Berlin Chen

    Abstract: Automatic pronunciation assessment (APA) manages to quantify the pronunciation proficiency of a second language (L2) learner in a language. Prevailing approaches to APA normally leverage neural models trained with a regression loss function, such as the mean-squared error (MSE) loss, for proficiency level prediction. Despite most regression models can effectively capture the ordinality of proficie…

    Submitted 4 October, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU 2023

  8. arXiv:2309.15826  [pdf, other]

    cs.CL cs.SD eess.AS

    Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

    Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

    Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modal…

    Submitted 27 September, 2023; originally announced September 2023.

  9. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  10. arXiv:2309.15686  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

    Authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of…

    Submitted 27 September, 2023; originally announced September 2023.

  11. arXiv:2309.15674  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speech collage: code-switched audio generation by collaging monolingual corpora

    Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We…

    Submitted 27 September, 2023; originally announced September 2023.

  12. arXiv:2309.15317  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

    Authors: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more…

    Submitted 27 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ASRU 2023

  13. arXiv:2309.13876  [pdf, other]

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessib…

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  14. arXiv:2309.11379  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

    Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

    Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at INTERSPEECH 2023

    Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

  15. arXiv:2309.03520  [pdf, ps, other]

    eess.SY

    Deep Reinforcement Learning Enabled Joint Deployment and Beamforming in STAR-RIS Assisted Networks

    Authors: Zhuoyuan Ma, Qi Zhao, Bai Yan, Jin Zhang

    Abstract: In the new generation of wireless communication systems, reconfigurable intelligent surfaces (RIS) and simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) have become competitive network components to achieve intelligent and reconfigurable network environments. However, existing work has not fully studied the deployment freedom of STAR-RIS, which limits furthe…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: 21 pages, 7 figures

    MSC Class: G.1.6; I.2.8

  16. arXiv:2308.10157  [pdf, ps, other]

    eess.IV cs.CV

    Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

    Authors: Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen

    Abstract: To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is the generative adversarial networks (GANs), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternat…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: Accepted and presented in MICCAI 2023. To be published in Proceedings

  17. arXiv:2307.13643  [pdf, other]

    cs.CR cs.SD eess.AS

    Backdoor Attacks against Voice Recognition Systems: A Survey

    Authors: Baochen Yan, Jiahe Lan, Zheng Yan

    Abstract: Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in various real-world applications, from intelligent voice assistance to telephony surveillance and biometric authentication. However, prior research has revealed the vulnerability of VRSs to backdoor attacks, which pose a significant threat to the security and priva…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: 33 pages, 7 figures

  18. arXiv:2307.11005  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

    Authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between pretrained models, and LM cannot be directly utilized as they diverge from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively int…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted at INTERSPEECH 2023

  19. arXiv:2307.09794  [pdf]

    eess.IV cs.CV physics.med-ph

    DiffDP: Radiotherapy Dose Prediction via a Diffusion Model

    Authors: Zhenghao Feng, Lu Wen, Peng Wang, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang

    Abstract: Currently, deep learning (DL) has achieved the automatic prediction of dose distribution in radiotherapy planning, enhancing its efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L_1 or L_2 loss with posterior average calculations. To alleviate this limitation, we innovatively introduce a diffusion-based dose prediction (DiffDP) model…

    Submitted 19 July, 2023; originally announced July 2023.

    Comments: to be published in MICCAI 2023

  20. arXiv:2306.01247  [pdf, other]

    eess.AS

    Tensor decomposition for minimization of E2E SLU model toward on-device processing

    Authors: Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

    Abstract: Spoken Language Understanding (SLU) is a critical speech recognition application and is often deployed on edge devices. Consequently, on-device processing plays a significant role in the practical implementation of SLU. This paper focuses on the end-to-end (E2E) SLU model due to its small latency property, unlike a cascade system, and aims to minimize the computational cost. We reduce the model si…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  21. arXiv:2305.18108  [pdf, other]

    cs.SD eess.AS

    Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

    Authors: Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods employ the output of intermediate layers of the SSL model as real-valued features for downstream tasks, there is potential in exploring alternative approaches that use discretized token sequences. This approach offers benefits such as…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023

  22. arXiv:2305.11095  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

    Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

    Abstract: We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts, by either leveraging another large-scale model, or sim…

    Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  23. arXiv:2305.11073  [pdf, other]

    cs.CL cs.SD eess.AS

    A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

    Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

    Abstract: Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech ASR benchmark, making it…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023. Code: https://github.com/espnet/espnet

  24. arXiv:2305.01620  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge

    Authors: Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: Recently there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), like semantic parsing. In this paper, we describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge which is part of ICASSP Signal Processing Grand Challenge 2023. We experiment with both end-to-end and pipeline system…

    Submitted 6 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: First Place in Track 1 of STOP Challenge, which is part of ICASSP Signal Processing Grand Challenge 2023

  25. arXiv:2305.01194  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

    Authors: Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

    Abstract: This paper describes our system for the low-resource domain adaptation track (Track 3) in Spoken Language Understanding Grand Challenge, which is a part of ICASSP Signal Processing Grand Challenge 2023. In the track, we adopt a pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain with upsampling. For NLU, we fine-tune BART on all the Track3 data and then on low-resource…

    Submitted 11 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: To appear at ICASSP 2023

  26. arXiv:2305.00926  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

    Authors: Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: Most human interactions occur in the form of spoken conversations where the semantic meaning of a given utterance depends on the context. Each utterance in spoken conversation can be represented by many semantic and speaker attributes, and there has been an interest in building Spoken Language Understanding (SLU) systems for automatically predicting these attributes. Recent work has shown that inc…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  27. arXiv:2304.13583  [pdf, other]

    eess.IV cs.CV

    Multi-Modality Deep Network for Extreme Learned Image Compression

    Authors: Xuhao Jiang, Weimin Tan, Tian Tan, Bo Yan, Liquan Shen

    Abstract: Image-based single-modality compression learning approaches have demonstrated exceptionally powerful encoding and decoding capabilities in the past few years, but suffer from blur and severe semantics loss at extremely low bitrates. To address this issue, we propose a multimodal machine learning method for text-guided image compression, in which the semantic information of text is used as prior i…

    Submitted 26 April, 2023; originally announced April 2023.

    Comments: 13 pages, 14 figures, accepted by AAAI 2023

  28. arXiv:2304.07990  [pdf, other]

    eess.SY

    Novel Quality Measure and Efficient Resolution of Convex Hull Pricing for Unit Commitment

    Authors: Mikhail A. Bragin, Farhan Hyder, Bing Yan, Peter B. Luh, Jinye Zhao, Feng Zhao, Dane A. Schiro, Tongxin Zheng

    Abstract: Electricity prices determined by economic dispatch that do not consider fixed costs may lead to significant uplift payments. However, when fixed costs are included, prices become non-monotonic with respect to demand, which can adversely impact market transparency. To overcome this issue, convex hull (CH) pricing has been introduced for unit commitment with fixed costs. Several CH pricing methods h…

    Submitted 17 April, 2023; originally announced April 2023.

  29. arXiv:2304.04596  [pdf, other]

    cs.SD cs.CL eess.AS

    ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

    Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

    Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-…

    Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: ACL 2023; System Demonstration

  30. arXiv:2302.12829  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Massively Multilingual ASR With Auxiliary CTC Objectives

    Authors: William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe

    Abstract: Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages. With how many languages these models have to handle, however, a key to understanding their imbalanced performance across different languages is to examine if the model actually knows which language it should transcribe. In this paper, we introduce our work on im…

    Submitted 27 February, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

    Comments: 5 pages, 1 figure, accepted at ICASSP 2023; fixed typo and URL in abstract

  31. arXiv:2212.10818  [pdf, other]

    cs.SD cs.CL eess.AS

    4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

    Authors: Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

    Abstract: The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models d…

    Submitted 29 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: Accepted by INTERSPEECH 2023

  32. arXiv:2211.08989  [pdf, other]

    cs.CL cs.SD eess.AS

    Avoid Overthinking in Self-Supervised Models for Speech Recognition

    Authors: Dan Berrebbi, Brian Yan, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) models reshaped our approach to speech, language and vision. However their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models is worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically red…

    Submitted 1 November, 2022; originally announced November 2022.

  33. arXiv:2211.05967  [pdf, ps, other]

    cs.CL eess.AS

    Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

    Authors: Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

    Abstract: The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises f…

    Submitted 10 November, 2022; originally announced November 2022.

  34. arXiv:2211.01458  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Zero-Shot Code-Switched Speech Recognition

    Authors: Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe

    Abstract: In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, th…

    Submitted 9 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: 5 pages

  35. arXiv:2210.16663  [pdf, other]

    eess.AS cs.CL

    BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

    Authors: Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the…

    Submitted 19 April, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: v1: Accepted to Findings of EMNLP2022, v2: Minor corrections and clearer derivation of Eq. (21)

  36. arXiv:2210.15734  [pdf, other]

    cs.CL cs.SD eess.AS

    Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

    Authors: Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task causing a divergence from its well-established token-level tagging formulation. We build compositional end-to-end SLU systems that explicitly separate the a…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted at EMNLP 2022 Findings. Our code and models will be publicly available as part of the ESPnet-SLU toolkit: https://github.com/espnet/espnet and the release can be followed here: https://github.com/espnet/espnet/pull/4735

  37. arXiv:2210.07499  [pdf, other]

    cs.CL cs.SD eess.AS

    Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

    Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target…

    Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  38. arXiv:2210.05200  [pdf, other]

    cs.CL cs.SD eess.AS

    CTC Alignments Improve Autoregressive Translation

    Authors: Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CT…

    Submitted 11 October, 2022; originally announced October 2022.

  39. arXiv:2207.09514  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

    Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

    Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  40. arXiv:2207.06670  [pdf, other]

    cs.CL cs.SD eess.AS

    Two-Pass Low Latency End-to-End Spoken Language Understanding

    Authors: Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

    Abstract: End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches. However, recent work has shown that these models struggle to generalize to new phrasings for the same intent, indicating that models cannot understand the semantic content of the given utterance. In this work, we…

    Submitted 29 July, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: INTERSPEECH 2022

  41. arXiv:2207.06617  [pdf, other]

    eess.IV cs.CV

    Perception-Oriented Stereo Image Super-Resolution

    Authors: Chenxi Ma, Bo Yan, Weimin Tan, Xuhao Jiang

    Abstract: Recent studies of deep learning based stereo image super-resolution (StereoSR) have promoted the development of StereoSR. However, existing StereoSR models mainly concentrate on improving quantitative evaluation metrics and neglect the visual quality of super-resolved stereo images. To improve the perceptual performance, this paper proposes the first perception-oriented stereo image super-resoluti…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: 9 pages, 10 figures, ACM MM 2021

  42. arXiv:2206.12046  [pdf, other]

    cs.CV cs.LG eess.IV

    Bilateral Network with Channel Splitting Network and Transformer for Thermal Image Super-Resolution

    Authors: Bo Yan, Leilei Cao, Fengliang Qi, Hongbin Wang

    Abstract: In recent years, the Thermal Image Super-Resolution (TISR) problem has become an attractive research topic. TISR would be used in a wide range of fields, including military, medical, agricultural and animal ecology. Due to the success of the PBVS-2020 and PBVS-2021 workshop challenges, the results of TISR keep improving and attract more researchers to sign up for the PBVS-2022 challenge. In this paper,…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: The second place solution for CVPR2022 PBVS-TISR challenge

  43. arXiv:2204.02470  [pdf, other]

    cs.CL cs.SD eess.AS

    Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation

    Authors: Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel Lopez-Francisco, Jonathan D. Amith, Shinji Watanabe

    Abstract: Self-Supervised Learning (SSL) models have been successfully applied in various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends highly on the relatedness between the SSL training domain(s) and the target data domain. On the contrary, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted…

    Submitted 18 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: 5 pages, 2 figures, submitted to Interspeech 2022

  44. arXiv:2111.15016  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

    Authors: Brian Yan, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe, Dong Yu

    Abstract: Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint m…

    Submitted 29 November, 2021; originally announced November 2021.

  45. arXiv:2111.14706  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

    Authors: Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

    Abstract: As Automatic Speech Processing (ASR) systems are getting better, there is an increasing interest in using the ASR output to do downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open source standard that can b…

    Submitted 3 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

    Comments: Accepted at ICASSP 2022 (5 pages)

  46. arXiv:2111.03333  [pdf]

    cs.CL cs.SD eess.AS

    Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

    Authors: Bi-Cheng Yan, Hsin-Wei Wang, Shih-Hsuan Chiu, Hsuan-Sheng Chiu, Berlin Chen

    Abstract: Conversational speech normally is embodied with loose syntactic structures at the utterance level but simultaneously exhibits topical coherence relations across consecutive utterances. Prior work has shown that capturing longer context information with a recurrent neural network or long short-term memory language model (LM) may suffer from the recent bias while excluding the long-range context. In…

    Submitted 31 May, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

    Comments: 6 pages, 6 figures, and 4 tables. Accepted by 2022 International Joint Conference on Neural Networks (IJCNN 2022)

  47. arXiv:2109.12804  [pdf, other]

    eess.AS cs.CL cs.SD

    Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

    Authors: Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe

    Abstract: The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model decomposing the overall task into ASR and machine translation sub-tasks. However, the decoding speed is not fast enough for real-world applications be…

    Submitted 27 September, 2021; originally announced September 2021.

    Comments: Accepted at IEEE ASRU 2021

  48. arXiv:2108.13816  [pdf]

    eess.AS

    Maximum F1-score training for end-to-end mispronunciation detection and diagnosis of L2 English speech

    Authors: Bi-Cheng Yan, Shao-Wei Fan Jiang, Fu-An Chao, Berlin Chen

    Abstract: End-to-end (E2E) neural models are increasingly attracting attention as a promising modeling approach for mispronunciation detection and diagnosis (MDD). Typically, these models are trained by optimizing a cross-entropy criterion, which corresponds to improving the log-likelihood of the training data. However, there is a discrepancy between the objectives of model training and the MDD evaluation,…

    Submitted 9 July, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

    Comments: Accepted by IEEE International Conference on Multimedia and Expo (ICME 2022)

  49. arXiv:2108.11627  [pdf]

    cs.MM cs.SD eess.AS

    Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods

    Authors: Shao-Wei Fan Jiang, Bi-Cheng Yan, Tien-Hong Lo, Fu-An Chao, Berlin Chen

    Abstract: With the acceleration of globalization, more and more people are willing or required to learn second languages (L2). One of the major remaining challenges facing current mispronunciation detection and diagnosis (MDD) models for use in computer-assisted pronunciation training (CAPT) is to handle speech from L2 learners with a diverse set of accents. In this paper, we set out to mitigate the adverse effects o…

    Submitted 3 October, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

    Comments: Accepted by ASRU 2021

  50. arXiv:2108.00501  [pdf, other]

    eess.SP

    A Track-Before-Detect Algorithm for UWB Radar Sensor Networks

    Authors: B. Yan, A. Giorgetti, E. Paolini

    Abstract: Precise localization and tracking of moving non-collaborative persons and objects using a network of ultra-wideband (UWB) radar nodes has been shown to represent a practical and effective approach. In UWB radar sensor networks (RSNs), existence of strong clutter, weak target echoes, and closely spaced targets are obstacles to achieving a satisfactory tracking performance. Using a track-before-dete…

    Submitted 1 August, 2021; originally announced August 2021.

    Comments: 39 pages, 20 figures. Accepted in Signal Processing (Elsevier)