Search | arXiv e-print repository

Showing 1–50 of 96 results for author: Yan, Z

Searching in archive eess. Search in all archives.
  1. arXiv:2407.20262  [pdf

    eess.SP

    A Neural-Network-Embedded Equivalent Circuit Model for Lithium-ion Battery State Estimation

    Authors: Zelin Guo, Yiyan Li, Zheng Yan, Mo-Yuen Chow

    Abstract: Equivalent Circuit Model (ECM) has been widely used in battery modeling and state estimation because of its simplicity, stability and interpretability. However, ECM may generate large estimation errors in extreme working conditions such as freezing environment temperature and complex charging/discharging behaviors, in which scenarios the electrochemical characteristics of the battery become extremely complex and… ▽ More

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: 8 pages

  2. arXiv:2407.13691  [pdf, other

    eess.SP

    Unsupervised and Interpretable Synthesizing for Electrical Time Series Based on Information Maximizing Generative Adversarial Nets

    Authors: Zhenghao Zhou, Yiyan Li, Runlong Liu, Zheng Yan, Mo-Yuen Chow

    Abstract: Generating synthetic data has become a popular alternative solution to deal with the difficulties in accessing and sharing field measurement data in power systems. However, to make the generation results controllable, existing methods (e.g. Conditional Generative Adversarial Nets, cGAN) require labeled dataset to train the model, which is demanding in practice because many field measurement data l… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  3. arXiv:2407.05407  [pdf, other

    cs.SD cs.AI eess.AS

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role… ▽ More

    Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  4. arXiv:2407.04051  [pdf, other

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  5. arXiv:2406.18361  [pdf, other

    cs.CV cs.AI eess.IV

    Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

    Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

    Abstract: Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first laten… ▽ More

    Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg

  6. arXiv:2406.13275  [pdf, other

    cs.SD cs.CL eess.AS

    Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

    Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

    Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED)… ▽ More

    Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  7. arXiv:2406.07012  [pdf, other

    cs.SD cs.CL eess.AS

    Bridging Language Gaps in Audio-Text Retrieval

    Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multi… ▽ More

    Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: interspeech2024

  8. arXiv:2406.06992  [pdf, other

    cs.SD eess.AS

    Scaling up masked audio encoder learning for general audio classification

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and… ▽ More

    Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  9. arXiv:2406.06543  [pdf, other

    cs.AR cs.LG cs.NE eess.SP

    SparrowSNN: A Hardware/software Co-design for Energy Efficient ECG Classification

    Authors: Zhanglu Yan, Zhenyu Bai, Tulika Mitra, Weng-Fai Wong

    Abstract: Heart disease is one of the leading causes of death worldwide. Given its high risk and often asymptomatic nature, real-time continuous monitoring is essential. Unlike traditional artificial neural networks (ANNs), spiking neural networks (SNNs) are well-known for their energy efficiency, making them ideal for wearable devices and energy-constrained edge computing platforms. However, current energy… ▽ More

    Submitted 6 May, 2024; originally announced June 2024.

  10. arXiv:2405.17818  [pdf, other

    cs.CV eess.IV

    Hyperspectral and multispectral image fusion with arbitrary resolution through self-supervised representations

    Authors: Ting Wang, Zipei Yan, Jizhou Li, Xile Zhao, Chao Wang, Michael Ng

    Abstract: The fusion of a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) has emerged as an effective technique for achieving HSI super-resolution (SR). Previous studies have mainly concentrated on estimating the posterior distribution of the latent high-resolution hyperspectral image (HR-HSI), leveraging an appropriate image prior and likelihood computed from… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

  11. arXiv:2405.11520  [pdf, other

    cs.IT eess.SP

    On Performance of FAS-aided Wireless Powered NOMA Communication Systems

    Authors: Farshad Rostami Ghadi, Masoud Kaveh, Kai-Kit Wong, Riku Jantti, Zheng Yan

    Abstract: This paper studies the performance of a wireless powered communication network (WPCN) under the non-orthogonal multiple access (NOMA) scheme, where users take advantage of an emerging fluid antenna system (FAS). More precisely, we consider a scenario where a transmitter is powered by a remote power beacon (PB) to send information to the planar NOMA FAS-equipped users through Rayleigh fading channe… ▽ More

    Submitted 8 August, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

    Comments: This manuscript has been submitted to the 20th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)

  12. arXiv:2404.13786  [pdf, other

    eess.SY cs.AI cs.DC cs.LG

    Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving

    Authors: Shuyao Shi, Neiwen Ling, Zhehao Jiang, Xuan Huang, Yuze He, Xiaoguang Zhao, Bufang Yang, Chen Bian, Jingfei Xia, Zhenyu Yan, Raymond Yeung, Guoliang Xing

    Abstract: Recently, smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components ca… ▽ More

    Submitted 21 April, 2024; originally announced April 2024.

  13. arXiv:2403.20075  [pdf, ps, other

    cs.LG eess.SY

    Adaptive Decentralized Federated Learning in Energy and Latency Constrained Wireless Networks

    Authors: Zhigang Yan, Dong Li

    Abstract: In Federated Learning (FL), with parameter aggregated by a central node, the communication overhead is a substantial concern. To circumvent this limitation and alleviate the single point of failure within the FL framework, recent studies have introduced Decentralized Federated Learning (DFL) as a viable alternative. Considering the device heterogeneity, and energy cost associated with parameter ag… ▽ More

    Submitted 29 March, 2024; originally announced March 2024.

  14. arXiv:2403.10573  [pdf, other

    eess.IV cs.CR cs.CV cs.LG

    Medical Unlearnable Examples: Securing Medical Data from Unauthorized Training via Sparsity-Aware Local Masking

    Authors: Weixiang Sun, Yixin Liu, Zhiling Yan, Kaidi Xu, Lichao Sun

    Abstract: The rapid expansion of AI in healthcare has led to a surge in medical data generation and storage, boosting medical AI development. However, fears of unauthorized use, like training commercial AI models, hinder researchers from sharing their valuable datasets. To encourage data sharing, one promising solution is to introduce imperceptible noise into the data. This method aims to safeguard the data… ▽ More

    Submitted 7 July, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted by ICML 2024 NextGenAISafety

  15. arXiv:2401.00766  [pdf, other

    cs.CV eess.IV

    Exposure Bracketing is All You Need for Unifying Image Restoration and Enhancement Tasks

    Authors: Zhilu Zhang, Shuohao Zhang, Renlong Wu, Zifei Yan, Wangmeng Zuo

    Abstract: It is highly desired but challenging to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus on specific restoration or enhancement problems, and do not fully explore the potential of utilizing multiple ima… ▽ More

    Submitted 31 May, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

    Comments: 21 pages

  16. arXiv:2312.14860  [pdf, other

    cs.SD eess.AS

    Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures

    Authors: Lingyun Zuo, Keyu An, Shiliang Zhang, Zhijie Yan

    Abstract: In a speech recognition system, voice activity detection (VAD) is a crucial frontend module. Addressing the issues of poor noise robustness in traditional binary VAD systems based on DFSMN, the paper further proposes semantic VAD based on multi-task learning with improved models for real-time and offline systems, to meet specific application requirements. Evaluations on internal datasets show that… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

  17. arXiv:2311.07919  [pdf, other

    eess.AS cs.CL cs.LG

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Authors: Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

    Abstract: Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-… ▽ More

    Submitted 21 December, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: The code, checkpoints and demo are released at https://github.com/QwenLM/Qwen-Audio

  18. arXiv:2310.04673  [pdf, other

    cs.SD cs.AI cs.LG cs.MM eess.AS

    LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

    Authors: Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

    Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as a… ▽ More

    Submitted 2 July, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: 10 pages, work in progress

  19. arXiv:2309.13573  [pdf, other

    cs.SD eess.AS

    The second multi-channel multi-party meeting transcription challenge (M2MeT) 2.0): A benchmark for speaker-attributed ASR

    Authors: Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu

    Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" at typical meeting scenario. We particularly established two sub-tr… ▽ More

    Submitted 5 October, 2023; v1 submitted 24 September, 2023; originally announced September 2023.

    Comments: 8 pages, Accepted by ASRU2023

  20. arXiv:2309.05674  [pdf, other

    eess.IV cs.CV

    ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

    Authors: Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, Li Yu

    Abstract: Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet, relatively limited well-annotated medical image data makes transformers struggle to extract diverse global features, resulting in attention collapse where attention maps become similar or even identical. Comparatively, convolutional neural networks (CNNs) have better convergence p… ▽ More

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: Accepted by MICCAI 2023

  21. arXiv:2309.04132  [pdf, other

    cs.SD eess.AS

    A Two-Stage Training Framework for Joint Speech Compression and Enhancement

    Authors: Jiayi Huang, Zeyu Yan, Wenbin Jiang, Fei Wen

    Abstract: This paper considers the joint compression and enhancement problem for speech signal in the presence of noise. Recently, the SoundStream codec, which relies on end-to-end joint training of an encoder-decoder pair and a residual vector quantizer by a combination of adversarial and reconstruction losses, has shown very promising performance, especially in subjective perception quality. In this work,… ▽ More

    Submitted 8 September, 2023; originally announced September 2023.

  22. arXiv:2308.11957  [pdf, other

    cs.SD eess.AS

    CED: Consistent ensemble distillation for audio tagging

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training fram… ▽ More

    Submitted 7 September, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

  23. arXiv:2308.10181  [pdf

    eess.SY

    Stochastic Optimization of Coupled Power Distribution-Urban Transportation Network Operations with Autonomous Mobility on Demand Systems

    Authors: Han Wang, Xiaoyuan Xu, Yue Chen, Zheng Yan, Mohammad Shahidehpour, Jiaqi Li, Shaolun Xu

    Abstract: Autonomous mobility on demand systems (AMoDS) will significantly affect the operation of coupled power distribution-urban transportation networks (PTNs) by the optimal dispatch of electric vehicles (EVs). This paper proposes an uncertainty method to analyze the operational states of PTNs with AMoDS. First, a PTN operation framework is designed considering the controllable EVs dispatched by AMoDS a… ▽ More

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: 10 pages, 13 figures

  24. arXiv:2308.06496  [pdf, ps, other

    cs.LG cs.PF eess.SP

    Performance Analysis for Resource Constrained Decentralized Federated Learning Over Wireless Networks

    Authors: Zhigang Yan, Dong Li

    Abstract: Federated learning (FL) can lead to significant communication overhead and reliance on a central server. To address these challenges, decentralized federated learning (DFL) has been proposed as a more resilient framework. DFL involves parameter exchange between devices through a wireless network. This study analyzes the performance of resource-constrained DFL using different communication schemes… ▽ More

    Submitted 12 August, 2023; originally announced August 2023.

  25. arXiv:2307.13643  [pdf, other

    cs.CR cs.SD eess.AS

    Backdoor Attacks against Voice Recognition Systems: A Survey

    Authors: Baochen Yan, Jiahe Lan, Zheng Yan

    Abstract: Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in various real-world applications, from intelligent voice assistance to telephony surveillance and biometric authentication. However, prior research has revealed the vulnerability of VRSs to backdoor attacks, which pose a significant threat to the security and priva… ▽ More

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: 33 pages, 7 figures

  26. arXiv:2307.13158  [pdf, other

    cs.LG cs.RO eess.SY

    Multi-UAV Speed Control with Collision Avoidance and Handover-aware Cell Association: DRL with Action Branching

    Authors: Zijiang Yan, Wael Jaafar, Bassant Selim, Hina Tabassum

    Abstract: This paper presents a deep reinforcement learning solution for optimizing multi-UAV cell-association decisions and their moving velocity on a 3D aerial highway. The objective is to enhance transportation and communication performance, including collision avoidance, connectivity, and handovers. The problem is formulated as a Markov decision process (MDP) with UAVs' states defined by velocities and… ▽ More

    Submitted 21 January, 2024; v1 submitted 24 July, 2023; originally announced July 2023.

    Comments: IEEE Globecom 2023 Accepted

  27. arXiv:2306.16241  [pdf, other

    cs.SD eess.AS

    Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

    Authors: Jiuxin Lin, Peng Wang, Heinrich Dinkel, Jun Chen, Zhiyong Wu, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: Previously, Target Speaker Extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor,… ▽ More

    Submitted 7 October, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: Proc. INTERSPEECH 2023, 2488-2492, doi: 10.21437/Interspeech.2023-218

  28. arXiv:2306.16197  [pdf, other

    cs.CV eess.IV

    Multi-IMU with Online Self-Consistency for Freehand 3D Ultrasound Reconstruction

    Authors: Mingyuan Luo, Xin Yang, Zhongnuo Yan, Junyu Li, Yuanji Zhang, Jiongquan Chen, Xindi Hu, Jikuan Qian, Jun Cheng, Dong Ni

    Abstract: Ultrasound (US) imaging is a popular tool in clinical diagnosis, offering safety, repeatability, and real-time capabilities. Freehand 3D US is a technique that provides a deeper understanding of scanned regions without increasing complexity. However, estimating elevation displacement and accumulation error remains challenging, making it difficult to infer the relative position using images alone.… ▽ More

    Submitted 18 July, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: Accepted by MICCAI-2023

  29. arXiv:2306.14170  [pdf, other

    cs.MM cs.SD eess.AS

    AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

    Authors: Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

    Abstract: Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based attention dual-scale model that utilizes cross- and self-attention to fuse and model features from audio and visual. AV-SepFormer splits the audio feature into a number of chunks, equivalent to the length of… ▽ More

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted by ICASSP2023

  30. arXiv:2305.18794  [pdf, other

    cs.SD eess.AS

    Understanding temporally weakly supervised training: A case study for keyword spotting

    Authors: Heinrich Dinkel, Weiji Zhuang, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: The currently most prominent algorithm to train keyword spotting (KWS) models with deep neural networks (DNNs) requires strong supervision i.e., precise knowledge of the spoken keyword location in time. Thus, most KWS approaches treat the presence of redundant data, such as noise, within their training set as an obstacle. A common training paradigm to deal with data redundancies is to use temporal… ▽ More

    Submitted 30 May, 2023; originally announced May 2023.

  31. arXiv:2305.17834  [pdf, other

    cs.SD eess.AS

    Streaming Audio Transformers for Online Audio Tagging

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audi… ▽ More

    Submitted 10 June, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Interspeech2024

  32. arXiv:2305.10680  [pdf, other

    cs.SD cs.CL eess.AS

    Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

    Authors: Xian Shi, Haoneng Luo, Zhifu Gao, Shiliang Zhang, Zhijie Yan

    Abstract: Estimating confidence scores for recognition results is a classic task in ASR field and of vital importance for kinds of downstream tasks and training strategies. Previous end-to-end (E2E) based confidence estimation models (CEM) predict score sequences of equal length with input transcriptions, leading to unreliable estimation when deletion and insertion errors occur. In this paper we proposed CI… ▽ More

    Submitted 24 May, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: 5 pages, 4 figures, Interspeech2023

  33. arXiv:2305.07883  [pdf, other

    eess.IV cs.CV

    Towards Generalizable Medical Image Segmentation with Pixel-wise Uncertainty Estimation

    Authors: Shuai Wang, Zipei Yan, Daoan Zhang, Zhongsen Li, Sirui Wu, Wenxuan Chen, Rui Li

    Abstract: Deep neural networks (DNNs) achieve promising performance in visual recognition under the independent and identically distributed (IID) hypothesis. In contrast, the IID hypothesis is not universally guaranteed in numerous real-world applications, especially in medical image analysis. Medical image segmentation is typically formulated as a pixel-wise classification task in which each pixel is class… ▽ More

    Submitted 24 June, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: 10 pages, 3 figures

  34. arXiv:2303.15161  [pdf, other

    cs.SD eess.AS

    Data Augmentation for Environmental Sound Classification Using Diffusion Probabilistic Model with Top-k Selection Discriminator

    Authors: Yunhao Chen, Yunjie Zhu, Zihui Yan, Jianlu Shen, Zhen Ren, Yifan Huang

    Abstract: Despite consistent advancement in powerful deep learning techniques in recent years, large amounts of training data are still necessary for the models to avoid overfitting. Synthetic datasets using generative adversarial networks (GAN) have recently been generated to overcome this problem. Nevertheless, despite advancements, GAN-based methods are usually hard to train or fail to generate high-qual… ▽ More

    Submitted 4 April, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

  35. arXiv:2303.13932  [pdf, ps, other

    cs.CL cs.SD eess.AS

    Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

    Authors: Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

    Abstract: ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings. MUG includes five tracks, including topic segmentation, topic-level and session-level extractive summarization, topi… ▽ More

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece

  36. arXiv:2303.01812  [pdf, other

    cs.SD eess.AS

    Unified Keyword Spotting and Audio Tagging on Mobile Devices with Transformers

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Keyword spotting (KWS) is a core human-machine-interaction front-end task for most modern intelligent assistants. Recently, a unified (UniKW-AT) framework has been proposed that adds additional capabilities in the form of audio tagging (AT) to a KWS model. However, previous work did not consider the real-world deployment of a UniKW-AT model, where factors such as model size and inference speed are… ▽ More

    Submitted 3 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  37. Non-Iterative Solution for Coordinated Optimal Dispatch via Equivalent Projection-Part II: Method and Applications

    Authors: Zhenfei Tan, Zheng Yan, Haiwang Zhong, Qing Xia

    Abstract: This two-part paper develops a non-iterative coordinated optimal dispatch framework, i.e., free of iterative information exchange, via the innovation of the equivalent projection (EP) theory. The EP eliminates internal variables from technical and economic operation constraints of the subsystem and obtains an equivalent model with reduced scale, which is the key to the non-iterative coordinated op… ▽ More

    Submitted 26 February, 2023; originally announced February 2023.

  38. Non-Iterative Solution for Coordinated Optimal Dispatch via Equivalent Projection-Part I: Theory

    Authors: Zhenfei Tan, Zheng Yan, Haiwang Zhong, Qing Xia

    Abstract: Coordinated optimal dispatch is of utmost importance for the efficient and secure operation of hierarchically structured power systems. Conventional coordinated optimization methods, such as the Lagrangian relaxation and Benders decomposition, require iterative information exchange among subsystems. Iterative coordination methods have drawbacks including slow convergence, risk of oscillation and d… ▽ More

    Submitted 26 February, 2023; originally announced February 2023.

  39. arXiv:2301.12343  [pdf, other

    cs.SD cs.CL eess.AS

    Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

    Authors: Xian Shi, Yanni Chen, Shiliang Zhang, Zhijie Yan

    Abstract: Conventional ASR systems use frame-level phoneme posterior to conduct force-alignment (FA) and provide timestamps, while end-to-end ASR systems especially AED based ones are short of such ability. This paper proposes to perform timestamp prediction (TP) while recognizing by utilizing continuous integrate-and-fire (CIF) mechanism in non-autoregressive ASR model - Paraformer. Focusing on the fire pla… ▽ More

    Submitted 28 January, 2023; originally announced January 2023.

  40. arXiv:2212.00500  [pdf, other

    cs.MM cs.CL cs.LG cs.SD eess.AS

    MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

    Authors: Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, Chang Zhou

    Abstract: In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other… ▽ More

    Submitted 29 November, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  41. arXiv:2211.10243  [pdf, other

    cs.SD cs.MM eess.AS

    Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

    Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

    Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-l… ▽ More

    Submitted 18 November, 2022; originally announced November 2022.

    Comments: Accepted by EMNLP 2022

  42. arXiv:2211.02940  [pdf, other

    cs.SD cs.AI eess.AS

    Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block

    Authors: Yunhao Chen, Yunjie Zhu, Zihui Yan, Yifan Huang, Zhen Ren, Jianlu Shen, Lifang Chen

    Abstract: Recently, massive architectures based on Convolutional Neural Network (CNN) and self-attention mechanisms have become necessary for audio classification. While these techniques are state-of-the-art, these works' effectiveness can only be guaranteed with huge computational costs and parameters, large amounts of data augmentation, transfer from large datasets and some other tricks. By utilizing the… ▽ More

    Submitted 30 May, 2023; v1 submitted 5 November, 2022; originally announced November 2022.

  43. arXiv:2210.08493  [pdf, other

    cs.RO cs.LG eess.SY

    Indoor Smartphone SLAM with Learned Echoic Location Features

    Authors: Wenjie Luo, Qun Song, Zhenyu Yan, Rui Tan, Guosheng Lin

    Abstract: Indoor self-localization is a highly demanded system function for smartphones. The current solutions based on inertial, radio frequency, and geomagnetic sensing may have degraded performance when their limiting factors take effect. In this paper, we present a new indoor simultaneous localization and mapping (SLAM) system that utilizes the smartphone's built-in audio hardware and inertial measureme… ▽ More

    Submitted 16 October, 2022; originally announced October 2022.

  44. arXiv:2210.02287  [pdf

    cs.SD cs.LG eess.AS

    TC-SKNet with GridMask for Low-complexity Classification of Acoustic scene

    Authors: Luyuan Xie, Yan Zhong, Lin Yang, Zhaoyu Yan, Zhonghai Wu, Junjie Wang

    Abstract: Convolution neural networks (CNNs) have good performance in low-complexity classification tasks such as acoustic scene classifications (ASCs). However, there are few studies on the relationship between the length of target speech and the size of the convolution kernels. In this paper, we combine Selective Kernel Network with Temporal-Convolution (TC-SKNet) to adjust the receptive field of convolut… ▽ More

    Submitted 5 October, 2022; originally announced October 2022.

    Comments: Accepted to APSIPA ASC 2022

  45. An empirical study of weakly supervised audio tagging embeddings for general audio representations

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning meth… ▽ More

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: Odyssey 2022

  46. UniKW-AT: Unified Keyword Spotting and Audio Tagging

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training bot… ▽ More

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  47. arXiv:2208.14319  [pdf

    eess.SP cs.AI eess.SY

    Representation Learning based and Interpretable Reactor System Diagnosis Using Denoising Padded Autoencoder

    Authors: Chengyuan Li, Zhifang Qiu, Zhangrui Yan, Meifu Li

    Abstract: With the mass construction of Gen III nuclear reactors, it is a popular trend to use deep learning (DL) techniques for fast and effective diagnosis of possible accidents. To overcome the common problems of previous work in diagnosing reactor accidents using deep learning theory, this paper proposes a diagnostic process that ensures robustness to noisy and crippled data and is interpretable. First,… ▽ More

    Submitted 23 September, 2022; v1 submitted 30 August, 2022; originally announced August 2022.

  48. arXiv:2208.06833  [pdf

    eess.IV cs.CV

    Shuffle Instances-based Vision Transformer for Pancreatic Cancer ROSE Image Classification

    Authors: Tianyi Zhang, Youdan Feng, Yunlu Feng, Yu Zhao, Yanli Lei, Nan Ying, Zhiling Yan, Yufang He, Guanglei Zhang

    Abstract: The rapid on-site evaluation (ROSE) technique can significantly accelerate the diagnosis of pancreatic cancer by immediately analyzing the fast-stained cytopathological images. Computer-aided diagnosis (CAD) can potentially address the shortage of pathologists in ROSE. However, the cancerous patterns vary significantly between different samples, making the CAD task extremely challenging. Besides… ▽ More

    Submitted 14 August, 2022; originally announced August 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2206.03080

  49. arXiv:2206.15069  [pdf, other

    eess.IV cs.CV

    PVT-COV19D: Pyramid Vision Transformer for COVID-19 Diagnosis

    Authors: Lilang Zheng, Jiaxuan Fang, Xiaorun Tang, Hanzhang Li, Jiaxin Fan, Tianyi Wang, Rui Zhou, Zhaoyan Yan

    Abstract: With the outbreak of COVID-19, a large number of relevant studies have emerged in recent years. We propose an automatic COVID-19 diagnosis framework based on lung CT scan images, the PVT-COV19D. In order to accommodate the different dimensions of the image input, we first classified the images using Transformer models, then sampled the images in the dataset according to normal distribution, and fe… ▽ More

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: 8 pages, 1 figure

  50. arXiv:2206.10082  [pdf, other

    cs.CV eess.IV

    Optimally Controllable Perceptual Lossy Compression

    Authors: Zeyu Yan, Fei Wen, Peilin Liu

    Abstract: Recent studies in lossy compression show that distortion and perceptual quality are at odds with each other, which put forward the tradeoff between distortion and perception (D-P). Intuitively, to attain different perceptual quality, different decoders have to be trained. In this paper, we present a nontrivial finding that only two decoders are sufficient for optimally achieving arbitrary (an infi… ▽ More

    Submitted 20 June, 2022; originally announced June 2022.

    Comments: ICML 2022