Search | arXiv e-print repository

Showing 1–50 of 112 results for author: Ren, Y

Searching in archive eess. Search in all archives.
  1. arXiv:2407.08239  [pdf, other]

    cs.SD cs.LG eess.AS

    An Unsupervised Domain Adaptation Method for Locating Manipulated Region in partially fake Audio

    Authors: Siding Zeng, Jiangyan Yi, Jianhua Tao, Yujie Chen, Shan Liang, Yong Ren, Xiaohui Zhang

    Abstract: When the task of locating manipulation regions in partially-fake audio (PFA) involves cross-domain datasets, the performance of deep learning models drops significantly due to the shift between the source and target domains. To address this issue, existing approaches often employ data augmentation before training. However, they overlook the characteristics in target domain that are absent in sourc…

    Submitted 11 July, 2024; originally announced July 2024.

  2. arXiv:2407.07464  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Video-to-Audio Generation with Hidden Alignment

    Authors: Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

    Abstract: Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techni…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: https://sites.google.com/view/vta-ldm

  3. arXiv:2407.04575  [pdf, other]

    eess.AS

    FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

    Authors: Rubing Shen, Yanzhen Ren, Zongkun Sun

    Abstract: Generative adversarial network (GAN) based vocoders have achieved significant attention in speech synthesis with high quality and fast inference speed. However, there still exist many noticeable spectral artifacts, resulting in the quality decline of synthesized speech. In this work, we adopt a novel GAN-based vocoder designed for few artifacts and high fidelity, called FA-GAN. To suppress the ali…

    Submitted 5 July, 2024; originally announced July 2024.

  4. arXiv:2406.16326  [pdf, other]

    eess.AS

    RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging

    Authors: Mingyang Zhang, Yi Zhou, Yi Ren, Chen Zhang, Xiang Yin, Haizhou Li

    Abstract: This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and loc…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Manuscript under review by TASLP

  5. arXiv:2406.04840  [pdf, other]

    cs.SD eess.AS

    TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

    Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

    Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  6. arXiv:2404.05976  [pdf, other]

    cs.LG eess.SY stat.ME

    A Cyber Manufacturing IoT System for Adaptive Machine Learning Model Deployment by Interactive Causality Enabled Self-Labeling

    Authors: Yutian Ren, Yuqi He, Xuyin Zhang, Aaron Yen, G. P. Li

    Abstract: Machine Learning (ML) has been demonstrated to improve productivity in many manufacturing applications. To host these ML applications, several software and Industrial Internet of Things (IIoT) systems have been proposed for manufacturing applications to deploy ML applications and provide real-time intelligence. Recently, an interactive causality enabled self-labeling method has been proposed to ad…

    Submitted 8 April, 2024; originally announced April 2024.

  7. arXiv:2403.03915  [pdf, other]

    math.OC eess.SY math.PR q-fin.MF q-fin.RM

    Risk-Sensitive Mean Field Games with Common Noise: A Theoretical Study with Applications to Interbank Markets

    Authors: Xin Yue Ren, Dena Firoozi

    Abstract: In this paper, we address linear-quadratic-Gaussian (LQG) risk-sensitive mean field games (MFGs) with common noise. In this framework agents are exposed to a common noise and aim to minimize an exponential cost functional that reflects their risk sensitivity. We leverage the convex analysis method to derive the optimal strategies of agents in the limit as the number of agents goes to infinity. The…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: 47 pages

  8. arXiv:2402.00616  [pdf]

    eess.SP

    Dual-Tap Optical-Digital Feedforward Equalization Enabling High-Speed Optical Transmission in IM/DD Systems

    Authors: Yu Guo, Yangbo Wu, Zhao Yang, Lei Xue, Ning Liang, Yang Ren, Zhengrui Tu, Jia Feng, Qunbi Zhuge

    Abstract: Intensity-modulation and direct-detection (IM/DD) transmission is widely adopted for high-speed optical transmission scenarios due to its cost-effectiveness and simplicity. However, as the data rate increases, the fiber chromatic dispersion (CD) would induce a serious power fading effect, and direct detection could generate inter-symbol interference (ISI). Moreover, the ISI becomes more severe wit…

    Submitted 1 February, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: 6 pages, 7 figures, journal

  9. arXiv:2401.14612  [pdf, ps, other]

    math.OC eess.SY

    On Inhomogeneous Infinite Products of Stochastic Matrices and Applications

    Authors: Zhaoyue Xia, Jun Du, Chunxiao Jiang, H. Vincent Poor, Zhu Han, Yong Ren

    Abstract: With the growth of magnitude of multi-agent networks, distributed optimization holds considerable significance within complex systems. Convergence, a pivotal goal in this domain, is contingent upon the analysis of infinite products of stochastic matrices (IPSMs). In this work, convergence properties of inhomogeneous IPSMs are investigated. The convergence rate of inhomogeneous IPSMs towards an abs…

    Submitted 25 January, 2024; originally announced January 2024.

  10. arXiv:2312.15946  [pdf, other]

    cs.SD cs.GR eess.AS

    EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

    Authors: Bo Han, Yi Ren, Hao Peng, Teng Zhang, Zeyu Ling, Xiang Yin, Feilin Han

    Abstract: The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make…

    Submitted 26 December, 2023; originally announced December 2023.

  11. arXiv:2312.11947  [pdf, other]

    cs.CL cs.SD eess.AS

    Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling

    Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

    Abstract: Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of CSS task, the prior studies have not thoroughly investigated the emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion mo…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 9 pages, 4 figures, Accepted by AAAI'2024, Code and audio samples: https://github.com/walker-hyf/ECSS

  12. arXiv:2312.10953  [pdf, other]

    eess.SY

    Nonparametric Stochastic Analysis of Dynamic Frequency in Power Systems: A Generalized Ito Process Model

    Authors: Can Wan, Yupeng Ren, Ping Ju

    Abstract: The large-scale integration of intermittent renewable energy has brought serious challenges to the frequency security of power systems. In this paper, a novel nonparametric stochastic analysis method of system dynamic frequency is proposed to accurately analyze the impact of renewable energy uncertainty on power system frequency security, independent of any parametric distribution assumption. The…

    Submitted 18 December, 2023; originally announced December 2023.

  13. arXiv:2310.15801  [pdf, other]

    cs.IT cs.AR eess.SP

    A Generalized Adjusted Min-Sum Decoder for 5G LDPC Codes: Algorithm and Implementation

    Authors: Yuqing Ren, Hassan Harb, Yifei Shen, Alexios Balatsoukas-Stimming, Andreas Burg

    Abstract: 5G New Radio (NR) has stringent demands on both performance and complexity for the design of low-density parity-check (LDPC) decoding algorithms and corresponding VLSI implementations. Furthermore, decoders must fully support the wide range of all 5G NR blocklengths and code rates, which is a significant challenge. In this paper, we present a high-performance and low-complexity LDPC decoder, tailo…

    Submitted 17 February, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: 14 pages, 15 figures, accepted by IEEE Transactions on Circuits and Systems I: Regular Paper

  14. arXiv:2310.00014  [pdf, other]

    cs.SD eess.AS

    Fewer-token Neural Speech Codec with Time-invariant Codes

    Authors: Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chuyuan Zhang, Junzuo Zhou

    Abstract: Language model based text-to-speech (TTS) models, like VALL-E, have gained attention for their outstanding in-context learning capability in zero-shot scenarios. Neural speech codec is a critical component of these models, which can convert speech into discrete token representations. However, excessive token sequences from the codec may negatively affect prediction accuracy and restrict the progre…

    Submitted 10 March, 2024; v1 submitted 15 September, 2023; originally announced October 2023.

    Comments: Accepted by ICASSP 2024

  15. arXiv:2309.16265  [pdf, other]

    cs.SD eess.AS

    Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description

    Authors: Wuyang Liu, Yanzhen Ren

    Abstract: Most audio tagging models are trained with one-hot labels as supervised information. However, one-hot labels treat all sound events equally, ignoring the semantic hierarchy and proximity relationships between sound events. In contrast, the event descriptions contains richer information, describing the distance between different sound events with semantic proximity. In this paper, we explore the im…

    Submitted 16 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures. Accepted by ICASSP 2024

  16. Driving behavior-guided battery health monitoring for electric vehicles using machine learning

    Authors: Nanhua Jiang, Jiawei Zhang, Weiran Jiang, Yao Ren, Jing Lin, Edwin Khoo, Ziyou Song

    Abstract: An accurate estimation of the state of health (SOH) of batteries is critical to ensuring the safe and reliable operation of electric vehicles (EVs). Feature-based machine learning methods have exhibited enormous potential for rapidly and precisely monitoring battery health status. However, simultaneously using various health indicators (HIs) may weaken estimation performance due to feature redunda…

    Submitted 25 September, 2023; originally announced September 2023.

    Journal ref: Applied Energy (2024)

  17. arXiv:2309.08166  [pdf, other]

    cs.SD eess.AS

    Controllable Residual Speaker Representation for Voice Conversion

    Authors: Le Xu, Jiangyan Yi, Jianhua Tao, Tao Wang, Yong Ren, Rongxiu Zhong

    Abstract: Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: submitted to ICASSP 2024

  18. arXiv:2309.01480  [pdf, other]

    cs.SD cs.AI eess.AS

    BadSQA: Stealthy Backdoor Attacks Using Presence Events as Triggers in Non-Intrusive Speech Quality Assessment

    Authors: Ying Ren, Kailai Shen, Zhe Ye, Diqun Yan

    Abstract: Non-Intrusive speech quality assessment (NISQA) has gained significant attention for predicting the mean opinion score (MOS) of speech without requiring the reference speech. In practical NISQA scenarios, untrusted third-party resources are often employed during deep neural network training to reduce costs. However, it would introduce a potential security vulnerability as specially designed untrus…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: 5 pages, 6 figures, conference

  19. arXiv:2308.12792  [pdf, other]

    cs.SD eess.AS

    Sparks of Large Audio Models: A Survey and Outlook

    Authors: Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, Björn W. Schuller

    Abstract: This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Pr…

    Submitted 21 September, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: Under review, Repo URL: https://github.com/EmulationAI/awesome-large-audio-models

  20. arXiv:2308.02844  [pdf, other]

    cs.IR cs.SD eess.AS

    Bootstrapping Contrastive Learning Enhanced Music Cold-Start Matching

    Authors: Xinping Zhao, Ying Zhang, Qiang Xiao, Yuming Ren, Yingchun Yang

    Abstract: We study a particular matching task we call Music Cold-Start Matching. In short, given a cold-start song request, we expect to retrieve songs with similar audiences and then fastly push the cold-start song to the audiences of the retrieved songs to warm up it. However, there are hardly any studies done on this task. Therefore, in this paper, we will formalize the problem of Music Cold-Start Matchi…

    Submitted 5 August, 2023; originally announced August 2023.

    Comments: Accepted by WWW'2023

    ACM Class: F.2.2; I.2.8

    Journal ref: Companion Proceedings of the ACM Web Conference 2023, April 2023, Pages 351-355

  21. arXiv:2307.13346  [pdf, other]

    cs.SD cs.MM eess.AS

    A Snoring Sound Dataset for Body Position Recognition: Collection, Annotation, and Analysis

    Authors: Li Xiao, Xiuping Yang, Xinhong Li, Weiping Tu, Xiong Chen, Weiyan Yi, Jie Lin, Yuhong Yang, Yanzhen Ren

    Abstract: Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS) is a chronic breathing disorder caused by a blockage in the upper airways. Snoring is a prominent symptom of OSAHS, and previous studies have attempted to identify the obstruction site of the upper airways by snoring sounds. Despite some progress, the classification of the obstruction site remains challenging in real-world clinical settings due to…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Accepted to INTERSPEECH 2023

  22. arXiv:2307.07218  [pdf, other]

    eess.AS cs.SD

    Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

    Authors: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

    Abstract: Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which si…

    Submitted 10 April, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

    Comments: Accepted by ICLR 2024

  23. arXiv:2307.00861  [pdf, other]

    cs.RO eess.SY

    Perch a quadrotor on planes by the ceiling effect

    Authors: Yuying Zou, Haotian Li, Yunfan Ren, Wei Xu, Yihang Li, Yixi Cai, Shenji Zhou, Fu Zhang

    Abstract: Perching is a promising solution for a small unmanned aerial vehicle (UAV) to save energy and extend operation time. This paper proposes a quadrotor that can perch on planar structures using the ceiling effect. Compared with the existing work, this perching method does not require any claws, hooks, or adhesive pads, leading to a simpler system design. This method does not limit the perching by sur…

    Submitted 3 July, 2023; originally announced July 2023.

  24. arXiv:2306.15304  [pdf, other]

    eess.AS cs.SD

    GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech

    Authors: Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu, Chunfeng Wang, Yi Ren, Xiang Yin, Zejun Ma

    Abstract: Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that is never trained in the target language. It encounters the following challenges: 1) timbre and pronunciation are correlated since multilingual speech of a specific speaker is usually hard to obtain; 2) style and pronunciation are mixed because the speech style…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  25. arXiv:2306.09215  [pdf, other]

    eess.SY

    On the Effects and Optimal Design of Redundant Sensors in Collaborative State Estimation

    Authors: Yunxiao Ren, Zhisheng Duan, Peihu Duan, Ling Shi

    Abstract: The existence of redundant sensors in collaborative state estimation is a common occurrence, yet their true significance remains elusive. This paper comprehensively investigates the effects and optimal design of redundant sensors in sensor networks that use Kalman filtering to estimate the state of a random process collaboratively. The paper presents two main results: a theoretical analysis of the…

    Submitted 4 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

  26. arXiv:2306.03509  [pdf, other]

    eess.AS cs.AI cs.SD

    Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

    Authors: Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

    Abstract: Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latent using audio codec and use autoregressive language models or diffusion models to generate it, which ignores the intrinsic nature of speech and may lead to inferior or un…

    Submitted 6 June, 2023; originally announced June 2023.

  27. arXiv:2306.03504  [pdf, other]

    cs.CV cs.SD eess.AS

    Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

    Authors: Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

    Abstract: We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking person video with the audio track as the training data and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technic…

    Submitted 2 August, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: Accepted by ICML 2023 Workshop, 6 pages, 3 figures

  28. arXiv:2305.18474  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

    Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao

    Abstract: Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since…

    Submitted 29 May, 2023; originally announced May 2023.

  29. arXiv:2305.17732  [pdf, other]

    cs.SD eess.AS

    StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

    Authors: Kun Song, Yi Ren, Yi Lei, Chunfeng Wang, Kun Wei, Lei Xie, Xiang Yin, Zejun Ma

    Abstract: Direct speech-to-speech translation (S2ST) has gradually become popular as it has many advantages compared with cascade S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores the speech style transfer from a source language to a target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in more p…

    Submitted 25 July, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  30. arXiv:2305.15403  [pdf, other]

    cs.CL cs.SD eess.AS

    AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

    Authors: Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao

    Abstract: Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual spe…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023

  31. arXiv:2305.13774  [pdf, other]

    cs.SD eess.AS

    ADD 2023: the Second Audio Deepfake Detection Challenge

    Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

    Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on s…

    Submitted 23 May, 2023; originally announced May 2023.

  32. arXiv:2305.13612  [pdf, other]

    cs.SD eess.AS

    FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

    Authors: Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, Zhou Zhao

    Abstract: Stutter removal is an essential scenario in the field of speech editing. However, when the speech recording contains stutters, the existing text-based speech editing approaches still suffer from: 1) the over-smoothing problem in the edited speech; 2) lack of robustness due to the noise introduced by stutter; 3) to remove the stutters, users are required to determine the edited region manually. To…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted by ACL 2023 (Findings)

  33. arXiv:2305.10763  [pdf, other]

    cs.SD eess.AS

    CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

    Authors: Zhenhui Ye, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin, Zhou Zhao

    Abstract: Improving text representation has attracted much attention to achieve expressive text-to-speech (TTS). However, existing works only implicitly learn the prosody with masked token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variance of the s…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted by ACL 2023 (Main Conference)

  34. arXiv:2305.09255  [pdf, other]

    cs.CR eess.SP

    Trust-Worthy Semantic Communications for the Metaverse Relying on Federated Learning

    Authors: Jianrui Chen, Jingjing Wang, Chunxiao Jiang, Yong Ren, Lajos Hanzo

    Abstract: As an evolving successor to the mobile Internet, the Metaverse creates the impression of an immersive environment, integrating the virtual as well as the real world. In contrast to the traditional mobile Internet based on servers, the Metaverse is constructed by billions of cooperating users by harnessing their smart edge devices having limited communication and computation resources. In this imme…

    Submitted 16 May, 2023; originally announced May 2023.

  35. arXiv:2305.06640  [pdf, other]

    eess.AS cs.AI cs.IT cs.SD

    Speaker Diaphragm Excursion Prediction: deep attention and online adaptation

    Authors: Yuwei Ren, Matt Zivney, Yin Huang, Eddie Choy, Chirag Patel, Hao Xu

    Abstract: Speaker protection algorithm is to leverage the playback signal properties to prevent over excursion while maintaining maximum loudness, especially for the mobile phone with tiny loudspeakers. This paper proposes efficient DL solutions to accurately model and predict the nonlinear excursion, which is challenging for conventional solutions. Firstly, we build the experiment and pre-processing pipeli…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: 5 pages, 4 figures, ICASSP 2023

  36. arXiv:2305.05152  [pdf, other]

    cs.SD cs.MM eess.AS

    Who is Speaking Actually? Robust and Versatile Speaker Traceability for Voice Conversion

    Authors: Yanzhen Ren, Hongcheng Zhu, Liming Zhai, Zongkun Sun, Rubing Shen, Lina Wang

    Abstract: Voice conversion (VC), as a voice style transfer technology, is becoming increasingly prevalent while raising serious concerns about its illegal use. Proactively tracing the origins of VC-generated speeches, i.e., speaker traceability, can prevent the misuse of VC, but unfortunately has not been extensively studied. In this paper, we are the first to investigate the speaker traceability for VC and…

    Submitted 26 July, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted by ACM MM 2023

  37. arXiv:2304.12995  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

    Authors: Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

    Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements…

    Submitted 25 April, 2023; originally announced April 2023.

  38. arXiv:2304.06793  [pdf, other]

    cs.NE cs.LG eess.IV

    Speck: A Smart event-based Vision Sensor with a low latency 327K Neuron Convolutional Neuronal Network Processing Pipeline

    Authors: Ole Richter, Yannan Xing, Michele De Marchi, Carsten Nielsen, Merkourios Katsimpris, Roberto Cattaneo, Yudi Ren, Yalun Hu, Qian Liu, Sadique Sheik, Tugba Demirci, Ning Qiao

    Abstract: Edge computing solutions that enable the extraction of high-level information from a variety of sensors is in increasingly high demand. This is due to the increasing number of smart devices that require sensory processing for their application on the edge. To tackle this problem, we present a smart vision sensor System on Chip (SoC), featuring an event-based camera and a low-power asynchronous spi…

    Submitted 27 May, 2024; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: accepted and presented at 28th IEEE International Symposium On Asynchronous Circuits and Systems (ASYNC) 2023

    Journal ref: IEEE ASYNC 2023

  39. arXiv:2303.13932  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

    Authors: Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

    Abstract: ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings. MUG includes five tracks, including topic segmentation, topic-level and session-level extractive summarization, topi…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece

  40. arXiv:2302.06611  [pdf, other]

    eess.IV

    Deep Learning and Medical Imaging for COVID-19 Diagnosis: A Comprehensive Survey

    Authors: Song Wu, Yazhou Ren, Aodi Yang, Xinyue Chen, Xiaorong Pu, Jing He, Liqiang Nie, Philip S. Yu

    Abstract: COVID-19 (Coronavirus disease 2019) has been quickly spreading since its outbreak, impacting financial markets and healthcare systems globally. Countries all around the world have adopted a number of extraordinary steps to restrict the spreading virus, where early COVID-19 diagnosis is essential. Medical images such as X-ray images and Computed Tomography scans are becoming one of the main diagnos…

    Submitted 12 February, 2023; originally announced February 2023.

  41. arXiv:2301.12661  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

    Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

    Abstract: Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses t…

    Submitted 29 January, 2023; originally announced January 2023.

    Comments: Audio samples are available at https://Text-to-Audio.github.io

  42. arXiv:2301.10295  [pdf, other]

    cs.CV cs.SD eess.AS

    Object Segmentation with Audio Context

    Authors: Kaihui Zheng, Yuqing Ren, Zixin Shen, Tianxu Qin

    Abstract: Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. For this project, we explore the multimodal feature aggregation for video instance segmentation task, in which we integrate audio features into our video segmentation model to conduct an audio-visual learning scheme. Our method is based on existing video instance segmentation…

    Submitted 3 January, 2023; originally announced January 2023.

    Comments: Research project for Introduction to Deep Learning (11785) at Carnegie Mellon University

  43. arXiv:2301.09080  [pdf, other]

    cs.MM cs.SD eess.AS

    Dance2MIDI: Dance-driven multi-instruments music generation

    Authors: Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han

    Abstract: Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instruments scenario is under-explored. The challenges associated with the dance-driven multi-instrument music (MIDI) generation are twofold: 1) no publicly available multi-instruments MIDI and video paired dataset and 2) the weak co…

    Submitted 27 February, 2024; v1 submitted 22 January, 2023; originally announced January 2023.

    Comments: has been accepted by Computational Visual Media Journal

  44. arXiv:2211.15002  [pdf]

    eess.SP cs.CV

    A Model-data-driven Network Embedding Multidimensional Features for Tomographic SAR Imaging

    Authors: Yu Ren, Xiaoling Zhang, Xu Zhan, Jun Shi, Shunjun Wei, Tianjiao Zeng

    Abstract: Deep learning (DL)-based tomographic SAR imaging algorithms are gradually being studied. Typically, they use an unfolding network to mimic the iterative calculation of the classical compressive sensing (CS)-based methods and process each range-azimuth unit individually. However, only one-dimensional features are effectively utilized in this way. The correlation between adjacent resolution units is…

    Submitted 27 November, 2022; originally announced November 2022.

  45. arXiv:2211.10666  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

    Authors: Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao

    Abstract: Video to sound generation aims to generate realistic and natural sound given a video input. However, previous video-to-sound generation methods can only generate a random or average timbre without any controls or specializations of the generated sound timbre, leading to the problem that people cannot obtain the desired timbre under these methods sometimes. In this paper, we pose the task of genera…

    Submitted 19 November, 2022; originally announced November 2022.

  46. arXiv:2211.04944  [pdf, other]

    cs.RO eess.SY

    Safety-Critical Optimal Control for Robotic Manipulators in A Cluttered Environment

    Authors: Xuda Ding, Han Wang, Yi Ren, Yu Zheng, Cailian Chen, Jianping He

    Abstract: Designing safety-critical control for robotic manipulators is challenging, especially in a cluttered environment. First, the actual trajectory of a manipulator might deviate from the planned one due to the complex collision environments and non-trivial dynamics, leading to collision; Second, the feasible space for the manipulator is hard to obtain since the explicit distance functions between coll…

    Submitted 10 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: Submitted to IEEE RA-L

  47. arXiv:2211.00222  [pdf, other]

    cs.SD cs.MM eess.AS

    SDMuse: Stochastic Differential Music Editing and Generation via Hybrid Representation

    Authors: Chen Zhang, Yi Ren, Kejun Zhang, Shuicheng Yan

    Abstract: While deep generative models have empowered music generation, it remains a challenging and under-explored problem to edit an existing musical piece at fine granularity. In this paper, we propose SDMuse, a unified Stochastic Differential Music editing and generation framework, which can not only compose a whole musical piece from scratch, but also modify existing musical pieces in many ways, such a…

    Submitted 2 November, 2022; v1 submitted 31 October, 2022; originally announced November 2022.

  48. arXiv:2210.11069  [pdf, other]

    cs.IT cs.AR eess.SP

    Pipelined Architecture for Soft-decision Iterative Projection Aggregation Decoding for RM Codes

    Authors: Marzieh Hashemipour-Nazari, Yuqing Ren, Kees Goossens, Alexios Balatsoukas-Stimming

    Abstract: The recently proposed recursive projection-aggregation (RPA) decoding algorithm for Reed-Muller codes has received significant attention as it provides near-ML decoding performance at reasonable complexity for short codes. However, its complicated structure makes it unsuitable for hardware implementation. Iterative projection-aggregation (IPA) decoding is a modified version of RPA decoding that si…

    Submitted 6 September, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

  49. Denoising of 3D MR images using a voxel-wise hybrid residual MLP-CNN model to improve small lesion diagnostic confidence

    Authors: Haibo Yang, Shengjie Zhang, Xiaoyang Han, Botao Zhao, Yan Ren, Yaru Sheng, Xiao-Yong Zhang

    Abstract: Small lesions in magnetic resonance imaging (MRI) images are crucial for clinical diagnosis of many kinds of diseases. However, the MRI quality can be easily degraded by various noise, which can greatly affect the accuracy of diagnosis of small lesion. Although some methods for denoising MR images have been proposed, task-specific denoising methods for improving the diagnosis confidence of small l…

    Submitted 27 September, 2022; originally announced September 2022.

    Comments: accepted by MICCAI 2022

  50. AETomo-Net: A Novel Deep Learning Network for Tomographic SAR Imaging Based on Multi-dimensional Features

    Authors: Yu Ren, Xiaoling Zhang, Yunqiao Hu, Xu Zhan

    Abstract: Tomographic synthetic aperture radar (TomoSAR) imaging algorithms based on deep learning can effectively reduce computational costs. The idea of existing researches is to reconstruct the elevation for each range-azimuth cell in one-dimensional using a deep-unfolding network. However, since these methods are commonly sensitive to signal sparsity level, it usually leads to some drawbacks like contin…

    Submitted 21 September, 2022; originally announced September 2022.