Search | arXiv e-print repository

Showing 1–50 of 134 results for author: Yang, D

Searching in archive eess.
  1. arXiv:2406.17897  [pdf, other]

    eess.IV

    Pixel-weighted Multi-pose Fusion for Metal Artifact Reduction in X-ray Computed Tomography

    Authors: Diyu Yang, Craig A. J. Kemp, Soumendu Majee, Gregery T. Buzzard, Charles A. Bouman

    Abstract: X-ray computed tomography (CT) reconstructs the internal morphology of a three-dimensional object from a collection of projection images, most commonly using a single rotation axis. However, for objects containing dense materials like metal, the use of a single rotation axis may leave some regions of the object obscured by the metal, even though projections from other rotation axes (or poses) migh…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Submitted to IEEE MMSP 2024. arXiv admin note: substantial text overlap with arXiv:2209.07561

  2. arXiv:2406.10056  [pdf, other]

    cs.SD eess.AS

    UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

    Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

    Abstract: Large language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr…

    Submitted 14 June, 2024; originally announced June 2024.

  3. arXiv:2406.08336  [pdf, other]

    cs.SD cs.CV eess.AS

    CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (…

    Submitted 24 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  4. arXiv:2406.06329  [pdf, other]

    cs.CL eess.AS

    A Parameter-efficient Language Extension Framework for Multilingual ASR

    Authors: Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

    Abstract: Covering all languages with a multilingual speech recognition model (MASR) is very difficult. Performing language extension on top of an existing MASR is a desirable choice. In this study, the MASR continual learning problem is probabilistically decomposed into language identity prediction (LP) and cross-lingual adaptation (XLA) sub-problems. Based on this, we propose an architecture-based framewo…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  5. arXiv:2406.02940  [pdf, other]

    cs.SD eess.AS

    Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

    Abstract: VQ-VAE, a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c…

    Submitted 5 June, 2024; originally announced June 2024.

  6. arXiv:2406.02328  [pdf, other]

    cs.SD eess.AS

    SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

    Authors: Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

    Abstract: In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) It can be trained on a speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech in an NAR way; (3) It tries to model speech in a finite and compac…

    Submitted 14 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  7. arXiv:2405.03141  [pdf, other]

    eess.IV cs.AI cs.CV physics.med-ph

    Automatic Ultrasound Curve Angle Measurement via Affinity Clustering for Adolescent Idiopathic Scoliosis Evaluation

    Authors: Yihao Zhou, Timothy Tin-Yan Lee, Kelly Ka-Lee Lai, Chonglin Wu, Hin Ting Lau, De Yang, Chui-Yi Chan, Winnie Chiu-Wing Chu, Jack Chun-Yiu Cheng, Tsz-Ping Lam, Yong-Ping Zheng

    Abstract: The current clinical gold standard for evaluating adolescent idiopathic scoliosis (AIS) is X-ray radiography, using Cobb angle measurement. However, the frequent monitoring of the AIS progression using X-rays poses a challenge due to the cumulative radiation exposure. Although 3D ultrasound has been validated as a reliable and radiation-free alternative for scoliosis assessment, the process of mea…

    Submitted 6 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

  8. arXiv:2404.04427  [pdf]

    physics.med-ph eess.IV

    A comprehensive liver CT landmark pair dataset for evaluating deformable image registration algorithms

    Authors: Zhendong Zhang, Edward Robert Criscuolo, Yao Hao, Deshan Yang

    Abstract: Purpose: Evaluating deformable image registration (DIR) algorithms is vital for enhancing algorithm performance and gaining clinical acceptance. However, there's a notable lack of dependable DIR benchmark datasets for assessing DIR performance except for lung images. To address this gap, we aim to introduce our comprehensive liver computed tomography (CT) DIR landmark dataset library. Acquisitio…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: 17 pages, 6 figures

  9. arXiv:2404.03204  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

    Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

    Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. Th…

    Submitted 19 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  10. arXiv:2403.13720  [pdf, other]

    cs.SD eess.AS

    UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

    Authors: Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito

    Abstract: We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained on only speech c…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: 5 pages, 3 figures

  11. arXiv:2403.12970  [pdf]

    eess.IV cs.CV physics.bio-ph physics.optics

    Hybrid deep learning and physics-based neural network for programmable illumination computational microscopy

    Authors: Ruiqing Sun, Delong Yang, Shaohui Zhang, Qun Hao

    Abstract: Relying on deep models and relying on physical models are the two mainstream approaches for solving inverse sample reconstruction problems in programmable illumination computational microscopy. Solutions based on physical models possess strong generalization capabilities while struggling with global optimization of inverse problems due to a lack of sufficient physical constraints. In contrast, deep learn…

    Submitted 17 January, 2024; originally announced March 2024.

  12. arXiv:2403.11974  [pdf, other]

    eess.IV cs.CV

    OUCopula: Bi-Channel Multi-Label Copula-Enhanced Adapter-Based CNN for Myopia Screening Based on OU-UWF Images

    Authors: Yang Li, Qiuyi Huang, Chong Zhong, Danjuan Yang, Meiyan Li, A. H. Welsh, Aiyi Liu, Bo Fu, Catherien C. Liu, Xingtao Zhou

    Abstract: Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging is potentially significant for ophthalmic outcomes. Current multidisciplinary research between ophthalmology and deep learning (DL) concentrates primarily on disease classification and diagnosis using single-eye images, largely ignoring joint modeling and prediction for Oculus Uterque (OU, both eyes). Inspired by the complex…

    Submitted 18 March, 2024; originally announced March 2024.

  13. arXiv:2403.03100  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di…

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  14. arXiv:2402.00288  [pdf, other]

    eess.AS cs.SD

    Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito

    Abstract: Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data. To this end, we propose a self-training method for training a breath detection model that can automatically detect breath positions in speech. Our method trains the model using a large speech corpus and in…

    Submitted 14 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: Accepted by INTERSPEECH2024

  15. arXiv:2401.03800  [pdf, other]

    cs.CV eess.IV

    MvKSR: Multi-view Knowledge-guided Scene Recovery for Hazy and Rainy Degradation

    Authors: Dong Yang, Wenyu Xu, Yuan Gao, Yuxu Lu, Jingming Zhang, Yu Guo

    Abstract: High-quality imaging is crucial for ensuring safety supervision and intelligent deployment in fields like transportation and industry. It enables precise and detailed monitoring of operations, facilitating timely detection of potential hazards and efficient management. However, adverse weather conditions, such as atmospheric haziness and precipitation, can have a significant impact on image qualit…

    Submitted 8 January, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

  16. arXiv:2401.03689  [pdf, other]

    eess.AS cs.SD

    LUPET: Incorporating Hierarchical Information Path into Multilingual ASR

    Authors: Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

    Abstract: Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model design have demonstrated their effectiveness independently. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representation. It is expected that leveraging their benefits synergistically i…

    Submitted 10 June, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech 2024

  17. arXiv:2401.02118  [pdf, other]

    cs.IT eess.SP

    Radio Map-Based Spectrum Sharing for Joint Communication and Sensing

    Authors: Xionran Fang, Wei Feng, Yunfei Chen, Dingxi Yang, Ning Ge, Zhiyong Feng, Yue Gao

    Abstract: The sixth-generation (6G) network is expected to provide both communication and sensing (C&S) services. However, spectrum scarcity poses a major challenge to the harmonious coexistence of C&S systems. Without effective cooperation, the interference resulting from spectrum sharing impairs the performance of both systems. This paper addresses C&S interference within a distributed network. Different…

    Submitted 27 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

  18. arXiv:2312.15463  [pdf, other]

    eess.AS cs.SD

    Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: Query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain the query embedding. In this way, the separation model is optimized to adapt to the distribution of the query embedding. However, the query embedding may exhibit mismatches with separation models due to in…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  19. arXiv:2311.18168  [pdf, other]

    cs.CV cs.LG eess.AS

    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

    Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel

    Abstract: We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D f…

    Submitted 29 November, 2023; originally announced November 2023.

  20. arXiv:2311.17790  [pdf, other]

    cs.SD eess.AS

    FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition

    Authors: Dongning Yang, Wei Wang, Yanmin Qian

    Abstract: Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages…

    Submitted 29 November, 2023; originally announced November 2023.

  21. arXiv:2311.04685  [pdf, other]

    eess.IV

    An End-Cloud Computing Enabled Surveillance Video Transmission System

    Authors: Dingxi Yang, Zhijin Qin, Liting Wang, Xiaoming Tao, Fang Cui, Hengjiang Wang

    Abstract: The enormous data volume of video poses a significant burden on the network. Particularly, transferring high-definition surveillance videos to the cloud consumes a significant amount of spectrum resources. To address these issues, we propose a surveillance video transmission system enabled by end-cloud computing. Specifically, the cameras actively down-sample the original video and then a redundan…

    Submitted 8 November, 2023; originally announced November 2023.

  22. arXiv:2310.04567  [pdf, other]

    eess.AS cs.SD

    DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

    Authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

    Abstract: Common target sound extraction (TSE) approaches primarily relied on discriminative approaches in order to separate the target sound while minimizing interference from the unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, a first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve…

    Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  23. arXiv:2310.04114  [pdf, other]

    eess.IV cs.CV

    Aorta Segmentation from 3D CT in MICCAI SEG.A. 2023 Challenge

    Authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu

    Abstract: The aorta provides the main blood supply of the body. Screening of the aorta with imaging helps early aortic disease detection and monitoring. In this work, we describe our solution to the Segmentation of the Aorta (SEG.A.231) from 3D CT challenge. We use the automated segmentation method Auto3DSeg available in MONAI. Our solution achieves an average Dice score of 0.920 and 95th percentile of the Hausdorf…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: MICCAI 2023, SEG.A. 2023 challenge 1st place

  24. arXiv:2310.02862  [pdf, other]

    cs.LG cs.AI eess.SP

    A novel asymmetrical autoencoder with a sparsifying discrete cosine Stockwell transform layer for gearbox sensor data compression

    Authors: Xin Zhu, Daoguang Yang, Hongyi Pan, Hamid Reza Karimi, Didem Ozevin, Ahmet Enis Cetin

    Abstract: The lack of an efficient compression model remains a challenge for the wireless transmission of gearbox data in non-contact gear fault diagnosis problems. In this paper, we present a signal-adaptive asymmetrical autoencoder with a transform domain layer to compress sensor signals. First, a new discrete cosine Stockwell transform (DCST) layer is introduced to replace linear layers in a multi-layer…

    Submitted 4 October, 2023; originally announced October 2023.

  25. arXiv:2310.00704  [pdf, other]

    cs.SD eess.AS

    UniAudio: An Audio Foundation Model Toward Universal Audio Generation

    Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng

    Abstract: Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other con…

    Submitted 11 December, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  26. arXiv:2309.17269  [pdf, ps, other]

    eess.IV cs.CV

    Unpaired Optical Coherence Tomography Angiography Image Super-Resolution via Frequency-Aware Inverse-Consistency GAN

    Authors: Weiwen Zhang, Dawei Yang, Haoxuan Che, An Ran Ran, Carol Y. Cheung, Hao Chen

    Abstract: For optical coherence tomography angiography (OCTA) images, a limited scanning rate leads to a trade-off between field-of-view (FOV) and imaging resolution. Although larger FOV images may reveal more parafoveal vascular lesions, their application is greatly hampered due to lower resolution. To increase the resolution, previous works only achieved satisfactory performance by using paired data for t…

    Submitted 29 September, 2023; originally announced September 2023.

    Comments: 10 pages, 9 figures

  27. arXiv:2309.02285  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS 2: Describing and Generating Voices with Text Prompt

    Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

    Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text…

    Submitted 11 October, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Demo page: https://speechresearch.github.io/prompttts2

  28. arXiv:2309.01212  [pdf, other]

    cs.SD eess.AS

    NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

    Authors: Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou

    Abstract: The goal of speech enhancement (SE) is to eliminate the background interference from the noisy speech signal. Generative models such as diffusion models (DM) have been applied to the task of SE because of better generalization in unseen noisy scenes. Technical routes for the DM-based SE methods can be summarized into three types: task-adapted diffusion process formulation, generator-plus-condition…

    Submitted 3 September, 2023; originally announced September 2023.

  29. arXiv:2308.06891  [pdf]

    cs.RO eess.SY

    Viia-hand: a Reach-and-grasp Restoration System Integrating Voice interaction, Computer vision and Auditory feedback for Blind Amputees

    Authors: Chunhao Peng, Dapeng Yang, Ming Cheng, Jinghui Dai, Deyu Zhao, Li Jiang

    Abstract: Visual feedback plays a crucial role in the process of amputation patients completing grasping in the field of prosthesis control. However, for blind and visually impaired (BVI) amputees, the loss of both visual and grasping abilities makes the "easy" reach-and-grasp task a feasible challenge. In this paper, we propose a novel multi-sensory prosthesis system helping BVI amputees with sensing, navi…

    Submitted 13 August, 2023; originally announced August 2023.

  30. Cooperative Decision-Making in Shared Spaces: Making Urban Traffic Safer through Human-Machine Cooperation

    Authors: Balint Varga, Dongxu Yang, Sören Hohmann

    Abstract: In this paper, a cooperative decision-making is presented, which is suitable for intention-aware automated vehicle functions. With an increasing number of highly automated and autonomous vehicles on public roads, trust is a very important issue regarding their acceptance in our society. The most challenging scenarios arise at low driving speeds of these highly automated and autonomous vehicles, wh…

    Submitted 26 June, 2023; originally announced June 2023.

    Journal ref: 2023 IEEE 21st Jubilee International Symposium on Intelligent Systems and Informatics (SISY)

  31. arXiv:2305.19269  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Make-A-Voice: Unified Voice Synthesis With Discrete Representation

    Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

    Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speak…

    Submitted 30 May, 2023; originally announced May 2023.

  32. arXiv:2305.18474  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

    Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao

    Abstract: Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since…

    Submitted 29 May, 2023; originally announced May 2023.

  33. arXiv:2305.05835  [pdf, other]

    eess.IV cs.CV cs.LG

    Reference-based OCT Angiogram Super-resolution with Learnable Texture Generation

    Authors: Yuyan Ruan, Dawei Yang, Ziqi Tang, An Ran Ran, Carol Y. Cheung, Hao Chen

    Abstract: Optical coherence tomography angiography (OCTA) is a new imaging modality to visualize retinal microvasculature and has been readily adopted in clinics. High-resolution OCT angiograms are important to qualitatively and quantitatively identify potential biomarkers for different retinal diseases accurately. However, one significant problem of OCTA is the inevitable decrease in resolution when increa…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: 12 pages, 11 figures

    MSC Class: 68T07

    ACM Class: I.2; I.4

  34. arXiv:2305.02765  [pdf, other]

    cs.SD eess.AS

    HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, Yuexian Zou

    Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encode…

    Submitted 7 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: The second version of HiFi-Codec

  35. arXiv:2304.12995  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

    Authors: Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

    Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements…

    Submitted 25 April, 2023; originally announced April 2023.

  36. arXiv:2304.10167  [pdf]

    physics.optics cs.ET eess.IV

    Adaptive coded illumination Fourier ptychography microscopy based on physical neural network

    Authors: Ruiqing Sun, Delong Yang, Yao Hu, Qun Hao, Xin Li, Shaohui Zhang

    Abstract: Fourier Ptychographic Microscopy (FPM) is a computational technique that achieves a large space-bandwidth product imaging. It addresses the challenge of balancing a large field of view and high resolution by fusing information from multiple images taken with varying illumination angles. Nevertheless, the conventional FPM framework always suffers from long acquisition time and a heavy computational bur…

    Submitted 20 April, 2023; originally announced April 2023.

  37. arXiv:2304.03966  [pdf, other]

    eess.SY

    A Smart Switch Configuration and Reliability Assessment Method for Large-Scale Offshore Wind Farm Electrical Collector System

    Authors: Xiaochi Ding, Xinwei Shen, Qiuwei Wu, Liming Wang, Dechang Yang

    Abstract: With the development of offshore wind farms (OWFs) in far-offshore and deep-sea areas, each OWF could contain more and more wind turbines and cables, making it imperative to study high-reliability electrical collector system (ECS) for OWF. Enlightened by active distribution network, for OWF, we propose an ECS switch configuration that enables post-fault network recovery, along with a reliability a…

    Submitted 8 April, 2023; originally announced April 2023.

    Comments: 10 pages

  38. Intention-Aware Decision-Making for Mixed Intersection Scenarios

    Authors: Balint Varga, Dongxu Yang, Soeren Hohmann

    Abstract: This paper presents a white-box intention-aware decision-making for the handling of interactions between a pedestrian and an automated vehicle (AV) in an unsignalized street crossing scenario. Moreover, a design framework has been developed, which enables automated parameterization of the decision-making. This decision-making is designed in such a manner that it can understand pedestrians in urban…

    Submitted 29 March, 2023; originally announced March 2023.

  39. arXiv:2303.12823  [pdf, other]

    eess.SY cs.AI

    Data-Driven Leader-following Consensus for Nonlinear Multi-Agent Systems against Composite Attacks: A Twins Layer Approach

    Authors: Xin Gong, Jintao Peng, Dong Yang, Zhan Shu, Tingwen Huang, Yukang Cui

    Abstract: This paper studies the leader-following consensuses of uncertain and nonlinear multi-agent systems against composite attacks (CAs), including Denial of Service (DoS) attacks and actuation attacks (AAs). A double-layer control framework is formulated, where a digital twin layer (TL) is added beside the traditional cyber-physical layer (CPL), inspired by the recent Digital Twin technology. Consequen…

    Submitted 22 March, 2023; originally announced March 2023.

  40. arXiv:2303.05681  [pdf, other]

    cs.SD eess.AS

    Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

    Authors: Yifei Xin, Dongchao Yang, Yuexian Zou

    Abstract: In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and audio, the semantic information contained in the text is only similar to certain frames within the audio. Yet, existing works aggregate the entire audio without considering the text, such as mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text.…

    Submitted 30 March, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

  41. arXiv:2303.05678  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Weakly Supervised Sound Event Detection with Causal Intervention

    Authors: Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou

    Abstract: Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision. To tackle this issue, we fir…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP2023

  42. arXiv:2302.13921  [pdf, other]

    eess.IV eess.SP

    Autonomous Polycrystalline Material Decomposition for Hyperspectral Neutron Tomography

    Authors: Mohammad Samin Nur Chowdhury, Diyu Yang, Shimin Tang, Singanallur V. Venkatakrishnan, Hassina Z. Bilheux, Gregery T. Buzzard, Charles A. Bouman

    Abstract: Hyperspectral neutron tomography is an effective method for analyzing crystalline material samples with complex compositions in a non-destructive manner. Since the counts in the hyperspectral neutron radiographs directly depend on the neutron cross-sections, materials may exhibit contrasting neutron responses across wavelengths. Therefore, it is possible to extract the unique signatures associated…

    Submitted 21 August, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

  43. arXiv:2302.13652  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

    Abstract: Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-spe…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  44. arXiv:2302.09785  [pdf, other]

    eess.IV cs.CV

    Towards Simultaneous Segmentation of Liver Tumors and Intrahepatic Vessels via Cross-attention Mechanism

    Authors: Haopeng Kuang, Dingkang Yang, Shunli Wang, Xiaoying Wang, Lihua Zhang

    Abstract: Accurate visualization of liver tumors and their surrounding blood vessels is essential for noninvasive diagnosis and prognosis prediction of tumors. In medical image segmentation, there is still a lack of in-depth research on the simultaneous segmentation of liver tumors and peritumoral blood vessels. To this end, we collect the first liver tumor, and vessel segmentation benchmark datasets contai…

    Submitted 20 February, 2023; originally announced February 2023.

    Comments: accepted to ICASSP 2023

  45. arXiv:2302.07584  [pdf, other]

    eess.AS cs.IT cs.SD eess.SP

    Fast and Blind Speech Copy-Move Detection and Localization in Noise

    Authors: Dong Yang, Mingle Liu, Muyong Cao

    Abstract: Copy-move forgery on speech (CMF), coupled with post-processing techniques, presents a great challenge to the forensic detection and localization of tampered areas. Most of the existing CMF detection approaches necessitate pre-segmentation of speech to facilitate similarity calculations among these segments. However, these approaches usually suffer from the problems of uncontrollable computational…

    Submitted 8 September, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

  46. arXiv:2301.13662  [pdf, other]

    cs.SD eess.AS

    InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng

    Abstract: Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according to human's demands. Nowadays, there are two common ways to control speaking styles: (1) Pre-defining a group of speaking style and using categorical index to denote different speaking style. However, there are limitations in the diversity of expressiveness, as these models can only generate the pre-defined…

    Submitted 25 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: Submitted to TASLP

  47. arXiv:2301.12661  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

    Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

    Abstract: Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses t…

    Submitted 29 January, 2023; originally announced January 2023.

    Comments: Audio samples are available at https://Text-to-Audio.github.io

  48. arXiv:2212.11233  [pdf]

    eess.IV cs.CR

    Realization Scheme for Visual Cryptography with Computer-generated Holograms

    Authors: Tao Yu, Jinge Ma, Guilin Li, Dongyu Yang, Rui Ma, Yishi Shi

    Abstract: We propose to realize visual cryptography in an indirect way with the help of computer-generated hologram. At present, the recovery method of visual cryptography is mainly superimposed on transparent film or superimposed by computer equipment, which greatly limits the application range of visual cryptography. In this paper, the shares of the visual cryptography were encoded with computer-generated…

    Submitted 9 December, 2022; originally announced December 2022.

    Comments: International Workshop on Holography and related technologies (IWH) 2018

  49. arXiv:2212.07048  [pdf, other]

    cs.CV eess.IV

    PD-Quant: Post-Training Quantization based on Prediction Difference Metric

    Authors: Jiawei Liu, Lin Niu, Zhihang Yuan, Dawei Yang, Xinggang Wang, Wenyu Liu

    Abstract: Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types. Although it can help reduce the size and computational cost of deep neural networks, it can also introduce quantization noise and reduce prediction accuracy, especially in extremely low-bit settings. How to determine the appropriat…

    Submitted 27 March, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

  50. arXiv:2212.00647  [pdf, other]

    eess.IV physics.med-ph

    An Edge Alignment-based Orientation Selection Method for Neutron Tomography

    Authors: Diyu Yang, Shimin Tang, Singanallur V. Venkatakrishnan, Mohammad S. N. Chowdhury, Yuxuan Zhang, Hassina Z. Bilheux, Gregery T. Buzzard, Charles A. Bouman

    Abstract: Neutron computed tomography (nCT) is a 3D characterization technique used to image the internal morphology or chemical composition of samples in biology and materials sciences. A typical workflow involves placing the sample in the path of a neutron beam, acquiring projection data at a predefined set of orientations, and processing the resulting data using an analytic reconstruction algorithm. Typi…

    Submitted 8 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.