Search | arXiv e-print repository

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Authors: Rubing Shen, Yanzhen Ren, Zongkun Sun

Abstract: Generative adversarial network (GAN) based vocoders have attracted significant attention in speech synthesis for their high quality and fast inference speed. However, many noticeable spectral artifacts remain, degrading the quality of synthesized speech. In this work, we adopt a novel GAN-based vocoder designed for few artifacts and high fidelity, called FA-GAN. To suppress the aliasing artifacts caused by non-ideal upsampling layers in high-frequency components, we introduce the anti-aliased twin deconvolution module in the generator. To alleviate blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss to assist in the modeling of phase information. Experimental results reveal that FA-GAN outperforms the compared approaches in promoting audio quality and alleviating spectral artifacts, and exhibits superior performance when applied to unseen speaker scenarios.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2305.12552 [pdf, other]

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing

Authors: Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao

Abstract: Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given relational databases. It has traditionally been implemented in a cascaded manner and faces the following challenges: 1) model training suffers from data scarcity, since only limited parallel data is available; and 2) the systems should be robust enough to handle diverse out-of-domain speech samples that differ from the source data. In this work, we propose the first direct speech-to-SQL parsing model, Wav2SQL, which avoids error compounding across cascaded systems. Specifically, 1) to accelerate speech-driven SQL parsing research in the community, we release a large-scale and multi-speaker dataset, MASpider; 2) leveraging recent progress in large-scale pre-training, we show that it alleviates the data scarcity issue and allows for direct speech-to-SQL parsing; and 3) we include speech re-programming and gradient reversal classifier techniques to reduce acoustic variance and learn style-agnostic representations, improving generalization to unseen out-of-domain custom data. Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results, with up to 2.5% accuracy improvement over the baseline.

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2305.05152 [pdf, other]

Who is Speaking Actually? Robust and Versatile Speaker Traceability for Voice Conversion

Authors: Yanzhen Ren, Hongcheng Zhu, Liming Zhai, Zongkun Sun, Rubing Shen, Lina Wang

Abstract: Voice conversion (VC), as a voice style transfer technology, is becoming increasingly prevalent while raising serious concerns about its illegal use. Proactively tracing the origins of VC-generated speech, i.e., speaker traceability, can prevent the misuse of VC, but unfortunately has not been extensively studied. In this paper, we are the first to investigate speaker traceability for VC and propose a traceable VC framework named VoxTracer. VoxTracer is similar to, but goes beyond, the paradigm of audio watermarking. We first use a unique speaker embedding to represent speaker identity. Then we design a VAE-Glow structure, in which the hiding process imperceptibly integrates the source speaker identity into the VC, and the tracing process accurately recovers the source speaker identity and even the source speech in spite of severe speech quality degradation. To address the speech mismatch between the hiding and tracing processes affected by different distortions, we also adopt an asynchronous training strategy to optimize the VAE-Glow models. VoxTracer is versatile enough to be applied to arbitrary VC methods and popular audio coding standards. Extensive experiments demonstrate that VoxTracer achieves not only high imperceptibility in hiding, but also nearly 100% tracing accuracy against various types of lossy audio compression (AAC, MP3, Opus and SILK) with a broad range of bitrates (16 kbps - 128 kbps), even for a very short duration (0.74 s). Our speech demo is available at https://anonymous.4open.science/w/DEMOofVoxTracer.

Submitted 26 July, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted by ACM MM 2023

arXiv:2212.07656 [pdf, other]

Hybrid stability augmentation control of multi-rotor UAV in confined space based on adaptive backstepping control

Authors: QuanXi Zhan, JunRui Zhang, ChenYang Sun, RunJie Shen, Bin He

Abstract: This paper applies a UAV to the inspection of water diversion pipelines in hydropower stations. The diversion pipeline is an enclosed space, so the airflow disturbance caused by the rotation of the UAV blades and the strong air convection from the chimney effect have a great impact on the flight control of the UAV. Although the traditional linear PID flight control algorithm has been widely used and can meet the requirements of general flight tasks, it cannot guarantee the stability of the system over a wide range. The inspection of a diversion pipeline in an enclosed space requires high system stability and robustness of the UAV controller. In this paper, a hybrid stabilised adaptive backstepping control method is proposed. Firstly, a multi-rotor UAV model is analysed and transformed into a strict feedback form with external disturbances; then adaptive techniques are used to estimate the airflow disturbances caused by the blades, and the attitude and position tracking controllers are designed by combining backstepping control and PID control respectively; finally, the asymptotic stability of the system is ensured by constructing a Lyapunov function. The experimental data show that the flight controller designed in this paper has good robustness and tracking performance, and can better resist airflow disturbances in a confined space.

Submitted 15 December, 2022; originally announced December 2022.

Comments: 7 pages

arXiv:2103.09135 [pdf, other]

Air-to-Ground Directional Channel Sounder With 64-antenna Dual-polarized Cylindrical Array

Authors: Jorge Gomez Ponce, Thomas Choi, Naveed A. Abbasi, Aldo Adame, Alexander Alvarado, Colton Bullard, Ruiyi Shen, Fred Daneshgaran, Harpreet S. Dhillon, Andreas F. Molisch

Abstract: Unmanned Aerial Vehicles (UAVs), popularly called drones, are an important part of future wireless communications, either as user equipment that needs communication with a ground station, or as a base station in a 3D network. For both the analysis of the "useful" links and the investigation of possible interference to other ground-based nodes, an understanding of the air-to-ground channel is required. Since ground-based nodes are often equipped with antenna arrays, the channel investigations need to account for them. This study presents a massive MIMO-based air-to-ground channel sounder we have recently developed in our lab, which can perform measurements for the aforementioned requirements. After outlining the principle and functionality of the sounder, we present sample measurements that demonstrate its capabilities and give first insights into air-to-ground massive MIMO channels in an urban environment. Our results provide a platform for future investigations and possible enhancements of massive MIMO systems.

Submitted 13 February, 2021; originally announced March 2021.

arXiv:1405.0806 [pdf]

Design of a capacitor-less low-dropout voltage regulator

Authors: X. R. Li, D. B. Pei, Q. Liu, R. Shen

Abstract: A solution to the stability of capacitor-less low-dropout regulators, using a 4 pF Miller capacitor in a multi-level current amplifier, is proposed. With the Miller compensation, a phase margin of more than 50° is guaranteed at full load. An extra fast transient circuit is adopted to reduce stable time and peak voltage. When the load changes from light to heavy, the peak voltage is 40 mV and the chip quiescent current is only 45 µA.

Submitted 5 May, 2014; originally announced May 2014.

Comments: 6 pages, 7 figures

Showing 1–6 of 6 results for author: Shen, R