Search | arXiv e-print repository

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

Authors: Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Conghui He, Dahua Lin, Jiaqi Wang

Abstract: We present SongComposer, an innovative LLM designed for song composition. It can understand and generate melodies and lyrics in symbolic song representations by leveraging the capabilities of LLMs. Existing music-related LLMs treat music as quantized audio signals, but such implicit encoding leads to inefficient encoding and poor flexibility. In contrast, we resort to symbolic song representation, the mature and efficient way humans designed for music, and enable the LLM to explicitly compose songs like humans. In practice, we design a novel tuple format for the lyric and three note attributes (pitch, duration, and rest duration) in the melody, which guarantees the correct LLM understanding of musical symbols and realizes precise alignment between lyrics and melody. To impart basic music understanding to the LLM, we carefully collected SongCompose-PT, a large-scale song pretraining dataset that includes lyrics, melodies, and paired lyrics-melodies in either Chinese or English. After adequate pre-training, 10K carefully crafted QA pairs are used to empower the LLM with the instruction-following capability and solve diverse tasks. With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation, outperforming advanced LLMs like GPT-4.

Submitted 27 February, 2024; originally announced February 2024.

Comments: project page: https://pjlab-songcomposer.github.io/ code: https://github.com/pjlab-songcomposer/songcomposer

arXiv:2401.12238 [pdf, other]

Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms

Authors: Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, Juan P. Bello

Abstract: Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially-localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improvements as a direct function of acoustic diversity. These results show that SpatialScaper is valuable to train robust SELD models.

Submitted 19 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, 1 table, to be presented at ICASSP 2024 in Seoul, South Korea

arXiv:2312.08553 [pdf, other]

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

Authors: Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method. A series of ablation studies validate the effectiveness of up to int4 quantization and 2:4 sparsity. However, a single compression technique fails to recover the performance well under extreme setups including int2 quantization and 1:4 sparsity. By contrast, our proposed method can compress the model to have 9.4% of the size, at the cost of only 7.3% relative word error rate (WER) regressions. We also provided in-depth analyses on the results and discussions on the limitations and potential solutions, which would be valuable for future studies.

Submitted 16 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024. Preprint

arXiv:2311.07703 [pdf, other]

Measuring Entrainment in Spontaneous Code-switched Speech

Authors: Debasmita Bhattacharya, Siying Ding, Alayna Nguyen, Julia Hirschberg

Abstract: It is well-known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans, finding that (1) patterns of written and spoken entrainment in monolingual settings largely generalize to code-switched settings, and (2) some patterns of entrainment on code-switching in dialogue agent-generated text generalize to spontaneous code-switched speech. Our findings give rise to important implications for the potentially "universal" nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology.

Submitted 26 March, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Edits: camera-ready manuscript for NAACL 2024

arXiv:2310.08418 [pdf, ps, other]

doi 10.1109/TSG.2024.3420743

Privacy-Preserved Aggregate Thermal Dynamic Model of Buildings

Authors: Zeyin Hou, Shuai Lu, Yijun Xu, Haifeng Qiu, Wei Gu, Zhaoyang Dong, Shixing Ding

Abstract: The thermal inertia of buildings brings considerable flexibility to the heating and cooling load, which is known to be a promising demand response resource. The aggregate model that can describe the thermal dynamics of the building cluster is an important interface for energy systems to exploit its intrinsic thermal inertia. However, the private information of users, such as the indoor temperature and heating/cooling power, needs to be collected in the parameter estimation procedure to obtain the aggregate model, causing severe privacy concerns. In light of this, we propose a novel privacy-preserved parameter estimation approach to infer the aggregate model for the thermal dynamics of the building cluster for the first time. Using it, the parameters of the aggregate thermal dynamic model (ATDM) can be obtained by the load aggregator without accessing the individual's private information. More specifically, this method not only exploits the block coordinate descent (BCD) method to resolve the non-convexity in the estimation but also investigates the transformation-based encryption (TE) associated with its secure aggregation protocol (SAP) techniques to realize privacy-preserved computation. Its capability of preserving privacy is also theoretically proven. Finally, simulation results using real-world data demonstrate the accuracy and privacy-preserved performance of our proposed method.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.09953 [pdf, other]

PINN-based viscosity solution of HJB equation

Authors: Tianyu Liu, Steven Ding, Jiarui Zhang, Liutao Zhou

Abstract: This paper proposes a novel PINN-based viscosity solution for HJB equations. Although existing work uses PINNs to solve HJB equations, none of it gives the solution in the viscosity sense. This paper shows that, by using a convex neural network, one can guarantee the viscosity solution, and thus the neural network can easily converge to the true solution of the HJB equation regardless of the starting point.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.02732 [pdf, other]

A study on fault diagnosis in nonlinear dynamic systems with uncertainties

Authors: Steven X. Ding, Linlin Li

Abstract: In this draft, fault diagnosis in nonlinear dynamic systems is addressed. The objective of this work is to establish a framework, in which not only model-based but also data-driven and machine learning based fault diagnosis strategies can be uniformly handled. Instead of the well-established input-output and the associated state space models, stable image and kernel representations are adopted in our work as the basic process model forms. Based on it, the nominal system dynamics can then be modelled as a lower-dimensional manifold embedded in the process data space. To achieve a reliable fault detection as a classification problem, projection technique is a capable tool. For nonlinear dynamic systems, we propose to construct projection systems in the well-established framework of Hamiltonian systems and by means of the normalised image and kernel representations. For nonlinear dynamic systems, process data form a non-Euclidean space. Consequently, the norm-based distance defined in Hilbert space is not suitable to measure the distance from a data vector to the manifold of the nominal dynamics. To deal with this issue, we propose to use a Bregman divergence, a measure of difference between two points in a space, as a solution. Moreover, for our purpose of achieving a performance-oriented fault detection, the Bregman divergences adopted in our work are defined by Hamiltonian functions. This scheme not only enables performance-oriented fault detection, but also uncovers the information geometric aspect of our work.
The last part of our work is devoted to the kernel representation based fault detection and uncertainty estimation that can be equivalently used for fault estimation. It is demonstrated that the projection onto the manifold of uncertainty data, together with the correspondingly defined Bregman divergence, is also capable of fault detection.

Submitted 26 October, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

arXiv:2306.04286 [pdf, other]

A Mask Free Neural Network for Monaural Speech Enhancement

Authors: Liang Liu, Haixin Guan, Jinlong Ma, Wei Dai, Guangyong Wang, Shaowei Ding

Abstract: In speech enhancement, the lack of clear structural characteristics in the target speech phase requires the use of conservative and cumbersome network frameworks. It seems difficult to achieve competitive performance using direct methods and simple network architectures. However, we propose the MFNet, a direct and simple network that can not only map speech but also map reverse noise. This network is constructed by stacking global local former blocks (GLFBs), which combine the advantages of Mobileblock for global processing and Metaformer architecture for local interaction. Our experimental results demonstrate that our network using mapping method outperforms masking methods, and direct mapping of reverse noise is the optimal solution in strong noise environments. In a horizontal comparison on the 2020 Deep Noise Suppression (DNS) challenge test set without reverberation, to the best of our knowledge, MFNet is the current state-of-the-art (SOTA) mapping model.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2306.02020 [pdf, ps, other]

Replay Attack Detection Based on Parity Space Method for Cyber-Physical Systems

Authors: Dong Zhao, Yang Shi, Steven X. Ding, Yueyang Li, Fangzhou Fu

Abstract: The replay attack detection problem is studied from a new perspective based on parity space method in this paper. The proposed detection methods have the ability to distinguish system fault and replay attack, handle both input and output data replay, maintain certain control performance, and can be implemented conveniently and efficiently. First, the replay attack effect on the residual is derived and analyzed. The residual change induced by replay attack is characterized explicitly and the detection performance analysis based on two different test statistics is given. Second, based on the replay attack effect characterization, targeted passive and active designs for detection performance enhancement are proposed. Regarding the passive design, four optimization schemes regarding different cost functions are proposed with optimal parity matrix solutions, and the unified solution to the passive optimization schemes is obtained; the active design is enabled by a marginally stable filter so as to enlarge the replay attack effect on the residual for detection. Simulations and comparison studies are given to show the effectiveness of the proposed methods.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2305.16619 [pdf, other]

2-bit Conformer quantization for automatic speech recognition

Authors: Oleg Rybakov, Phoenix Meadowlark, Shaojin Ding, David Qiu, Jian Li, David Rim, Yanzhang He

Abstract: Large speech models are rapidly gaining traction in the research community. As a result, model compression has become an important topic, so that these models can fit in memory and be served with reduced cost. Practical approaches for compressing automatic speech recognition (ASR) models use int8 or int4 weight quantization. In this study, we propose to develop 2-bit ASR models. We explore the impact of symmetric and asymmetric quantization combined with sub-channel quantization and clipping on both the LibriSpeech dataset and large-scale training data. We obtain a lossless 2-bit Conformer model with 32% model size reduction when compared to the state-of-the-art 4-bit Conformer model for LibriSpeech. With the large-scale training data, we obtain a 2-bit Conformer model with over 40% model size reduction against the 4-bit version at the cost of 17% relative word error rate degradation.

Submitted 26 May, 2023; originally announced May 2023.

Comments: submitted to Interspeech

arXiv:2305.15536 [pdf, other]

RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

Authors: David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He

Abstract: With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique for decreasing the model size, memory access, and compute load of large models. Despite recent advances in quantization aware training (QAT) techniques, most papers present evaluations that are focused on computer vision tasks, which have different training dynamics compared to sequence tasks. In this paper, we first benchmark the impact of popular techniques such as straight through estimator, pseudo-quantization noise, learnable scale parameter, clipping, etc. on 4-bit seq2seq models across a suite of speech recognition datasets ranging from 1,000 hours to 1 million hours, as well as one machine translation dataset to illustrate its applicability outside of speech. Through the experiments, we report that noise based QAT suffers when there is insufficient regularization signal flowing back to the quantization scale. We propose low complexity changes to the QAT process to improve model accuracy (outperforming popular learnable scale and clipping methods). The improved accuracy opens up the possibility of exploiting some of the other benefits of noise based QAT: 1) training a single model that performs well in mixed precision mode and 2) improved generalization on long form speech recognition.

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.03997 [pdf, other]

Dual Degradation Representation for Joint Deraining and Low-Light Enhancement in the Dark

Authors: Xin Lin, Jingtong Yue, Sixian Ding, Chao Ren, Lu Qi, Ming-Hsuan Yang

Abstract: Rain in the dark poses a significant challenge to deploying real-world applications such as autonomous driving, surveillance systems, and night photography. Existing low-light enhancement or deraining methods struggle to brighten low-light conditions and remove rain simultaneously. Additionally, cascade approaches like "deraining followed by low-light enhancement" or the reverse often result in problematic rain patterns or overly blurred and overexposed images. To address these challenges, we introduce an end-to-end model called L$^{2}$RIRNet, designed to manage both low-light enhancement and deraining in real-world settings. Our model features two main components: a Dual Degradation Representation Network (DDR-Net) and a Restoration Network. The DDR-Net independently learns degradation representations for luminance effects in dark areas and rain patterns in light areas, employing dual degradation loss to guide the training process. The Restoration Network restores the degraded image using a Fourier Detail Guidance (FDG) module, which leverages near-rainless detailed images, focusing on texture details in frequency and spatial domains to inform the restoration process. Furthermore, we contribute a dataset containing both synthetic and real-world low-light-rainy images. Extensive experiments demonstrate that our L$^{2}$RIRNet performs favorably against existing methods in both synthetic and complex real-world scenarios. All the code and dataset can be found at https://github.com/linxin0/Low_light_rainy.

Submitted 17 June, 2024; v1 submitted 6 May, 2023; originally announced May 2023.

arXiv:2303.08343 [pdf, ps, other]

Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

Authors: Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw

Abstract: Continued improvements in machine learning techniques offer exciting new opportunities through the use of larger models and larger training datasets. However, there is a growing need to offer these new capabilities on-board low-powered devices such as smartphones, wearables and other embedded environments where only low memory is available. Towards this, we consider methods to reduce the model size of Conformer-based speech recognition models which typically require models with greater than 100M parameters down to just 5M parameters while minimizing impact on model quality. Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors. We propose model weight reuse at different levels within our model architecture: (i) repeating full conformer block layers, (ii) sharing specific conformer modules across layers, (iii) sharing sub-components per conformer module, and (iv) sharing decomposed sub-component weights after low-rank decomposition. By sharing weights at different levels of our model, we can retain the full model in-memory while increasing the number of virtual transformations applied to the input. Through a series of ablation studies and evaluations, we find that with weight sharing and a low-rank architecture, we can achieve a WER of 2.84 and 2.94 for Librispeech dev-clean and test-clean respectively with a 5M parameter model.

Submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted to IEEE ICASSP 2023

arXiv:2211.02718 [pdf, other]

SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

Authors: Siwen Ding, You Zhang, Zhiyao Duan

Abstract: Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set.

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2208.10933 [pdf]

doi 10.1021/acsnano.2c06432

Large-Scale Integrated Flexible Tactile Sensor Array for Sensitive Smart Robotic Touch

Authors: Zhenxuan Zhao, Jianshi Tang, Jian Yuan, Yijun Li, Yuan Dai, Jian Yao, Qingtian Zhang, Sanchuan Ding, Tingyu Li, Ruirui Zhang, Yu Zheng, Zhengyou Zhang, Song Qiu, Qingwen Li, Bin Gao, Ning Deng, He Qian, Fei Xing, Zheng You, Huaqiang Wu

Abstract: In the long pursuit of smart robotics, it has been envisioned to empower robots with human-like senses, especially vision and touch. While tremendous progress has been made in image sensors and computer vision over the past decades, the tactile sense abilities are lagging behind due to the lack of large-scale flexible tactile sensor array with high sensitivity, high spatial resolution, and fast response. In this work, we have demonstrated a 64x64 flexible tactile sensor array with a record-high spatial resolution of 0.9 mm (equivalently 28.2 pixels per inch), by integrating a high-performance piezoresistive film (PRF) with a large-area active matrix of carbon nanotube thin-film transistors. PRF with self-formed microstructures exhibited high pressure-sensitivity of ~385 kPa⁻¹ for MWCNTs concentration of 6%, while the 14% one exhibited fast response time of ~3 ms, good linearity, broad detection range beyond 1400 kPa, and excellent cyclability over 3000 cycles. Using this fully integrated tactile sensor array, the footprint maps of an artificial honeybee were clearly identified. Furthermore, we hardware-implemented a smart tactile system by integrating the PRF-based sensor array with a memristor-based computing-in-memory chip to record and recognize handwritten digits and Chinese calligraphy, achieving high classification accuracies of 98.8% and 97.3% in hardware, respectively. The integration of sensor networks with deep learning hardware may enable edge or near-sensor computing with significantly reduced power consumption and latency.
Our work could pave the road to building large-scale intelligent sensor networks for next-generation smart robotics.

Submitted 3 November, 2022; v1 submitted 23 August, 2022; originally announced August 2022.

Comments: Correction in Methods: The weight ratio of TPU:DMF was set to be 1:5

Journal ref: ACS Nano 2022, 16, 16784

arXiv:2208.06411 [pdf, ps, other]

SFF-DA: Sptialtemporal Feature Fusion for Detecting Anxiety Nonintrusively

Authors: Haimiao Mo, Yuchen Li, Shanlin Yang, Wei Zhang, Shuai Ding

Abstract: Early detection of anxiety is crucial for reducing the suffering of individuals with mental disorders and improving treatment outcomes. Utilizing an mHealth platform for anxiety screening can be particularly practical in improving screening efficiency and reducing costs. However, the effectiveness of existing methods has been hindered by differences in mobile devices used to capture subjects' physical and mental evaluations, as well as by the variability in data quality and small sample size problems encountered in real-world settings. To address these issues, we propose a framework with spatiotemporal feature fusion for detecting anxiety nonintrusively. We use a feature extraction network based on a 3D convolutional network and long short-term memory ("3DCNN+LSTM") to fuse the spatiotemporal features of facial behavior and noncontact physiology, which reduces the impact of uneven data quality. Additionally, we design a similarity assessment strategy to address the issue of deteriorating model accuracy due to small sample sizes. Our framework is validated with a crew dataset from the real world and two public datasets: the University of Burgundy Franche-Comté Psychophysiological (UBFC-Phys) dataset and the Smart Reasoning for Well-being at Home and at Work for Knowledge Work (SWELL-KW) dataset. The experimental results indicate that our framework outperforms the comparison methods.

Submitted 8 March, 2023; v1 submitted 11 August, 2022; originally announced August 2022.

arXiv:2208.01291 [pdf, other]

Control theoretically explainable application of autoencoder methods to fault detection in nonlinear dynamic systems

Authors: Linlin Li, Steven X. Ding, Ketian Liang, Zhiwen Chen, Ting Xue

Abstract: This paper is dedicated to control theoretically explainable application of autoencoders to optimal fault detection in nonlinear dynamic systems. Autoencoder-based learning is a standard machine learning method and widely applied for fault (anomaly) detection and classification. In the context of representation learning, the so-called latent (hidden) variable plays an important role towards an optimal fault detection. In ideal case, the latent variable should be a minimal sufficient statistic. The existing autoencoder-based fault detection schemes are mainly application-oriented, and few efforts have been devoted to optimal autoencoder-based fault detection and explainable applications. The main objective of our work is to establish a framework for learning autoencoder-based optimal fault detection in nonlinear dynamic systems. To this aim, a process model form for dynamic systems is firstly introduced with the aid of control theory, which also leads to a clear system interpretation of the latent variable. The major efforts are made on the development of a control theoretic solution to the optimal fault detection problem, in which an analog concept to minimal sufficient statistic, the so-called lossless information compression, is introduced and proven for dynamic systems and fault detection specifications. In particular, the existence conditions for such a latent variable are derived, based on which a loss function and further a learning algorithm are developed. This learning algorithm enables optimally training of autoencoders to achieve an optimal fault detection in nonlinear dynamic systems. A case study on three-tank system is given at the end of this paper to illustrate the capability of the proposed autoencoder-based fault detection and to explain the essential role of the latent variable in the proposed fault detection system.

Submitted 15 May, 2023; v1 submitted 2 August, 2022; originally announced August 2022.

arXiv:2204.08466 [pdf, other]

Robust PCA Unrolling Network for Super-resolution Vessel Extraction in X-ray Coronary Angiography

Authors: Binjie Qin, Haohao Mao, Yiming Liu, Jun Zhao, Yisong Lv, Yueqi Zhu, Song Ding, Xu Chen

Abstract: Although robust PCA has been increasingly adopted to extract vessels from X-ray coronary angiography (XCA) images, challenging problems such as inefficient vessel-sparsity modelling, noisy and dynamic background artefacts, and high computational cost still remain unsolved. Therefore, we propose a novel robust PCA unrolling network with sparse feature selection for super-resolution XCA vessel imaging. Being embedded within a patch-wise spatiotemporal super-resolution framework that is built upon a pooling layer and a convolutional long short-term memory network, the proposed network can not only gradually prune complex vessel-like artefacts and noisy backgrounds in XCA during network training but also iteratively learn and select the high-level spatiotemporal semantic information of moving contrast agents flowing in the XCA-imaged vessels. The experimental results show that the proposed method significantly outperforms state-of-the-art methods, especially in the imaging of the vessel network and its distal vessels, by restoring the intensity and geometry profiles of heterogeneous vessels against complex and dynamic backgrounds.

Submitted 23 April, 2022; v1 submitted 16 April, 2022; originally announced April 2022.

arXiv:2204.06164 [pdf, other]

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

Authors: Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

Abstract: In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separate decoders for each sub-model while sharing the encoders; 2) Use funnel-pooling to improve the encoder efficiency; 3) Balance the size of causal and non-causal encoders to improve quality and fit deployment constraints. Overall, the proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model. The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss, while substantially reducing the engineering efforts of having separate models.

Submitted 24 June, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: Accepted by INTERSPEECH 2022

arXiv:2204.03793 [pdf, other]

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Authors: Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw

Abstract: Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrated the state-of-the-art performance of our proposed method.

Submitted 24 June, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted by INTERSPEECH 2022

arXiv:2203.15952 [pdf, other]

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

Authors: Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Abstract: Reducing the latency and model size has always been a significant research problem for live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model quantization has become an increasingly popular approach to compress neural networks and reduce computation cost. Most of the existing practical ASR systems apply post-training 8-bit quantization. To achieve a higher compression rate without introducing additional performance regression, in this study, we propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference. We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique. First, we explored the impact of different precisions for both weight and activation quantization on the LibriSpeech dataset, and obtained a lossless 4-bit Conformer model with 5.8x size reduction compared to the float32 model. Following this, we for the first time investigated and revealed the viability of 4-bit quantization on a practical ASR system that is trained with large-scale datasets, and produced a lossless Conformer ASR model with mixed 4-bit and 8-bit weights that has 5x size reduction compared to the float32 model.

Submitted 2 March, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Published at INTERSPEECH 2022

arXiv:2202.08108 [pdf, other]

An alternative paradigm of fault diagnosis in dynamic systems: orthogonal projection-based methods

Authors: Steven X. Ding, Linlin Li, Tianyu Liu

Abstract: In this paper, we propose a new paradigm of fault diagnosis in dynamic systems as an alternative to the well-established observer-based framework. The basic idea behind this work is to (i) formulate fault detection and isolation as projection of measurement signals onto (system) subspaces in Hilbert space, and (ii) solve the resulting problems by means of projection methods with orthogonal projection operators and gap metric as major tools. In the new framework, fault diagnosis issues are uniformly addressed both in the model-based and data-driven fashions. Moreover, the design and implementation of the projection-based fault diagnosis systems, from residual generation to threshold setting, can be unifiedly handled. Thanks to the well-defined distance metric for projections in Hilbert subspaces, the projection-based fault diagnosis systems deliver optimal fault detectability. In particular, a new type of residual-driven thresholds is proposed, which significantly increases the fault detectability. In this work, various design schemes are proposed, including a basic projection-based fault detection scheme, fault detection schemes for feedback control systems, fault classification as well as two modified fault detection schemes. As a part of our study, relations to the existing observer-based fault detection systems are investigated, which showcases that, with comparable online computations, the proposed projection-based detection methods offer improved detection performance.

Submitted 7 May, 2023; v1 submitted 16 February, 2022; originally announced February 2022.

arXiv:2111.08185 [pdf, other]

Graph neural network-based fault diagnosis: a review

Authors: Zhiwen Chen, Jiamin Xu, Cesare Alippi, Steven X. Ding, Yuri Shardt, Tao Peng, Chunhua Yang

Abstract: Graph neural network (GNN)-based fault diagnosis (FD) has received increasing attention in recent years, due to the fact that data coming from several application domains can be advantageously represented as graphs. Indeed, this particular representation form has led to superior performance compared to traditional FD approaches. In this review, an easy introduction to GNN, potential applications to the field of fault diagnosis, and future perspectives are given. First, the paper reviews neural network-based FD methods by focusing on their data representations, namely, time-series, images, and graphs. Second, basic principles and principal architectures of GNN are introduced, with attention to graph convolutional networks, graph attention networks, graph sample and aggregate, graph auto-encoder, and spatial-temporal graph convolutional networks. Third, the most relevant fault diagnosis methods based on GNN are validated through the detailed experiments, and conclusions are made that the GNN-based methods can achieve good fault diagnosis performance. Finally, discussions and future challenges are provided.

Submitted 15 November, 2021; originally announced November 2021.

Comments: 17 pages, 18 figures, 10 tables

arXiv:2111.04566 [pdf, other]

doi 10.1145/3384419.3430735

RF-Net: a Unified Meta-learning Framework for RF-enabled One-shot Human Activity Recognition

Authors: Shuya Ding, Zhe Chen, Tianyue Zheng, Jun Luo

Abstract: Radio-Frequency (RF) based device-free Human Activity Recognition (HAR) rises as a promising solution for many applications. However, device-free (or contactless) sensing is often more sensitive to environment changes than device-based (or wearable) sensing. Also, RF datasets strictly require on-line labeling during collection, starkly different from image and text data collections where human interpretations can be leveraged to perform off-line labeling. Therefore, existing solutions to RF-HAR entail a laborious data collection process for adapting to new environments. To this end, we propose RF-Net as a meta-learning based approach to one-shot RF-HAR; it reduces the labeling efforts for environment adaptation to the minimum level. In particular, we first examine three representative RF sensing techniques and two major meta-learning approaches. The results motivate us to innovate in two designs: i) a dual-path base HAR network, where both time and frequency domains are dedicated to learning powerful RF features including spatial and attention-based temporal ones, and ii) a metric-based meta-learning framework to enhance the fast adaption capability of the base network, including an RF-specific metric module along with a residual classification module. We conduct extensive experiments based on all three RF sensing techniques in multiple real-world indoor environments; all results strongly demonstrate the efficacy of RF-Net compared with state-of-the-art baselines.

Submitted 28 October, 2021; originally announced November 2021.

Comments: 14 pages

Journal ref: SenSys '20: Proceedings of the 18th Conference on Embedded Networked Sensor Systems, November 2020

arXiv:2110.14849 [pdf, other]

doi 10.1109/MCOM.001.2000288

Enhancing RF Sensing with Deep Learning: A Layered Approach

Authors: Tianyue Zheng, Zhe Chen, Shuya Ding, Jun Luo

Abstract: In recent years, radio frequency (RF) sensing has gained increasing popularity due to its pervasiveness, low cost, non-intrusiveness, and privacy preservation. However, realizing the promises of RF sensing is highly nontrivial, given typical challenges such as multipath and interference. One potential solution leverages deep learning to build direct mappings from the RF domain to target domains, hence avoiding complex RF physical modeling. While earlier solutions exploit only simple feature extraction and classification modules, an emerging trend adds functional layers on top of elementary modules for more powerful generalizability and flexible applicability. To better understand this potential, this article takes a layered approach to summarize RF sensing enabled by deep learning. Essentially, we present a four-layer framework: physical, backbone, generalization, and application. While this layered framework provides readers a systematic methodology for designing deep interpreted RF sensing, it also facilitates making improvement proposals and hints at future research opportunities.

Submitted 27 October, 2021; originally announced October 2021.

Comments: 7 pages

Journal ref: IEEE Communications Magazine ( Volume: 59, Issue: 2, February 2021)

arXiv:2110.04482 [pdf, other]

Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis

Authors: Mu Yang, Shaojin Ding, Tianlong Chen, Tong Wang, Zhangyang Wang

Abstract: This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system, where each language was seen as an individual task and was learned sequentially and continually. It does not require pooled data from all languages altogether, and thus alleviates the storage and computation burden. One of the challenges of lifelong learning methods is "catastrophic forgetting": in TTS scenario it means that model performance quickly degrades on previous languages when adapted to a new language. We approach this problem via a data-replay-based lifelong learning method. We formulate the replay process as a supervised learning problem, and propose a simple yet effective dual-sampler framework to tackle the heavily language-imbalanced training samples. Through objective and subjective evaluations, we show that this supervised learning formulation outperforms other gradient-based and regularization-based lifelong learning methods, achieving 43% Mel-Cepstral Distortion reduction compared to a fine-tuning baseline.

Submitted 18 May, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

Comments: Accepted to ICASSP 2022. Camera-ready

arXiv:2108.04212 [pdf, other]

AutoVideo: An Automated Video Action Recognition System

Authors: Daochen Zha, Zaid Pervaiz Bhat, Yi-Wei Chen, Yicheng Wang, Sirui Ding, Jiaben Chen, Kwei-Herng Lai, Mohammad Qazim Bhat, Anmoll Kumar Jain, Alfredo Costilla Reyes, Na Zou, Xia Hu

Abstract: Action recognition is an important task for video understanding with broad applications. However, developing an effective action recognition solution often requires extensive engineering efforts in building and testing different combinations of the modules and their hyperparameters. In this demo, we present AutoVideo, a Python system for automated video action recognition. AutoVideo is featured for 1) highly modular and extendable infrastructure following the standard pipeline language, 2) an exhaustive list of primitives for pipeline construction, 3) data-driven tuners to save the efforts of pipeline tuning, and 4) easy-to-use Graphical User Interface (GUI). AutoVideo is released under MIT license at https://github.com/datamllab/autovideo

Submitted 16 July, 2022; v1 submitted 9 August, 2021; originally announced August 2021.

Comments: Accepted by IJCAI https://github.com/datamllab/autovideo

arXiv:2107.13353 [pdf, other]

Fast Wireless Sensor Anomaly Detection based on Data Stream in Edge Computing Enabled Smart Greenhouse

Authors: Yihong Yang, Sheng Ding, Yuwen Liu, Shunmei Meng, Xiaoxiao Chi, Rui Ma, Chao Yan

Abstract: Edge computing enabled smart greenhouse is a representative application of Internet of Things technology, which can monitor the environmental information in real time and employ the information to contribute to intelligent decision-making. In the process, anomaly detection for wireless sensor data plays an important role. However, traditional anomaly detection algorithms originally designed for anomaly detection in static data have not properly considered the inherent characteristics of data stream produced by wireless sensor such as infiniteness, correlations and concept drift, which may pose a considerable challenge on anomaly detection based on data stream, and lead to low detection accuracy and efficiency. First, data stream usually generates quickly which means that it is infinite and enormous, so any traditional off-line anomaly detection algorithm that attempts to store the whole dataset or to scan the dataset multiple times for anomaly detection will run out of memory space. Second, there exist correlations among different data streams, which traditional algorithms hardly consider. Third, the underlying data generation process or data distribution may change over time. Thus, traditional anomaly detection algorithms with no model update will lose their effects. Considering these issues, a novel method (called DLSHiForest) on basis of Locality-Sensitive Hashing and time window technique in this paper is proposed to solve these problems while achieving accurate and efficient detection. Comprehensive experiments are executed using real-world agricultural greenhouse dataset to demonstrate the feasibility of our approach. Experimental results show that our proposal is practicable in addressing challenges of traditional anomaly detection while ensuring accuracy and efficiency.

Submitted 28 July, 2021; originally announced July 2021.

Comments: 12 pages, 8 figures

arXiv:2106.00610 [pdf, other]

Deep Learning for Depression Recognition with Audiovisual Cues: A Review

Authors: Lang He, Mingyue Niu, Prayag Tiwari, Pekka Marttinen, Rui Su, Jiewei Jiang, Chenguang Guo, Hongyu Wang, Songtao Ding, Zhongmin Wang, Wei Dang, Xiaoying Pan

Abstract: With the acceleration of the pace of work and life, people have to face more and more pressure, which increases the possibility of suffering from depression. However, many patients may fail to get a timely diagnosis due to the serious imbalance in the doctor-patient ratio in the world. Promisingly, physiological and psychological studies have indicated some differences in speech and facial expression between patients with depression and healthy individuals. Consequently, to improve current medical care, many scholars have used deep learning to extract a representation of depression cues in audio and video for automatic depression detection. To sort out and summarize these works, this review introduces the databases and describes objective markers for automatic depression estimation (ADE). Furthermore, we review the deep learning methods for automatic depression detection to extract the representation of depression from audio and video. Finally, this paper discusses challenges and promising directions related to automatic diagnosing of depression using deep learning technologies.

Submitted 27 May, 2021; originally announced June 2021.

arXiv:2103.00210 [pdf, other]

Application of the unified control and detection framework to detecting stealthy integrity cyber-attacks on feedback control systems

Authors: Steven X. Ding, Linlin Li, Dong Zhao, Chris Louen, Tianyu Liu

Abstract: This draft addresses issues of detecting stealthy integrity cyber-attacks on automatic control systems in the unified control and detection framework. A general form of integrity cyber-attacks that cannot be detected using the well-established observer-based technique is first introduced as kernel attacks. The well-known replay, zero dynamics and covert attacks are special forms of the kernel attacks. Existence conditions for the kernel attacks are presented. It is demonstrated, in the unified framework of control and detection, that all kernel attacks can be structurally detected when not only the observer-based residual, but also the control signal based residual signals are generated and used for the detection purpose. Based on the analytical results, two schemes for detecting the kernel attacks are then proposed, which allow reliable attack detection without loss of control performance. While the first scheme is similar to the well-established moving target method and auxiliary system aided detection scheme, the second detector is realised with encrypted transmissions of control and monitoring signals in the feedback control system that prevents adversary to gain system knowledge by means of eavesdropping attacks. Both schemes are illustrated by examples of detecting replay, zero dynamics and covert attacks and an experimental study on a three-tank control system.

Submitted 4 June, 2021; v1 submitted 27 February, 2021; originally announced March 2021.

arXiv:2012.15427 [pdf, other]

Curriculum-based Deep Reinforcement Learning for Quantum Control

Authors: Hailan Ma, Daoyi Dong, Steven X. Ding, Chunlin Chen

Abstract: Deep reinforcement learning has been recognized as an efficient technique to design optimal strategies for different complex systems without prior knowledge of the control landscape. To achieve a fast and precise control for quantum systems, we propose a novel deep reinforcement learning approach by constructing a curriculum consisting of a set of intermediate tasks defined by a fidelity threshold. Tasks among a curriculum can be statically determined using empirical knowledge or adaptively generated with the learning process. By transferring knowledge between two successive tasks and sequencing tasks according to their difficulties, the proposed curriculum-based deep reinforcement learning (CDRL) method enables the agent to focus on easy tasks in the early stage, then move onto difficult tasks, and eventually approaches the final task. Numerical simulations on closed quantum systems and open quantum systems demonstrate that the proposed method exhibits improved control performance for quantum systems and also provides an efficient way to identify optimal strategies with fewer control pulses.

Submitted 2 January, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

arXiv:2011.08482 [pdf]

Collaborative Three-Tier Architecture Non-contact Respiratory Rate Monitoring using Target Tracking and False Peaks Eliminating Algorithms

Authors: Haimiao Mo, Shuai Ding, Shanlin Yang, Athanasios V. Vasilakos, Xi Zheng

Abstract: Monitoring the respiratory rate is crucial for helping us identify respiratory disorders. Devices for conventional respiratory monitoring are inconvenient and scarcely available. Recent research has demonstrated the ability of non-contact technologies, such as photoplethysmography and infrared thermography, to gather respiratory signals from the face and monitor breathing. However, the current non-contact respiratory monitoring techniques have poor accuracy because they are sensitive to environmental influences like lighting and motion artifacts. Furthermore, frequent contact between users and the cloud in real-world medical application settings might cause service request delays and potentially the loss of personal data. We proposed a non-contact respiratory rate monitoring system with a cooperative three-layer design to increase the precision of respiratory monitoring and decrease data transmission latency. To reduce data transmission and network latency, our three-tier architecture layer-by-layer decomposes the computing tasks of respiration monitoring. Moreover, we improved the accuracy of respiratory monitoring by designing a target tracking algorithm and an algorithm for eliminating false peaks to extract high-quality respiratory signals. By gathering the data and choosing several regions of interest on the face, we were able to extract the respiration signal and investigate how different regions affected the monitoring of respiration. The results of the experiment indicate that when the nasal region is used to extract the respiratory signal, it performs experimentally best. Our approach performs better than rival approaches while transferring fewer data.

Submitted 26 July, 2022; v1 submitted 17 November, 2020; originally announced November 2020.

arXiv:2009.03575 [pdf, ps, other]

NC-MOPSO: Network centrality guided multi-objective particle swarm optimization for transport optimization on networks

Authors: Jiexin Wu, Cunlai Pu, Shuxin Ding, Guo Cao, Panos M. Pardalos

Abstract: Transport processes are universal in real-world complex networks, such as communication and transportation networks. As traffic in these complex networks increases, problems like traffic congestion and transport delay are becoming more and more serious, which calls for a systematic optimization of these networks. In this paper, we formulate a multi-objective optimization problem (MOP) to deal with the enhancement of network capacity and efficiency simultaneously, by appropriately adjusting the weights of edges in networks. To solve this problem, we provide a multi-objective evolutionary algorithm (MOEA) based on particle swarm optimization (PSO), namely network centrality guided multi-objective PSO (NC-MOPSO). Specifically, in the framework of PSO, we propose a hybrid population initialization mechanism and a local search strategy by employing the network centrality theory to enhance the quality of initial solutions and strengthen the exploration of the search space, respectively. Simulation experiments performed on network models and real networks show that our algorithm has better performance than four state-of-the-art alternatives on several of the most commonly used metrics.

Submitted 27 July, 2021; v1 submitted 8 September, 2020; originally announced September 2020.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2008.06006 [pdf, other]

Textual Echo Cancellation

Authors: Shaojin Ding, Ye Jia, Ke Hu, Quan Wang

Abstract: In this paper, we propose Textual Echo Cancellation (TEC) - a framework for cancelling the text-to-speech (TTS) playback echo from overlapping speech recordings. Such a system can largely improve speech recognition performance and user experience for intelligent devices such as smart speakers, as the user can talk to the device while the device is still playing the TTS signal responding to the previous query. We implement this system by using a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and source text of the TTS playback as inputs, and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to enhancement performance. Besides, the text sequence is much smaller in size compared with the raw acoustic signal of the TTS playback, and can be immediately transmitted to the device or ASR server even before the playback is synthesized. Therefore, our proposed approach effectively reduces Internet communication and latency compared with alternative approaches such as acoustic echo cancellation (AEC).

Submitted 16 September, 2021; v1 submitted 13 August, 2020; originally announced August 2020.

arXiv:2005.03215 [pdf, other]

AutoSpeech: Neural Architecture Search for Speaker Recognition

Authors: Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang

Abstract: Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be a natural fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. The final speaker recognition model can be obtained by training the derived CNN model through the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the derived CNN architectures from the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.

Submitted 31 August, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

arXiv:2003.09609 [pdf, ps, other]

Fault-tolerant Coherent H-infinity Control for Linear Quantum Systems

Authors: Yanan Liu, Daoyi Dong, Ian R. Petersen, Qing Gao, Steven X. Ding, Shota Yokoyama, Hidehiro Yonezawa

Abstract: Robustness and reliability are two key requirements for developing practical quantum control systems. The purpose of this paper is to design a coherent feedback controller for a class of linear quantum systems suffering from Markovian jumping faults so that the closed-loop quantum system has both fault tolerance and H-infinity disturbance attenuation performance. This paper first extends the physical realization conditions from the time-invariant case to the time-varying case for linear stochastic quantum systems. By relating the fault tolerant H-infinity control problem to the dissipation properties and the solutions of Riccati differential equations, an H-infinity controller for the quantum system is then designed by solving a set of linear matrix inequalities (LMIs). In particular, an algorithm is employed to introduce additional noises and to construct the corresponding input matrices to ensure the physical realizability of the quantum controller. For real applications of the developed fault-tolerant control strategy, we present a linear quantum system example from quantum optics, where the amplitude of the pumping field randomly jumps among different values. It is demonstrated that a quantum H-infinity controller can be designed and implemented using some basic optical components to achieve the desired control goal.

Submitted 21 March, 2020; originally announced March 2020.

Comments: 12 pages, 3 figures

arXiv:2002.01607 [pdf, other]

Anomaly Detection by One Class Latent Regularized Networks

Authors: Chengwei Chen, Pan Chen, Haichuan Song, Yiqing Tao, Yuan Xie, Shouhong Ding, Lizhuang Ma

Abstract: Anomaly detection is a fundamental problem in computer vision with many real-world applications. Given a wide range of images belonging to the normal class, emerging from some distribution, the objective of this task is to construct a model to detect out-of-distribution images belonging to abnormal instances. Semi-supervised Generative Adversarial Network (GAN)-based methods have recently been gaining popularity in anomaly detection tasks. However, the training process of GANs is still unstable and challenging. To solve these issues, a novel adversarial dual autoencoder network is proposed, in which the underlying structure of training data is not only captured in latent feature space, but also can be further restricted in the space of latent representation in a discriminant manner, leading to a more accurate detector. In addition, the auxiliary autoencoder regarded as a discriminator could obtain a more stable training process. Experiments show that our model achieves state-of-the-art results on the MNIST and CIFAR10 datasets as well as the GTSRB stop signs dataset.

Submitted 14 July, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

arXiv:1912.10753 [pdf, other]

Heterogeneous Hegselmann-Krause Dynamics with Environment and Communication Noise

Authors: Ge Chen, Wei Su, Songyuan Ding, Yiguang Hong

Abstract: The Hegselmann-Krause (HK) model is a well-known opinion dynamics model, attracting a significant amount of interest from a number of fields. However, the heterogeneous HK model is difficult to analyze - even the most basic property of convergence is still open to prove. For the first time, this paper takes into consideration heterogeneous HK models with environment or communication noise. Under environment noise, it has been revealed that the heterogeneous HK model with or without global information has a phase transition for the upper limit of the maximum opinion difference, and has a critical noise amplitude depending on the minimal confidence threshold for quasi-synchronization. In addition, the convergence time to the quasi-synchronization is bounded by a negative exponential distribution. The heterogeneous HK model with global information and communication noise is also analyzed. Finally, for the basic HK model with communication noise, we show that the heterogeneous case exhibits a different behavior regarding quasi-synchronization from the homogeneous case. Interestingly, raising the confidence thresholds of constituent agents may break quasi-synchronization. Our results reveal that the heterogeneity of individuals is harmful to synchronization, which may be the reason why the synchronization of opinions is hard to reach in reality, even within a small group.

Submitted 23 December, 2019; originally announced December 2019.

arXiv:1909.08723 [pdf, other]

Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

Authors: Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, Sanjeev Khudanpur

Abstract: We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g. ESPnet).

Submitted 14 October, 2019; v1 submitted 18 September, 2019; originally announced September 2019.

Comments: Accepted to ASRU 2019

arXiv:1908.04284 [pdf, other]

Personal VAD: Speaker-Conditioned Voice Activity Detection

Authors: Shaojin Ding, Quan Wang, Shuo-yiin Chang, Li Wan, Ignacio Lopez Moreno

Abstract: In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.

Submitted 8 April, 2020; v1 submitted 12 August, 2019; originally announced August 2019.

Comments: Speaker Odyssey 2020

arXiv:1908.03735 [pdf]

Automatic acute ischemic stroke lesion segmentation using semi-supervised learning

Authors: Bin Zhao, Shuxue Ding, Hong Wu, Guohua Liu, Chen Cao, Song Jin, Zhiyang Liu

Abstract: Ischemic stroke is a common disease in the elderly population, which can cause long-term disability and even death. However, the time window for treatment of ischemic stroke in its acute stage is very short. To quickly localize and quantitatively evaluate acute ischemic stroke (AIS) lesions, many deep-learning-based lesion segmentation methods have been proposed in the literature, where a deep convolutional neural network (CNN) was trained on hundreds of fully labeled subjects with accurate annotations of AIS lesions. Although high segmentation accuracy can be achieved, the accurate labels must be annotated by experienced clinicians, and it is therefore very time-consuming to obtain a large number of fully labeled subjects. In this paper, we propose a semi-supervised method to automatically segment AIS lesions in diffusion weighted images (DWIs) and apparent diffusion coefficient maps. By using a large number of weakly labeled subjects and a small number of fully labeled subjects, our proposed method is able to accurately detect and segment the AIS lesions. In particular, our proposed method consists of three parts: 1) a double-path classification net (DPC-Net) trained in a weakly-supervised way is used to detect the suspicious regions of AIS lesions; 2) a pixel-level K-Means clustering algorithm is used to identify the hyperintensive regions on the DWIs; and 3) a region-growing algorithm combines the outputs of the DPC-Net and the K-Means to obtain the final precise lesion segmentation.
In our experiment, we use 460 weakly labeled subjects and 15 fully labeled subjects to train and fine-tune the proposed method. By evaluating on a clinical dataset with 150 fully labeled subjects, our proposed method achieves a mean dice coefficient of 0.642, and a lesion-wise F1 score of 0.822.

Submitted 20 September, 2020; v1 submitted 10 August, 2019; originally announced August 2019.

arXiv:1803.10299 [pdf, other]

Multi-Modal Data Augmentation for End-to-End ASR

Authors: Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

Abstract: We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input and enables seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements on character error rate (CER), and as much as 7-10% relative word error rate (WER) improvement over a baseline both with and without an external language model.

Submitted 18 June, 2018; v1 submitted 27 March, 2018; originally announced March 2018.

Comments: 5 Pages, 1 Figure, accepted at INTERSPEECH 2018

Showing 1–42 of 42 results for author: Ding, S