Search | arXiv e-print repository

Target conversation extraction: Source separation using turn-taking dynamics

Authors: Tuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota

Abstract: Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in h… ▽ More Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in human conversations, particularly turn-taking dynamics, which uniquely characterize speakers engaged in conversation and distinguish them from interfering speakers and noise. Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets. In the presence of interfering speakers, our results show an 8.19 dBでしべる improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dBでしべる improvement for 2-4-speaker conversations. Code, dataset available at https://github.com/chentuochao/Target-Conversation-Extraction. △ Less

Submitted 29 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.18373 [pdf, other]

Dynamic Data Pruning for Automatic Speech Recognition

Authors: Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

Abstract: The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works… ▽ More The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. Furthermore, we introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers several fine-grained pruning granularities specifically tailored for speech-related datasets, going beyond the conventional pruning of entire time sequences. Our intensive experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2406.13357 [pdf, other]

Transferable speech-to-text large language model alignment module

Authors: Boyong Wu, Chao Yan, Haoran Pu

Abstract: By leveraging the power of Large Language Models(LLMs) and speech foundation models, state of the art speech-text bimodal works can achieve challenging tasks like spoken translation(ST) and question answering(SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achiev… ▽ More By leveraging the power of Large Language Models(LLMs) and speech foundation models, state of the art speech-text bimodal works can achieve challenging tasks like spoken translation(ST) and question answering(SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with one layer module and hundred hours of speech-text multitask corpus. We further swap the Yi-6B with human preferences aligned version of Yi-6B-Chat during inference, and discover that the alignment capability is applicable as well. In addition, the alignment subspace revealed by singular value decomposition(SVD) also implies linear alignment subspace is sparse, which leaves the possibility to concatenate other features like voice-print or video to expand modality. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted by InterSpeech 2024; 5 pages, 2 figures

arXiv:2405.20617 [pdf, other]

Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

Authors: Yuning Zhang, Thomas Choi, Zihang Cheng, Issei Kanno, Masaaki Ito, Jorge Gomez-Ponce, Hussein Hammoud, Bowei Wu, Ashwani Pradhan, Kelvin Arana, Pramod Krishna, Tianyi Yang, Tyler Chen, Ishita Vasishtha, Haoyu Xie, Linyu Sun, Andreas F. Molisch

Abstract: The design of cell-free massive MIMO (CF-mMIMO) systems requires accurate, measurement-based channel models. This paper provides the first results from the by far most extensive outdoor measurement campaign for CF-mMIMO channels in an urban environment. We measured impulse responses between over 20,000 potential access point (AP) locations and 80 user equipments (UEs) at 3.5 GHz with 350 MHz bandw… ▽ More The design of cell-free massive MIMO (CF-mMIMO) systems requires accurate, measurement-based channel models. This paper provides the first results from the by far most extensive outdoor measurement campaign for CF-mMIMO channels in an urban environment. We measured impulse responses between over 20,000 potential access point (AP) locations and 80 user equipments (UEs) at 3.5 GHz with 350 MHz bandwidth (BW). Measurements use a "virtual array" approach at the AP and a hybrid switched/virtual approach at the UE. This paper describes the sounder design, measurement environment, data processing, and sample results, particularly the evolution of the power-delay profiles (PDPs) as a function of the AP locations, and its relation to the propagation environment. △ Less

Submitted 6 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

Comments: Submitted to: VTC 2024-Fall

arXiv:2404.16484 [pdf, other]

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024, AI for Streaming (AIS) Workshop

arXiv:2311.05929 [pdf, other]

Efficient Segmentation with Texture in Ore Images Based on Box-supervised Approach

Authors: Guodong Sun, Delong Huang, Yuting Peng, Le Cheng, Bo Wu, Yang Zhang

Abstract: Image segmentation methods have been utilized to determine the particle size distribution of crushed ores. Due to the complex working environment, high-powered computing equipment is difficult to deploy. At the same time, the ore distribution is stacked, and it is difficult to identify the complete features. To address this issue, an effective box-supervised technique with texture features is prov… ▽ More Image segmentation methods have been utilized to determine the particle size distribution of crushed ores. Due to the complex working environment, high-powered computing equipment is difficult to deploy. At the same time, the ore distribution is stacked, and it is difficult to identify the complete features. To address this issue, an effective box-supervised technique with texture features is provided for ore image segmentation that can identify complete and independent ores. Firstly, a ghost feature pyramid network (Ghost-FPN) is proposed to process the features obtained from the backbone to reduce redundant semantic information and computation generated by complex networks. Then, an optimized detection head is proposed to obtain the feature to maintain accuracy. Finally, Lab color space (Lab) and local binary patterns (LBP) texture features are combined to form a fusion feature similarity-based loss function to improve accuracy while incurring no loss. Experiments on MS COCO have shown that the proposed fusion features are also worth studying on other types of datasets. Extensive experimental results demonstrate the effectiveness of the proposed method, which achieves over 50 frames per second with a small model size of 21.6 MB. Meanwhile, the method maintains a high level of accuracy compared with the state-of-the-art approaches on ore image dataset. The source code is available at \url{https://github.com/MVME-HBUT/OREINST}. △ Less

Submitted 10 November, 2023; originally announced November 2023.

Comments: 14 pages, 8 figures

arXiv:2306.02894 [pdf, ps, other]

Recyclable Semi-supervised Method Based on Multi-model Ensemble for Video Scene Parsing

Authors: Biao Wu, Shaoli Liu, Diankai Zhang, Chengjian Zheng, Si Gao, Xiaofeng Zhang, Ning Wang

Abstract: Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Since the real-world is actually video-based rather than a static state, learning to perform video semantic segmentation is more reasonable and practical for realistic applications. In this paper, we adopt Mask2Former… ▽ More Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Since the real-world is actually video-based rather than a static state, learning to perform video semantic segmentation is more reasonable and practical for realistic applications. In this paper, we adopt Mask2Former as architecture and ViT-Adapter as backbone. Then, we propose a recyclable semi-supervised training method based on multi-model ensemble. Our method achieves the mIoU scores of 62.97% and 65.83% on Development test and final test respectively. Finally, we obtain the 2nd place in the Video Scene Parsing in the Wild Challenge at CVPR 2023. △ Less

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.01183 [pdf, other]

Faster OreFSDet : A Lightweight and Effective Few-shot Object Detector for Ore Images

Authors: Yang Zhang, Le Cheng, Yuting Peng, Chengming Xu, Yanwei Fu, Bo Wu, Guodong Sun

Abstract: For the ore particle size detection, obtaining a sizable amount of high-quality ore labeled data is time-consuming and expensive. General object detection methods often suffer from severe over-fitting with scarce labeled data. Despite their ability to eliminate over-fitting, existing few-shot object detectors encounter drawbacks such as slow detection speed and high memory requirements, making the… ▽ More For the ore particle size detection, obtaining a sizable amount of high-quality ore labeled data is time-consuming and expensive. General object detection methods often suffer from severe over-fitting with scarce labeled data. Despite their ability to eliminate over-fitting, existing few-shot object detectors encounter drawbacks such as slow detection speed and high memory requirements, making them difficult to implement in a real-world deployment scenario. To this end, we propose a lightweight and effective few-shot detector to achieve competitive performance with general object detection with only a few samples for ore images. First, the proposed support feature mining block characterizes the importance of location information in support features. Next, the relationship guidance block makes full use of support features to guide the generation of accurate candidate proposals. Finally, the dual-scale semantic aggregation module retrieves detailed features at different resolutions to contribute with the prediction process. Experimental results show that our method consistently exceeds the few-shot detectors with an excellent performance gap on all metrics. Moreover, our method achieves the smallest model size of 19MB as well as being competitive at 50 FPS detection speed compared with general object detectors. The source code is available at https://github.com/MVME-HBUT/Faster-OreFSDet. △ Less

Submitted 1 May, 2023; originally announced May 2023.

Comments: 18 pages, 11 figures

arXiv:2212.13523 [pdf, other]

doi 10.1109/TGRS.2023.3268554

S2S-WTV: Seismic Data Noise Attenuation Using Weighted Total Variation Regularized Self-Supervised Learning

Authors: Zitai Xu, Yisi Luo, Bangyu Wu, Deyu Meng

Abstract: Seismic data often undergoes severe noise due to environmental factors, which seriously affects subsequent applications. Traditional hand-crafted denoisers such as filters and regularizations utilize interpretable domain knowledge to design generalizable denoising techniques, while their representation capacities may be inferior to deep learning denoisers, which can learn complex and representativ… ▽ More Seismic data often undergoes severe noise due to environmental factors, which seriously affects subsequent applications. Traditional hand-crafted denoisers such as filters and regularizations utilize interpretable domain knowledge to design generalizable denoising techniques, while their representation capacities may be inferior to deep learning denoisers, which can learn complex and representative denoising mappings from abundant training pairs. However, due to the scarcity of high-quality training pairs, deep learning denoisers may sustain some generalization issues over various scenarios. In this work, we propose a self-supervised method that combines the capacities of deep denoiser and the generalization abilities of hand-crafted regularization for seismic data random noise attenuation. Specifically, we leverage the Self2Self (S2S) learning framework with a trace-wise masking strategy for seismic data denoising by solely using the observed noisy data. Parallelly, we suggest the weighted total variation (WTV) to further capture the horizontal local smooth structure of seismic data. Our method, dubbed as S2S-WTV, enjoys both high representation abilities brought from the self-supervised deep network and good generalization abilities of the hand-crafted WTV regularizer and the self-supervised nature. Therefore, our method can more effectively and stably remove the random noise and preserve the details and edges of the clean signal. To tackle the S2S-WTV optimization model, we introduce an alternating direction multiplier method (ADMM)-based algorithm. Extensive experiments on synthetic and field noisy seismic data demonstrate the effectiveness of our method as compared with state-of-the-art traditional and deep learning-based seismic data denoising methods. △ Less

Submitted 27 December, 2022; originally announced December 2022.

Journal ref: TGRS 2023

arXiv:2211.14522 [pdf, other]

Visual Fault Detection of Multi-scale Key Components in Freight Trains

Authors: Yang Zhang, Yang Zhou, Huilin Pan, Bo Wu, Guodong Sun

Abstract: Fault detection for key components in the braking system of freight trains is critical for ensuring railway transportation safety. Despite the frequently employed methods based on deep learning, these fault detectors are highly reliant on hardware resources and are complex to implement. In addition, no train fault detectors consider the drop in accuracy induced by scale variation of fault parts. T… ▽ More Fault detection for key components in the braking system of freight trains is critical for ensuring railway transportation safety. Despite the frequently employed methods based on deep learning, these fault detectors are highly reliant on hardware resources and are complex to implement. In addition, no train fault detectors consider the drop in accuracy induced by scale variation of fault parts. This paper proposes a lightweight anchor-free framework to solve the above problems. Specifically, to reduce the amount of computation and model size, we introduce a lightweight backbone and adopt an anchor-free method for localization and regression. To improve detection accuracy for multi-scale parts, we design a feature pyramid network to generate rectangular layers of different sizes to map parts with similar aspect ratios. Experiments on four fault datasets show that our framework achieves 98.44% accuracy while the model size is only 22.5 MB, outperforming state-of-the-art detectors. △ Less

Submitted 26 November, 2022; originally announced November 2022.

Comments: 9 pages, 4 figures

arXiv:2211.05256 [pdf, other]

Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Cheng-Ming Chiang, Hsien-Kai Kuo, Yu-Syuan Xu, Man-Yu Lee, Allen Lu, Chia-Ming Cheng, Chih-Cheng Chen, Jia-Ying Yong, Hong-Han Shuai, Wen-Huang Cheng, Zhuang Jia, Tianyu Xu, Yijian Zhang, Long Bao, Heng Sun, Diankai Zhang, Si Gao, Shaoli Liu, Biao Wu, Xiaofeng Zhang, Chengjian Zheng, Kaidi Lu, Ning Wang , et al. (29 additional authors not shown)

Abstract: Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this prob… ▽ More Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2105.08826, arXiv:2105.07809, arXiv:2211.04470, arXiv:2211.03885

arXiv:2205.12458 [pdf, other]

doi 10.1109/TIM.2022.3176901

A Lightweight NMS-free Framework for Real-time Visual Fault Detection System of Freight Trains

Authors: Guodong Sun, Yang Zhou, Huilin Pan, Bo Wu, Ye Hu, Yang Zhang

Abstract: Real-time vision-based system of fault detection (RVBS-FD) for freight trains is an essential part of ensuring railway transportation safety. Most existing vision-based methods still have high computational costs based on convolutional neural networks. The computational cost is mainly reflected in the backbone, neck, and post-processing, i.e., non-maximum suppression (NMS). In this paper, we propo… ▽ More Real-time vision-based system of fault detection (RVBS-FD) for freight trains is an essential part of ensuring railway transportation safety. Most existing vision-based methods still have high computational costs based on convolutional neural networks. The computational cost is mainly reflected in the backbone, neck, and post-processing, i.e., non-maximum suppression (NMS). In this paper, we propose a lightweight NMS-free framework to achieve real-time detection and high accuracy simultaneously. First, we use a lightweight backbone for feature extraction and design a fault detection pyramid to process features. This fault detection pyramid includes three novel individual modules using attention mechanism, bottleneck, and dilated convolution for feature enhancement and computation reduction. Instead of using NMS, we calculate different loss functions, including classification and location costs in the detection head, to further reduce computation. Experimental results show that our framework achieves over 83 frames per second speed with a smaller model size and higher accuracy than the state-of-the-art detectors. Meanwhile, the hardware resource requirements of our method are low during the training and testing process. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: 11 pages, 5 figures, accepted by IEEE Transactions on Instrumentation and Measurement

arXiv:2203.16090 [pdf, other]

doi 10.1109/LCSYS.2022.3186236

A simple suboptimal moving horizon estimation scheme with guaranteed robust stability

Authors: Julian D. Schiller, Boyang Wu, Matthias A. Müller

Abstract: We propose a suboptimal moving horizon estimation (MHE) scheme for a general class of nonlinear systems. To this end, we consider an MHE formulation that optimizes over the trajectory of a robustly stable observer. Assuming that the observer admits a Lyapunov function, we show that this function is an M-step Lyapunov function for suboptimal MHE. The presented sufficient conditions can be easily ve… ▽ More We propose a suboptimal moving horizon estimation (MHE) scheme for a general class of nonlinear systems. To this end, we consider an MHE formulation that optimizes over the trajectory of a robustly stable observer. Assuming that the observer admits a Lyapunov function, we show that this function is an M-step Lyapunov function for suboptimal MHE. The presented sufficient conditions can be easily verified in practice. We illustrate the practicability of the proposed suboptimal MHE scheme with a standard nonlinear benchmark example. Here, performing a single iteration is sufficient to significantly improve the observer's estimation results under valid theoretical guarantees. △ Less

Submitted 15 July, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

Journal ref: IEEE Control Systems Letters, vol. 7, pp. 19-24, 2023

arXiv:2110.12224 [pdf, other]

Generalized Polarization Transform: A Novel Coded Transmission Paradigm

Authors: Bolin Wu, Jincheng Dai, Kai Niu, Zhongwei Si, Ping Zhang, Sen Wang, Yifei Yuan, Chih-Lin I

Abstract: For the upcoming 6G wireless networks, a new wave of applications and services will demand ultra-high data rates and reliability. To this end, future wireless systems are expected to pave the way for entirely new fundamental air interface technologies to attain a breakthrough in spectrum efficiency (SE). This article discusses a new paradigm, named generalized polarization transform (GPT), to achi… ▽ More For the upcoming 6G wireless networks, a new wave of applications and services will demand ultra-high data rates and reliability. To this end, future wireless systems are expected to pave the way for entirely new fundamental air interface technologies to attain a breakthrough in spectrum efficiency (SE). This article discusses a new paradigm, named generalized polarization transform (GPT), to achieve an integrated design of coding, modulation, multi-antenna, multiple access, etc., in a real sense. The GPT enabled air interface develops far-reaching insights that the joint optimization of critical air interface ingredients can achieve remarkable gains on SE compared with the state-of-the-art module-stacking design. △ Less

Submitted 27 April, 2022; v1 submitted 23 October, 2021; originally announced October 2021.

arXiv:2109.06715 [pdf, other]

doi 10.1109/MNET.001.2100266

IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems

Authors: David Pujol-Perich, José Suárez-Varela, Miquel Ferriol, Shihan Xiao, Bo Wu, Albert Cabellos-Aparicio, Pere Barlet-Ros

Abstract: Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems). In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing). The main novelty of GNNs is their ability to generalize to other networks uns… ▽ More Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems). In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing). The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking. However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming. This poses an important barrier to network engineers that often do not have the necessary ML expertise. In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems. IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures. To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases. Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow. △ Less

Submitted 2 February, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

Journal ref: IEEE Network, vol. 35, no. 6, pp. 171-177, 2021

arXiv:2107.14406 [pdf]

doi 10.1061/JTEPBS.TEENG-7699

Optimal Variable Speed Limit Control Strategy on Freeway Segments under Fog Conditions

Authors: Ben Zhai, Yanli Wang, Wenxuan Wang, Bing Wu

Abstract: Fog is a critical external factor that threatens traffic safety on freeways. Variable speed limit (VSL) control can effectively harmonize vehicle speed and improve safety. However, most existing weather-related VSL controllers are limited to adapt to the dynamic traffic environment. This study developed optimal VSL control strategy under fog conditions with fully consideration of factors that affe… ▽ More Fog is a critical external factor that threatens traffic safety on freeways. Variable speed limit (VSL) control can effectively harmonize vehicle speed and improve safety. However, most existing weather-related VSL controllers are limited to adapt to the dynamic traffic environment. This study developed optimal VSL control strategy under fog conditions with fully consideration of factors that affect traffic safety risks. The crash risk under fog conditions was estimated using a crash risk prediction model based on Bayesian logistic regression. The traffic flow with VSL control was simulated by a modified cell transmission model (MCTM). The optimal factors of VSL control were obtained by solving an optimization problem that coordinated safety and mobility with the help of the genetic algorithm. An example of I-405 in California, USA was designed to simulate and evaluate the effects of the proposed VSL control strategy. The optimal VSL control factors under fog conditions were compared with sunny conditions, and different placements of VSL signs were evaluated. Results showed that the optimal VSL control strategy under fog conditions changed the speed limit more cautiously. The VSL control under fog conditions in this study effectively reduced crash risks without significantly increasing travel time, which is up to 37.15% reduction of risks and only 0.48% increase of total travel time. The proposed VSL control strategy is expected to be of great use in the development of VSL systems to enhance freeway safety under fog conditions. △ Less

Submitted 29 July, 2021; originally announced July 2021.

arXiv:2107.14137 [pdf]

Radio Frequency Interference Management with Free-Space Optical Communication and Photonic Signal Processing

Authors: Yang Qi, Ben Wu

Abstract: We design and experimentally demonstrate a radio frequency interference management system with free-space optical communication and photonic signal processing. The system provides real-time interference cancellation in 6 GHz wide bandwidth. We design and experimentally demonstrate a radio frequency interference management system with free-space optical communication and photonic signal processing. The system provides real-time interference cancellation in 6 GHz wide bandwidth. △ Less

Submitted 25 July, 2021; originally announced July 2021.

Comments: Frontier in Optics 2021

arXiv:2107.14134 [pdf]

Photonic Interference Cancellation with Hybrid Free Space Optical Communication and MIMO Receiver

Authors: Taichu Shi, Yang Qi, Ben Wu

Abstract: We proposed and demonstrated a hybrid blind source separation system which can switch between multiple-input and multi-output mode and free space optical communication mode depends on different situation to get best condition for separation. We proposed and demonstrated a hybrid blind source separation system which can switch between multiple-input and multi-output mode and free space optical communication mode depends on different situation to get best condition for separation. △ Less

Submitted 25 July, 2021; originally announced July 2021.

Comments: Frontier in Optics 2021

arXiv:2107.14133 [pdf]

doi 10.1364/OE.435282

Sub-Nyquist Sampling with Optical Pulses for Photonic Blind Source Separation

Authors: Taichu Shi, Yang Qi, Weipeng Zhang, Paul Prucnal, Ben Wu

Abstract: We proposed and demonstrated an optical pulse sampling method for photonic blind source separation. It can separate large bandwidth of mixed signals by small sampling frequency, which can reduce the workload of digital signal processing. We proposed and demonstrated an optical pulse sampling method for photonic blind source separation. It can separate large bandwidth of mixed signals by small sampling frequency, which can reduce the workload of digital signal processing. △ Less

Submitted 25 July, 2021; originally announced July 2021.

Comments: Frontier in Optics

arXiv:2107.10415 [pdf]

Wideband photonic interference cancellation based on free space optical communication

Authors: Yang Qi, Ben Wu

Abstract: We propose and experimentally demonstrate an interference management system that removes wideband wireless interference by using photonic signal processing and free space optical communication. The receiver separates radio frequency interferences by upconverting the mixed signals to optical frequencies and processing the signals with the photonic circuits. Signals with GHz bandwidth are processed… ▽ More We propose and experimentally demonstrate an interference management system that removes wideband wireless interference by using photonic signal processing and free space optical communication. The receiver separates radio frequency interferences by upconverting the mixed signals to optical frequencies and processing the signals with the photonic circuits. Signals with GHz bandwidth are processed and separated in real-time. The reference signals for interference cancellation are transmitted in a free space optical communication link, which provides large bandwidth for multi-band operation and accelerates the mixed signal separation process by reducing the dimensions of the un-known mixing matrix. Experimental results show that the system achieves 30dBでしべる real-time cancellation depth with over 6GHz bandwidth. Multiple radio frequency bands can be processed at the same time with a single system. In addition, multiple radio frequency bands can be processed at the same time with a single system. △ Less

Submitted 13 November, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

arXiv:2107.10357 [pdf]

Wideband photonic blind source separation with optical pulse sampling

Authors: Taichu Shi, Yang Qi, Weipeng Zhang, Paul R. Prucnal, Jie Li, Ben Wu

Abstract: We propose and experimentally demonstrate an optical pulse sampling method for photonic blind source separation. The photonic system processes and separates wideband signals based on the statistical information of the mixed signals and thus the sampling frequency can be orders of magnitude lower than the bandwidth of the signals. The ultra-fast optical pulse functions as a tweezer that collects sa… ▽ More We propose and experimentally demonstrate an optical pulse sampling method for photonic blind source separation. The photonic system processes and separates wideband signals based on the statistical information of the mixed signals and thus the sampling frequency can be orders of magnitude lower than the bandwidth of the signals. The ultra-fast optical pulse functions as a tweezer that collects samples of the signals at very low sampling rates, and each sample is short enough to maintain the statistical properties of the signals. The low sampling frequency reduces the workloads of the analog to digital conversion and digital signal processing systems. In the meantime, the short pulse sampling maintains the accuracy of the sampled signals, so the statistical properties of the undersampling signals are the same as the statistical properties of the original signals. With the optical pulses generated from a mode-locked laser, the optical pulse sampling system is able to process and separate mixed signals with bandwidth over 100GHz and achieves a dynamic range of 30dBでしべる. △ Less

Submitted 21 July, 2021; originally announced July 2021.

arXiv:2103.16849 [pdf, other]

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Authors: Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu

Abstract: In this paper, we exploit the effective way to leverage contextual information to improve the speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Tempo… ▽ More In this paper, we exploit the effective way to leverage contextual information to improve the speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Temporal Attention approach (FTA) is proposed, which models the correlations between the fullband information of the context frames. In addition, considering the difference between the attenuation of high frequency bands and low frequency bands (high frequency bands attenuate faster than low frequency bands) in the room impulse response (RIR), we also propose a SubBand based Temporal Attention approach (STA). In order to guide the network to be more aware of the reverberant environments, we jointly optimize the dereverberation network and the reverberation time (RT60) estimator in a multi-task manner. Our experimental results indicate that the proposed method outperforms our previously proposed reverberation-time-aware DNN and the learned attention weights are fully physical consistent. We also report a preliminary yet promising dereverberation and recognition experiment on real test data. △ Less

Submitted 26 August, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

Comments: Submitted to Interspeech 2021

arXiv:2012.10580 [pdf, other]

Identifying Invariant Texture Violation for Robust Deepfake Detection

Authors: Xinwei Sun, Botong Wu, Wei Chen

Abstract: Existing deepfake detection methods have reported promising in-distribution results, by accessing published large-scale dataset. However, due to the non-smooth synthesis method, the fake samples in this dataset may expose obvious artifacts (e.g., stark visual contrast, non-smooth boundary), which were heavily relied on by most of the frame-level detection methods above. As these artifacts do not c… ▽ More Existing deepfake detection methods have reported promising in-distribution results, by accessing published large-scale dataset. However, due to the non-smooth synthesis method, the fake samples in this dataset may expose obvious artifacts (e.g., stark visual contrast, non-smooth boundary), which were heavily relied on by most of the frame-level detection methods above. As these artifacts do not come up in real media forgeries, the above methods can suffer from a large degradation when applied to fake images that close to reality. To improve the robustness for high-realism fake data, we propose the Invariant Texture Learning (InTeLe) framework, which only accesses the published dataset with low visual quality. Our method is based on the prior that the microscopic facial texture of the source face is inevitably violated by the texture transferred from the target person, which can hence be regarded as the invariant characterization shared among all fake images. To learn such an invariance for deepfake detection, our InTeLe introduces an auto-encoder framework with different decoders for pristine and fake images, which are further appended with a shallow classifier in order to separate out the obvious artifact-effect. Equipped with such a separation, the extracted embedding by encoder can capture the texture violation in fake images, followed by the classifier for the final pristine/fake prediction. As a theoretical guarantee, we prove the identifiability of such an invariance texture violation, i.e., to be precisely inferred from observational data. The effectiveness and utility of our method are demonstrated by promising generalization ability from low-quality images with obvious artifacts to fake images with high realism. △ Less

Submitted 18 December, 2020; originally announced December 2020.

arXiv:2011.12985 [pdf, other]

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Authors: Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda

Abstract: Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal… ▽ More Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos. △ Less

Submitted 25 November, 2020; originally announced November 2020.

arXiv:2011.09162 [pdf, other]

WPD++: An Improved Neural Beamformer for Simultaneous Speech Separation and Dereverberation

Authors: Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shixiong Zhang, Dong Yu, Michael I Mandel

Abstract: This paper aims at eliminating the interfering speakers' speech, additive noise, and reverberation from the noisy multi-talker speech mixture that benefits automatic speech recognition (ASR) backend. While the recently proposed Weighted Power minimization Distortionless response (WPD) beamformer can perform separation and dereverberation simultaneously, the noise cancellation component still has t… ▽ More This paper aims at eliminating the interfering speakers' speech, additive noise, and reverberation from the noisy multi-talker speech mixture that benefits automatic speech recognition (ASR) backend. While the recently proposed Weighted Power minimization Distortionless response (WPD) beamformer can perform separation and dereverberation simultaneously, the noise cancellation component still has the potential to progress. We propose an improved neural WPD beamformer called "WPD++" by an enhanced beamforming module in the conventional WPD and a multi-objective loss function for the joint training. The beamforming module is improved by utilizing the spatio-temporal correlation. A multi-objective loss, including the complex spectra domain scale-invariant signal-to-noise ratio (C-Si-SNR) and the magnitude domain mean square error (Mag-MSE), is properly designed to make multiple constraints on the enhanced speech and the desired power of the dry clean signal. Joint training is conducted to optimize the complex-valued mask estimator and the WPD++ beamformer in an end-to-end way. The results show that the proposed WPD++ outperforms several state-of-the-art beamformers on the enhanced speech quality and word error rate (WER) of ASR. △ Less

Submitted 18 November, 2020; originally announced November 2020.

Comments: accepted by SLT 2021

arXiv:2011.07755 [pdf, other]

Audio-visual Multi-channel Integration and Recognition of Overlapped Speech

Authors: Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu

Abstract: Automatic speech recognition (ASR) technologies have been significantly advanced in the past few decades. However, recognition of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in current ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption and the additional cues they provide to sep… ▽ More Automatic speech recognition (ASR) technologies have been significantly advanced in the past few decades. However, recognition of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in current ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption and the additional cues they provide to separate the target speaker from the interfering sound sources, this paper presents an audio-visual multi-channel based recognition system for overlapped speech. It benefits from a tight integration between a speech separation front-end and recognition back-end, both of which incorporate additional video input. A series of audio-visual multi-channel speech separation front-end components based on TF masking, Filter&Sum and mask-based MVDR neural channel integration approaches are developed. To reduce the error cost mismatch between the separation and recognition components, the entire system is jointly fine-tuned using a multi-task criterion interpolation of the scale-invariant signal to noise ratio (Si-SNR) with either the connectionist temporal classification (CTC), or lattice-free maximum mutual information (LF-MMI) loss function. Experiments suggest that: the proposed audio-visual multi-channel recognition system outperforms the baseline audio-only multi-channel ASR system by up to 8.04% (31.68% relative) and 22.86% (58.51% relative) absolute WER reduction on overlapped speech constructed using either simulation or replaying of the LRS2 dataset respectively. Consistent performance improvements are also obtained using the proposed audio-visual multi-channel recognition system when using occluded video input with the face region randomly covered up to 60%. △ Less

Submitted 30 August, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

Comments: TASLP 2021

arXiv:2010.11607 [pdf, other]

Backdoor Attack against Speaker Verification

Authors: Tongqing Zhai, Yiming Li, Ziqi Zhang, Baoyuan Wu, Yong Jiang, Shu-Tao Xia

Abstract: Speaker verification has been widely and successfully adopted in many mission-critical areas for user identification. The training of speaker verification requires a large amount of data, therefore users usually need to adopt third-party data ($e.g.$, data from the Internet or third-party data company). This raises the question of whether adopting untrusted third-party data can pose a security thr… ▽ More Speaker verification has been widely and successfully adopted in many mission-critical areas for user identification. The training of speaker verification requires a large amount of data, therefore users usually need to adopt third-party data ($e.g.$, data from the Internet or third-party data company). This raises the question of whether adopting untrusted third-party data can pose a security threat. In this paper, we demonstrate that it is possible to inject the hidden backdoor for infecting speaker verification models by poisoning the training data. Specifically, we design a clustering-based attack scheme where poisoned samples from different clusters will contain different triggers ($i.e.$, pre-defined utterances), based on our understanding of verification tasks. The infected models behave normally on benign samples, while attacker-specified unenrolled triggers will successfully pass the verification even if the attacker has no information about the enrolled speaker. We also demonstrate that existing backdoor attacks cannot be directly adopted in attacking speaker verification. Our approach not only provides a new perspective for designing novel attacks, but also serves as a strong baseline for improving the robustness of verification methods. The code for reproducing main results is available at \url{https://github.com/zhaitongqing233/Backdoor-attack-against-speaker-verification}. △ Less

Submitted 2 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: Accepted by the ICASSP 2021. The first two authors contributed equally to this work

arXiv:2009.00155 [pdf, other]

A Review of Single-Source Deep Unsupervised Visual Domain Adaptation

Authors: Sicheng Zhao, Xiangyu Yue, Shanghang Zhang, Bo Li, Han Zhao, Bichen Wu, Ravi Krishna, Joseph E. Gonzalez, Alberto L. Sangiovanni-Vincentelli, Sanjit A. Seshia, Kurt Keutzer

Abstract: Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks. However, in many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another s… ▽ More Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks. However, in many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain. Unfortunately, direct transfer across domains often performs poorly due to the presence of domain shift or dataset bias. Domain adaptation is a machine learning paradigm that aims to learn a model from a source domain that can perform well on a different (but related) target domain. In this paper, we review the latest single-source deep unsupervised domain adaptation methods focused on visual tasks and discuss new perspectives for future research. We begin with the definitions of different domain adaptation strategies and the descriptions of existing benchmark datasets. We then summarize and compare different categories of single-source unsupervised domain adaptation methods, including discrepancy-based methods, adversarial discriminative methods, adversarial generative methods, and self-supervision-based methods. Finally, we discuss future research directions with challenges and possible solutions. △ Less

Submitted 18 September, 2020; v1 submitted 31 August, 2020; originally announced September 2020.

arXiv:2008.04768 [pdf, other]

Constrained Active Classification Using Partially Observable Markov Decision Processes

Authors: Bo Wu, Niklas Lauffer, Mohamadreza Ahmadi, Suda Bharadwaj, Zhe Xu, Ufuk Topcu

Abstract: In this work, we study the problem of actively classifying the attributes of dynamical systems characterized as a finite set of Markov decision process (MDP) models. We are interested in finding strategies that actively interact with the dynamical system and observe its reactions so that the attribute of interest is classified efficiently with high confidence. We present a decision-theoretic frame… ▽ More In this work, we study the problem of actively classifying the attributes of dynamical systems characterized as a finite set of Markov decision process (MDP) models. We are interested in finding strategies that actively interact with the dynamical system and observe its reactions so that the attribute of interest is classified efficiently with high confidence. We present a decision-theoretic framework based on partially observable Markov decision processes (POMDPs). The proposed framework relies on assigning a classification belief (a probability distribution) to the attributes of interest. Given an initial belief, a confidence level over which a classification decision can be made, a cost bound, safe belief sets, and a finite time horizon, we compute POMDP strategies leading to classification decisions. We present three different algorithms to compute such strategies. The first algorithm computes the optimal strategy exactly by value iteration. To overcome the computational complexity of computing the exact solutions, we propose a second algorithm based on adaptive sampling and a third based on a Monte Carlo tree search to approximate the optimal probability of reaching a classification decision. We illustrate the proposed methodology using examples from medical diagnosis, security surveillance, and wildlife classification. △ Less

Submitted 4 January, 2023; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1810.00097

arXiv:2008.00164 [pdf, other]

Byzantine-Resilient Distributed Hypothesis Testing With Time-Varying Network Topology

Authors: Bo Wu, Steven Carr, Suda Bharadwaj, Zhe Xu, Ufuk Topcu

Abstract: We study the problem of distributed hypothesis testing over a network of mobile agents with limited communication and sensing ranges to infer the true hypothesis collaboratively. In particular, we consider a scenario where there is an unknown subset of compromised agents that may deliberately share altered information to undermine the team objective. We propose two distributed algorithms where eac… ▽ More We study the problem of distributed hypothesis testing over a network of mobile agents with limited communication and sensing ranges to infer the true hypothesis collaboratively. In particular, we consider a scenario where there is an unknown subset of compromised agents that may deliberately share altered information to undermine the team objective. We propose two distributed algorithms where each agent maintains and updates two sets of beliefs (i.e., probability distributions over the hypotheses), namely local and actual beliefs (LB and AB respectively for brevity). In both algorithms, at every time step, each agent shares its AB with other agents within its communication range and makes a local observation to update its LB. Then both algorithms can use the shared information to update ABs under certain conditions. One requires receiving a certain number of shared ABs at each time instant; the other accumulates shared ABs over time and updates after the number of shared ABs exceeds a prescribed threshold. Otherwise, both algorithms rely on the agent's current LB and AB to update the new AB. We prove under mild assumptions that the AB for every non-compromised agent converges almost surely to the true hypothesis, without requiring connectivity in the underlying time-varying network topology. Using a simulation of a team of unmanned aerial vehicles aiming to classify adversarial agents among themselves, we illustrate and compare the proposed algorithms. Finally, we show experimentally that the second algorithm consistently outperforms the first algorithm in terms of the speed of convergence. △ Less

Submitted 17 July, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

arXiv:2007.15874 [pdf]

Residual-CycleGAN based Camera Adaptation for Robust Diabetic Retinopathy Screening

Authors: Dalu Yang, Yehui Yang, Tiantian Huang, Binghong Wu, Lei Wang, Yanwu Xu

Abstract: There are extensive researches focusing on automated diabetic reti-nopathy (DR) detection from fundus images. However, the accuracy drop is ob-served when applying these models in real-world DR screening, where the fun-dus camera brands are different from the ones used to capture the training im-ages. How can we train a classification model on labeled fundus images ac-quired from only one camera b… ▽ More There are extensive researches focusing on automated diabetic reti-nopathy (DR) detection from fundus images. However, the accuracy drop is ob-served when applying these models in real-world DR screening, where the fun-dus camera brands are different from the ones used to capture the training im-ages. How can we train a classification model on labeled fundus images ac-quired from only one camera brand, yet still achieves good performance on im-ages taken by other brands of cameras? In this paper, we quantitatively verify the impact of fundus camera brands related domain shift on the performance of DR classification models, from an experimental perspective. Further, we pro-pose camera-oriented residual-CycleGAN to mitigate the camera brand differ-ence by domain adaptation and achieve increased classification performance on target camera images. Extensive ablation experiments on both the EyePACS da-taset and a private dataset show that the camera brand difference can signifi-cantly impact the classification performance and prove that our proposed meth-od can effectively improve the model performance on the target domain. We have inferred and labeled the camera brand for each image in the EyePACS da-taset and will publicize the camera brand labels for further research on domain adaptation. △ Less

Submitted 31 July, 2020; originally announced July 2020.

arXiv:2007.15114 [pdf, other]

doi 10.1371/journal.pone.0247660

Control Strategies for COVID-19 Epidemic with Vaccination, Shield Immunity and Quarantine: A Metric Temporal Logic Approach

Authors: Zhe Xu, Bo Wu, Ufuk Topcu

Abstract: Ever since the outbreak of the COVID-19 epidemic, various public health control strategies have been proposed and tested against the coronavirus SARS-CoV-2. We study three specific COVID-19 epidemic control models: the susceptible, exposed, infectious, recovered (SEIR) model with vaccination control; the SEIR model with shield immunity control; and the susceptible, un-quarantined infected, quarant… ▽ More Ever since the outbreak of the COVID-19 epidemic, various public health control strategies have been proposed and tested against the coronavirus SARS-CoV-2. We study three specific COVID-19 epidemic control models: the susceptible, exposed, infectious, recovered (SEIR) model with vaccination control; the SEIR model with shield immunity control; and the susceptible, un-quarantined infected, quarantined infected, confirmed infected (SUQC) model with quarantine control. We express the control requirement in metric temporal logic (MTL) formulas (a type of formal specification languages) which can specify the expected control outcomes such as "the deaths from the infection should never exceed one thousand per day within the next three months" or "the population immune from the disease should eventually exceed 200 thousand within the next 100 to 120 days". We then develop methods for synthesizing control strategies with MTL specifications. To the best of our knowledge, this is the first paper to systematically synthesize control strategies based on the COVID-19 epidemic models with formal specifications. We provide simulation results in three different case studies: vaccination control for the COVID-19 epidemic with model parameters estimated from data in Lombardy, Italy; shield immunity control for the COVID-19 epidemic with model parameters estimated from data in Lombardy, Italy; and quarantine control for the COVID-19 epidemic with model parameters estimated from data in Wuhan, China. The results show that the proposed synthesis approach can generate control inputs such that the time-varying numbers of individuals in each category (e.g., infectious, immune) satisfy the MTL specifications. The results also show that early intervention is essential in mitigating the spread of COVID-19, and more control effort is needed for more stringent MTL specifications. △ Less

Submitted 12 August, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:2007.01566 [pdf, other]

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Authors: Bo Wu, Meng Yu, Lianwu Chen, Yong Xu, Chao Weng, Dan Su, Dong Yu

Abstract: Speech enhancement techniques based on deep learning have brought significant improvement on speech quality and intelligibility. Nevertheless, a large gain in speech quality measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance due to speech distortion in the enhancement stage. In this paper, a mu… ▽ More Speech enhancement techniques based on deep learning have brought significant improvement on speech quality and intelligibility. Nevertheless, a large gain in speech quality measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance due to speech distortion in the enhancement stage. In this paper, a multi-channel dilated convolutional network based frequency domain modeling is presented to enhance target speaker in the far-field, noisy and multi-talker conditions. We study three approaches towards distortionless waveforms for overlapped speech recognition: estimating complex ideal ratio mask with an infinite range, incorporating the fbank loss in a multi-objective learning and finetuning the enhancement model by an acoustic model. Experimental results proved the effectiveness of all three approaches on reducing speech distortions and improving recognition accuracy. Particularly, the jointly tuned enhancement model works very well with other standalone acoustic model on real test data. △ Less

Submitted 3 July, 2020; originally announced July 2020.

arXiv:2006.08357 [pdf, other]

CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs

Authors: Zhen Dong, Dequan Wang, Qijing Huang, Yizhao Gao, Yaohui Cai, Tian Li, Bichen Wu, Kurt Keutzer, John Wawrzynek

Abstract: Deploying deep learning models on embedded systems has been challenging due to limited computing resources. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and… ▽ More Deploying deep learning models on embedded systems has been challenging due to limited computing resources. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this need, recent work introduces dynamic deformable convolution to augment regular convolutions. However, this will lead to inefficient memory accesses of inputs with existing hardware. In this work, we harness the flexibility of FPGAs to develop a novel object detection pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a set of algorithm modifications including irregular-access versus limited-range and fixed-shape. We then Co-Design a Network CoDeNet with the modified deformable convolution and quantize it to 4-bit weights and 8-bit activations. With our high-efficiency implementation, our solution reaches 26.9 frames per second with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object detection dataset, Pascal VOC. With our higher accuracy implementation, our model gets to 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters-20.9x smaller but 10% more accurate than Tiny-YOLO. △ Less

Submitted 25 January, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: Github repo: https://github.com/DequanWang/CoDeNet arXiv:2002.08357 is the preliminary version of this paper

Journal ref: FPGA 2021

arXiv:2006.03677 [pdf, other]

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Authors: Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

Abstract: Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm b… ▽ More Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x. △ Less

Submitted 19 November, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

arXiv:2005.10386 [pdf, other]

End-to-End Multi-Look Keyword Spotting

Authors: Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu

Abstract: The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. In this paper, we propose a multi-look neural network modeling for speech enhancement which simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS m… ▽ More The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. In this paper, we propose a multi-look neural network modeling for speech enhancement which simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model which integrates the enhanced signals from multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves the KWS performance against the baseline KWS system and a recent beamformer based multi-beam KWS system. △ Less

Submitted 20 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech2020

arXiv:2005.08571 [pdf, other]

Audio-visual Multi-channel Recognition of Overlapped Speech

Authors: Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu. Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng

Abstract: Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separatio… ▽ More Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on \textit{TF masking}, \textit{filter\&sum} and \textit{mask-based MVDR} beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolation with scale-invariant signal to noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81\% (26.83\% relative) and 22.22\% (56.87\% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 (LRS2) dataset respectively. △ Less

Submitted 18 November, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: submitted to Interspeech 2020

arXiv:2002.08357 [pdf, other]

Algorithm-hardware Co-design for Deformable Convolution

Authors: Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer, John Wawrzynek

Abstract: FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the s… ▽ More FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all the spatial locations in an image, while dynamic deformable convolutions may access arbitrary pixels in the image and the access pattern is input-dependent and varies per spatial location. These properties lead to inefficient memory accesses of inputs with existing hardware. In this work, we first investigate the overhead of the deformable convolution on embedded FPGA SoCs, and then show the accuracy-latency tradeoffs for a set of algorithm modifications including full versus depthwise, fixed-shape, and limited-range. These modifications benefit the energy efficiency for embedded devices in general as they reduce the compute complexity. We then build an efficient object detection network with modified deformable convolutions and quantize the network using state-of-the-art quantization methods. We implement a unified hardware engine on FPGA to support all the operations in the network. Preliminary experiments show that little accuracy is compromised and speedup can be achieved with our co-design optimization for the deformable convolution. △ Less

Submitted 18 February, 2020; originally announced February 2020.

Journal ref: NeurIPS EMC2 2019

arXiv:2001.11539 [pdf, other]

Adversarial Code Learning for Image Generation

Authors: Jiangbo Yuan, Bing Wu, Wanying Ding, Qing Ping, Zhendong Yu

Abstract: We introduce the "adversarial code learning" (ACL) module that improves overall image generation performance to several types of deep models. Instead of performing a posterior distribution modeling in the pixel spaces of generators, ACLs aim to jointly learn a latent code with another image encoder/inference net, with a prior noise as its input. We conduct the learning in an adversarial learning p… ▽ More We introduce the "adversarial code learning" (ACL) module that improves overall image generation performance to several types of deep models. Instead of performing a posterior distribution modeling in the pixel spaces of generators, ACLs aim to jointly learn a latent code with another image encoder/inference net, with a prior noise as its input. We conduct the learning in an adversarial learning process, which bears a close resemblance to the original GAN but again shifts the learning from image spaces to prior and latent code spaces. ACL is a portable module that brings up much more flexibility and possibilities in generative model designs. First, it allows flexibility to convert non-generative models like Autoencoders and standard classification models to decent generative models. Second, it enhances existing GANs' performance by generating meaningful codes and images from any part of the prior. We have incorporated our ACL module with the aforementioned frameworks and have performed experiments on synthetic, MNIST, CIFAR-10, and CelebA datasets. Our models have achieved significant improvements which demonstrated the generality for image generation tasks. △ Less

Submitted 30 January, 2020; originally announced January 2020.

arXiv:2001.09227 [pdf, other]

doi 10.1109/CDC42340.2020.9304190

Active Task-Inference-Guided Deep Inverse Reinforcement Learning

Authors: Farzan Memarian, Zhe Xu, Bo Wu, Min Wen, Ufuk Topcu

Abstract: We consider the problem of reward learning for temporally extended tasks. For reward learning, inverse reinforcement learning (IRL) is a widely used paradigm. Given a Markov decision process (MDP) and a set of demonstrations for a task, IRL learns a reward function that assigns a real-valued reward to each state of the MDP. However, for temporally extended tasks, the underlying reward function may… ▽ More We consider the problem of reward learning for temporally extended tasks. For reward learning, inverse reinforcement learning (IRL) is a widely used paradigm. Given a Markov decision process (MDP) and a set of demonstrations for a task, IRL learns a reward function that assigns a real-valued reward to each state of the MDP. However, for temporally extended tasks, the underlying reward function may not be expressible as a function of individual states of the MDP. Instead, the history of visited states may need to be considered to determine the reward at the current state. To address this issue, we propose an iterative algorithm to learn a reward function for temporally extended tasks. At each iteration, the algorithm alternates between two modules, a task inference module that infers the underlying task structure and a reward learning module that uses the inferred task structure to learn a reward function. The task inference module produces a series of queries, where each query is a sequence of subgoals. The demonstrator provides a binary response to each query by attempting to execute it in the environment and observing the environment's feedback. After the queries are answered, the task inference module returns an automaton encoding its current hypothesis of the task structure. The reward learning module augments the state space of the MDP with the states of the automaton. The module then proceeds to learn a reward function over the augmented state space using a novel deep maximum entropy IRL algorithm. This iterative process continues until it learns a reward function with satisfactory performance. The experiments show that the proposed algorithm significantly outperforms several IRL baselines on temporally extended tasks. △ Less

Submitted 10 September, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

Comments: To be published in IEEE Conference on Decision and Control (CDC) 2020

arXiv:2001.05685 [pdf, other]

SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis

Authors: Bohan Zhai, Tianren Gao, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph E. Gonzalez, Kurt Keutzer

Abstract: Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGl… ▽ More Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at https://github.com/tianrengao/SqueezeWave. △ Less

Submitted 16 January, 2020; originally announced January 2020.

arXiv:2001.01656 [pdf, other]

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Authors: Jianwei Yu, Shi-Xiong Zhang, Jian Wu, Shahram Ghorbani, Bo Wu, Shiyin Kang, Shansong Liu, Xunying Liu, Helen Meng, Dong Yu

Abstract: Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs i.e. end-… ▽ More Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs i.e. end-to-end and hybrid of AVSR systems are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state-of-the-art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest the proposed AVSR system outperformed the audio only baseline LF-MMI DNN system by up to 29.98\% absolute in word error rate (WER) reduction, and produced recognition performance comparable to a more complex pipelined system. Consistent performance improvements of 4.89\% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained. △ Less

Submitted 6 January, 2020; originally announced January 2020.

Comments: 5 pages, 5 figures, submitted to icassp2019

arXiv:2001.00835 [pdf, other]

Policy Synthesis for Switched Linear Systems with Markov Decision Process Switching

Authors: Bo Wu, Murat Cubuktepe, Franck Djeumou, Zhe Xu, Ufuk Topcu

Abstract: We study the synthesis of mode switching protocols for a class of discrete-time switched linear systems in which the mode jumps are governed by Markov decision processes (MDPs). We call such systems MDP-JLS for brevity. Each state of the MDP corresponds to a mode in the switched system. The probabilistic state transitions in the MDP represent the mode transitions. We focus on finding a policy that… ▽ More We study the synthesis of mode switching protocols for a class of discrete-time switched linear systems in which the mode jumps are governed by Markov decision processes (MDPs). We call such systems MDP-JLS for brevity. Each state of the MDP corresponds to a mode in the switched system. The probabilistic state transitions in the MDP represent the mode transitions. We focus on finding a policy that selects the switching actions at each mode such that the switched system that follows these actions is guaranteed to be stable. Given a policy in the MDP, the considered MDP-JLS reduces to a Markov jump linear system (MJLS). {We consider both mean-square stability and stability with probability one. For mean-square stability, we leverage existing stability conditions for MJLSs and propose efficient semidefinite programming formulations to find a stabilizing policy in the MDP. For stability with probability one, we derive new sufficient conditions and compute a stabilizing policy using linear programming. We also extend the policy synthesis results to MDP-JLS with uncertain mode transition probabilities. △ Less

Submitted 3 January, 2020; originally announced January 2020.

Comments: arXiv admin note: text overlap with arXiv:1904.11456

arXiv:1912.06986 [pdf]

doi 10.1109/TNANO.2020.2999751

Erase-hidden and Drivability-improved Magnetic Non-Volatile Flip-Flops with NAND-SPIN Devices

Authors: Ziyi Wang, Zhaohao Wang, Yansong Xu, Bi Wu, Weisheng Zhao

Abstract: Non-volatile flip-flops (NVFFs) using power gating techniques promise to overcome the soaring leakage power consumption issue with the scaling of CMOS technology. Magnetic tunnel junction (MTJ) is a good candidate for constructing the NVFF thanks to its low power, high speed, good CMOS compatibility, etc. In this paper, we propose a novel magnetic NVFF based on an emerging memory device called NAN… ▽ More Non-volatile flip-flops (NVFFs) using power gating techniques promise to overcome the soaring leakage power consumption issue with the scaling of CMOS technology. Magnetic tunnel junction (MTJ) is a good candidate for constructing the NVFF thanks to its low power, high speed, good CMOS compatibility, etc. In this paper, we propose a novel magnetic NVFF based on an emerging memory device called NAND-SPIN. The data writing of NAND-SPIN is achieved by successively applying two unidirectional currents, which respectively generate the spin orbit torque (SOT) and spin transfer torque (STT) for erase and programming operations. This characteristic allows us to design an erase-hidden and drivability-improved magnetic NVFF. Furthermore, more design flexibility could be obtained since the backup operation of the proposed NVFF is not limited by the inherent slave latch. Simulation results show that our proposed NVFF achieves performance improvement in terms of power, delay and area, compared with conventional slave-latch-driven SOT-NVFF designs. △ Less

Submitted 17 June, 2020; v1 submitted 15 December, 2019; originally announced December 2019.

Comments: This article has been accepted in a future issue of IEEE Transactions on Nanotechnology: Regular Papers

arXiv:1910.13825 [pdf, ps, other]

Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

Authors: Bo Wu, Meng Yu, Lianwu Chen, Chao Weng, Dan Su, Dong Yu

Abstract: We propose an end-to-end joint optimization framework of a multi-channel neural speech extraction and deep acoustic model without mel-filterbank (FBANK) extraction for overlapped speech recognition. First, based on a multi-channel convolutional TasNet with STFT kernel, we unify the multi-channel target speech enhancement front-end network and a convolutional, long short-term memory and fully conne… ▽ More We propose an end-to-end joint optimization framework of a multi-channel neural speech extraction and deep acoustic model without mel-filterbank (FBANK) extraction for overlapped speech recognition. First, based on a multi-channel convolutional TasNet with STFT kernel, we unify the multi-channel target speech enhancement front-end network and a convolutional, long short-term memory and fully connected deep neural network (CLDNN) based acoustic model (AM) with the FBANK extraction layer to build a hybrid neural network, which is thus jointly updated only by the recognition loss. The proposed framework achieves 28% word error rate reduction (WERR) over a separately optimized system on AISHELL-1 and shows consistent robustness to signal to interference ratio (SIR) and angle difference between overlapping speakers. Next, a further exploration shows that the speech recognition is improved with a simplified structure by replacing the FBANK extraction layer in the joint model with a learnable feature projection. Finally, we also perform the objective measurement of speech quality on the reconstructed waveform from the enhancement network in the joint model. △ Less

Submitted 30 October, 2019; originally announced October 2019.

arXiv:1909.09939 [pdf, other]

Controller Synthesis for Multi-Agent Systems With Intermittent Communication: A Metric Temporal Logic Approach

Authors: Zhe Xu, Federico M. Zegers, Bo Wu, Warren Dixon, Ufuk Topcu

Abstract: This paper develops a controller synthesis approach for a multi-agent system (MAS) with intermittent communication. We adopt a leader-follower scheme, where a mobile leader with absolute position sensors switches among a set of followers without absolute position sensors to provide each follower with intermittent state information.We model the MAS as a switched system. The followers are to asympto… ▽ More This paper develops a controller synthesis approach for a multi-agent system (MAS) with intermittent communication. We adopt a leader-follower scheme, where a mobile leader with absolute position sensors switches among a set of followers without absolute position sensors to provide each follower with intermittent state information.We model the MAS as a switched system. The followers are to asymptotically reach a predetermined consensus state. To guarantee the stability of the switched system and the consensus of the followers, we derive maximum and minimal dwell-time conditions to constrain the intervals between consecutive time instants at which the leader should provide state information to the same follower. Furthermore, the leader needs to satisfy practical constraints such as charging its battery and staying in specific regions of interest. Both the maximum and minimum dwell-time conditions and these practical constraints can be expressed by metric temporal logic (MTL) specifications. We iteratively compute the optimal control inputs such that the leader satisfies the MTL specifications, while guaranteeing stability and consensus of the followers. We implement the proposed method on a case study with three mobile robots as the followers and one quadrotor as the leader. △ Less

Submitted 22 September, 2019; originally announced September 2019.

arXiv:1907.03988 [pdf, other]

doi 10.1109/ICASSP40776.2020.9052932

Improving Reverberant Speech Training Using Diffuse Acoustic Simulation

Authors: Zhenyu Tang, Lianwu Chen, Bo Wu, Dong Yu, Dinesh Manocha

Abstract: We present an efficient and realistic geometric acoustic simulation approach for generating and augmenting training data in speech-related machine learning tasks. Our physically-based acoustic simulation method is capable of modeling occlusion, specular and diffuse reflections of sound in complicated acoustic environments, whereas the classical image method can only model specular reflections in s… ▽ More We present an efficient and realistic geometric acoustic simulation approach for generating and augmenting training data in speech-related machine learning tasks. Our physically-based acoustic simulation method is capable of modeling occlusion, specular and diffuse reflections of sound in complicated acoustic environments, whereas the classical image method can only model specular reflections in simple room settings. We show that by using our synthetic training data, the same neural networks gain significant performance improvement on real test sets in far-field speech recognition by 1.58% and keyword spotting by 21%, without fine-tuning using real impulse responses. △ Less

Submitted 2 April, 2020; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: Accepted to ICASSP 2020, impulse response generation code at https://github.com/RoyJames/pygsound

Journal ref: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6969-6973)

arXiv:1906.00884 [pdf, other]

Fashion Editing with Adversarial Parsing Learning

Authors: Haoye Dong, Xiaodan Liang, Yixuan Zhang, Xujie Zhang, Zhenyu Xie, Bowen Wu, Ziqi Zhang, Xiaohui Shen, Jian Yin

Abstract: Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value. Existing works often treat it as a general inpainting task and do not fully leverage the semantic structural information in fashion images. Moreover, they directly utilize conventional convolution and normalization layers to re… ▽ More Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value. Existing works often treat it as a general inpainting task and do not fully leverage the semantic structural information in fashion images. Moreover, they directly utilize conventional convolution and normalization layers to restore the incomplete image, which tends to wash away the sketch and color information. In this paper, we propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which is capable of manipulating fashion images by free-form sketches and sparse color strokes. FE-GAN consists of two modules: 1) a free-form parsing network that learns to control the human parsing generation by manipulating sketch and color; 2) a parsing-aware inpainting network that renders detailed textures with semantic guidance from the human parsing map. A new attention normalization layer is further applied at multiple scales in the decoder of the inpainting network to enhance the quality of the synthesized image. Extensive experiments on high-resolution fashion image datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods on image manipulation. △ Less

Submitted 28 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

Comments: 22 pages, 18 figures

arXiv:1905.08095 [pdf, other]

Control Theory Meets POMDPs: A Hybrid Systems Approach

Authors: Mohamadreza Ahmadi, Nils Jansen, Bo Wu, Ufuk Topcu

Abstract: Partially observable Markov decision processes (POMDPs) provide a modeling framework for a variety of sequential decision making under uncertainty scenarios in artificial intelligence (AI). Since the states are not directly observable in a POMDP, decision making has to be performed based on the output of a Bayesian filter (continuous beliefs). Hence, POMDPs are often computationally intractable to… ▽ More Partially observable Markov decision processes (POMDPs) provide a modeling framework for a variety of sequential decision making under uncertainty scenarios in artificial intelligence (AI). Since the states are not directly observable in a POMDP, decision making has to be performed based on the output of a Bayesian filter (continuous beliefs). Hence, POMDPs are often computationally intractable to solve exactly and researchers resort to approximate methods often using discretizations of the continuous belief space. These approximate solutions are, however, prone to discretization errors, which has made POMDPs ineffective in applications, wherein guarantees for safety, optimality, or performance are required. To overcome the complexity challenge of POMDPs, we apply notions from control theory. The goal is to determine the reachable belief space of a POMDP, that is, the set of all possible evolutions given an initial belief distribution over the states and a set of actions and observations. We begin by casting the problem of analyzing a POMDP into analyzing the behavior of a discrete-time switched system. For estimating the reachable belief space, we find over-approximations in terms of sub-level sets of Lyapunov functions. Furthermore, in order to verify safety and optimality requirements of a given POMDP, we formulate a barrier certificate theorem, wherein we show that if there exists a barrier certificate satisfying a set of inequalities along with the belief update equation of the POMDP, the safety and optimality properties are guaranteed to hold. In both cases, we show how the calculations can be decomposed into smaller problems that can be solved in parallel. The conditions we formulate can be computationally implemented as a set of sum-of-squares programs. We illustrate the applicability of our method by addressing two problems in active ad scheduling and machine teaching. △ Less

Submitted 17 May, 2019; originally announced May 2019.

Comments: arXiv admin note: text overlap with arXiv:1810.00093

arXiv:1904.08031 [pdf, ps, other]

Hard Sample Mining for the Improved Retraining of Automatic Speech Recognition

Authors: Jiabin Xue, Jiqing Han, Tieran Zheng, Jiaxing Guo, Boyong Wu

Abstract: It is an effective way that improves the performance of the existing Automatic Speech Recognition (ASR) systems by retraining with more and more new training data in the target domain. Recently, Deep Neural Network (DNN) has become a successful model in the ASR field. In the training process of the DNN based methods, a back propagation of error between the transcription and the corresponding annot… ▽ More It is an effective way that improves the performance of the existing Automatic Speech Recognition (ASR) systems by retraining with more and more new training data in the target domain. Recently, Deep Neural Network (DNN) has become a successful model in the ASR field. In the training process of the DNN based methods, a back propagation of error between the transcription and the corresponding annotated text is used to update and optimize the parameters. Thus, the parameters are more influenced by the training samples with a big propagation error than the samples with a small one. In this paper, we define the samples with significant error as the hard samples and try to improve the performance of the ASR system by adding many of them. Unfortunately, the hard samples are sparse in the training data of the target domain, and manually label them is expensive. Therefore, we propose a hard samples mining method based on an enhanced deep multiple instance learning, which can find the hard samples from unlabeled training data by using a small subset of the dataset with manual labeling in the target domain. We applied our method to an End2End ASR task and obtained the best performance. △ Less

Submitted 16 April, 2019; originally announced April 2019.

Comments: Submitted to Interspeech 2019;

Showing 1–50 of 67 results for author: Wu, B