-
Context-Aware Optimal Transport Learning for Retinal Fundus Image Enhancement
Authors:
Vamsi Krishna Vasa,
Peijie Qiu,
Wenhui Zhu,
Yujian Xiong,
Oana Dumitrascu,
Yalin Wang
Abstract:
Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. The fundus image enhancement is typically formulated as a distributi…
▽ More
Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. The fundus image enhancement is typically formulated as a distribution alignment problem, by finding a one-to-one mapping between a low-quality image and its high-quality counterpart. This paper proposes a context-informed optimal transport (OT) learning framework for tackling unpaired fundus image enhancement. In contrast to standard generative image enhancement methods, which struggle with handling contextual information (e.g., over-tampered local structures and unwanted artifacts), the proposed context-aware OT learning paradigm better preserves local structures and minimizes unwanted artifacts. Leveraging deep contextual features, we derive the proposed context-aware OT using the earth mover's distance and show that the proposed context-OT has a solid theoretical guarantee. Experimental results on a large-scale dataset demonstrate the superiority of the proposed method over several state-of-the-art supervised and unsupervised methods in terms of signal-to-noise ratio, structural similarity index, as well as two downstream tasks. The code is available at \url{https://github.com/Retinal-Research/Contextual-OT}.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Authors:
Qian Wang,
Zhaoyang Bu,
Jiaxuan Mao,
Wenyu Zhu,
Jingya Zhao,
Wei Du,
Guochao Shi,
Min Zhou,
Si Chen,
Jieming Qu
Abstract:
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or dee…
▽ More
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
△ Less
Submitted 2 September, 2024; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Joint Transmission and Compression Optimization for Networked Sensing with Limited-Capacity Fronthaul Links
Authors:
Weifeng Zhu,
Shuowen Zhang,
Liang Liu
Abstract:
This paper considers networked sensing in cellular network, where multiple base stations (BSs) first compress their received echo signals from multiple targets and then forward the quantized signals to the central unit (CU) via limited-capacity fronthaul links, such that the CU can leverage all useful echo signals to perform high-resolution localization. Under this setup, we manage to characterize…
▽ More
This paper considers networked sensing in cellular network, where multiple base stations (BSs) first compress their received echo signals from multiple targets and then forward the quantized signals to the central unit (CU) via limited-capacity fronthaul links, such that the CU can leverage all useful echo signals to perform high-resolution localization. Under this setup, we manage to characterize the posterior Cramer-Rao Bound (PCRB) for localizing all the targets with random positions as a function of the transmit covariance matrix and the compression noise covariance matrix of each BS. Then, a PCRB minimization problem subject to the transmit power constraints and the fronthaul capacity constraints is formulated to jointly design the BSs' transmission and compression strategies. We propose an efficient algorithm to solve this problem based on the alternating optimization technique. Specifically, it is shown that when either the transmit covariance matrices or the compression noise covariance matrices are fixed, the successive convex approximation (SCA) technique can be leveraged to optimize the other type of covariance matrices locally optimally. Moreover, we also propose a novel estimate-then-beamform-then-compress strategy for the massive receive antenna scenario, under which each BS first estimates targets' angle-of-arrivals (AOAs) locally, then beamforms its high-dimension received signals into low-dimension ones based on the estimated AOAs, and last compresses the beamformed signals for fronthaul transmission. An efficient beamforming and compression design method is devised under this strategy. Numerical results are provided to verify the effectiveness of our proposed algorithms.
△ Less
Submitted 6 September, 2024; v1 submitted 6 August, 2024;
originally announced August 2024.
-
CC-DCNet: Dynamic Convolutional Neural Network with Contrastive Constraints for Identifying Lung Cancer Subtypes on Multi-modality Images
Authors:
Yuan Jin,
Gege Ma,
Geng Chen,
Tianling Lyu,
Jan Egger,
Junhui Lyu,
Shaoting Zhang,
Wentao Zhu
Abstract:
The accurate diagnosis of pathological subtypes of lung cancer is of paramount importance for follow-up treatments and prognosis managements. Assessment methods utilizing deep learning technologies have introduced novel approaches for clinical diagnosis. However, the majority of existing models rely solely on single-modality image input, leading to limited diagnostic accuracy. To this end, we prop…
▽ More
The accurate diagnosis of pathological subtypes of lung cancer is of paramount importance for follow-up treatments and prognosis managements. Assessment methods utilizing deep learning technologies have introduced novel approaches for clinical diagnosis. However, the majority of existing models rely solely on single-modality image input, leading to limited diagnostic accuracy. To this end, we propose a novel deep learning network designed to accurately classify lung cancer subtype with multi-dimensional and multi-modality images, i.e., CT and pathological images. The strength of the proposed model lies in its ability to dynamically process both paired CT-pathological image sets as well as independent CT image sets, and consequently optimize the pathology-related feature extractions from CT images. This adaptive learning approach enhances the flexibility in processing multi-dimensional and multi-modality datasets and results in performance elevating in the model testing phase. We also develop a contrastive constraint module, which quantitatively maps the cross-modality associations through network training, and thereby helps to explore the "gold standard" pathological information from the corresponding CT scans. To evaluate the effectiveness, adaptability, and generalization ability of our model, we conducted extensive experiments on a large-scale multi-center dataset and compared our model with a series of state-of-the-art classification models. The experimental results demonstrated the superiority of our model for lung cancer subtype classification, showcasing significant improvements in accuracy metrics such as ACC, AUC, and F1-score.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
RBAD: A Dataset and Benchmark for Retinal Vessels Branching Angle Detection
Authors:
Hao Wang,
Wenhui Zhu,
Jiayou Qin,
Xin Li,
Oana Dumitrascu,
Xiwen Chen,
Peijie Qiu,
Abolfazl Razi
Abstract:
Detecting retinal image analysis, particularly the geometrical features of branching points, plays an essential role in diagnosing eye diseases. However, existing methods used for this purpose often are coarse-level and lack fine-grained analysis for efficient annotation. To mitigate these issues, this paper proposes a novel method for detecting retinal branching angles using a self-configured ima…
▽ More
Detecting retinal image analysis, particularly the geometrical features of branching points, plays an essential role in diagnosing eye diseases. However, existing methods used for this purpose often are coarse-level and lack fine-grained analysis for efficient annotation. To mitigate these issues, this paper proposes a novel method for detecting retinal branching angles using a self-configured image processing technique. Additionally, we offer an open-source annotation tool and a benchmark dataset comprising 40 images annotated with retinal branching angles. Our methodology for retinal branching angle detection and calculation is detailed, followed by a benchmark analysis comparing our method with previous approaches. The results indicate that our method is robust under various conditions with high accuracy and efficiency, which offers a valuable instrument for ophthalmic research and clinical applications.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
AI-based Automatic Segmentation of Prostate on Multi-modality Images: A Review
Authors:
Rui Jin,
Derun Li,
Dehui Xiang,
Lei Zhang,
Hailing Zhou,
Fei Shi,
Weifang Zhu,
Jing Cai,
Tao Peng,
Xinjian Chen
Abstract:
Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The ad…
▽ More
Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The advent of precision medicine and a significant increase in clinical capacity have spurred the need for various data-driven tasks in the field of medical imaging. Recently, numerous machine learning and data mining tools have been integrated into various medical areas, including image segmentation. This article proposes a new classification method that differentiates supervision types, either in number or kind, during the training phase. Subsequently, we conducted a survey on artificial intelligence (AI)-based automatic prostate segmentation methods, examining the advantages and limitations of each. Additionally, we introduce variants of evaluation metrics for the verification and performance assessment of the segmentation method and summarize the current challenges. Finally, future research directions and development trends are discussed, reflecting the outcomes of our literature survey, suggesting high-precision detection and treatment of prostate cancer as a promising avenue.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification
Authors:
Wenhui Zhu,
Xiwen Chen,
Peijie Qiu,
Aristeidis Sotiras,
Abolfazl Razi,
Yalin Wang
Abstract:
Multiple instance learning (MIL) stands as a powerful approach in weakly supervised learning, regularly employed in histological whole slide image (WSI) classification for detecting tumorous lesions. However, existing mainstream MIL methods focus on modeling correlation between instances while overlooking the inherent diversity among instances. However, few MIL methods have aimed at diversity mode…
▽ More
Multiple instance learning (MIL) stands as a powerful approach in weakly supervised learning, regularly employed in histological whole slide image (WSI) classification for detecting tumorous lesions. However, existing mainstream MIL methods focus on modeling correlation between instances while overlooking the inherent diversity among instances. However, few MIL methods have aimed at diversity modeling, which empirically show inferior performance but with a high computational cost. To bridge this gap, we propose a novel MIL aggregation method based on diverse global representation (DGR-MIL), by modeling diversity among instances through a set of global vectors that serve as a summary of all instances. First, we turn the instance correlation into the similarity between instance embeddings and the predefined global vectors through a cross-attention mechanism. This stems from the fact that similar instance embeddings typically would result in a higher correlation with a certain global vector. Second, we propose two mechanisms to enforce the diversity among the global vectors to be more descriptive of the entire bag: (i) positive instance alignment and (ii) a novel, efficient, and theoretically guaranteed diversification learning paradigm. Specifically, the positive instance alignment module encourages the global vectors to align with the center of positive instances (e.g., instances containing tumors in WSI). To further diversify the global representations, we propose a novel diversification learning paradigm leveraging the determinantal point process. The proposed model outperforms the state-of-the-art MIL aggregation models by a substantial margin on the CAMELYON-16 and the TCGA-lung cancer datasets. The code is available at \url{https://github.com/ChongQingNoSubway/DGR-MIL}.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation
Authors:
Wenhui Zhu,
Xiwen Chen,
Peijie Qiu,
Mohammad Farazi,
Aristeidis Sotiras,
Abolfazl Razi,
Yalin Wang
Abstract:
Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important facto…
▽ More
Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important factors that potentially affect its performance: (i) irrelative feature learned caused by asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between encoder and decoder and reduce the redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to other blocks to provide additional supervision and reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architecture in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets. The code is available at \url{https://github.com/ChongQingNoSubway/SelfReg-UNet}
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
LEO Satellite Networks Assisted Geo-distributed Data Processing
Authors:
Zhiyuan Zhao,
Zhe Chen,
Zheng Lin,
Wenjun Zhu,
Kun Qiu,
Chaoqun You,
Yue Gao
Abstract:
Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective soluti…
▽ More
Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective solution for increasing edge clouds, and hence large volumes of data in edge clouds can be transferred to a core cloud via those networks for time-sensitive big data tasks processing, such as attack detection. However, the state-of-the-art satellite selection algorithms bring poor performance for those processing via our measurements. Therefore, we propose a novel data volume aware satellite selection algorithm, named DVA, to support such big data processing tasks. DVA first takes into account both the data size in edge clouds and satellite capacity to finalize the selection, thereby preventing congestion in the access network and reducing transmitting duration. Extensive simulations validate that DVA has a significantly lower average access network duration than the state-of-the-art satellite selection algorithms in a LEO satellite emulation platform.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
A novel fault localization with data refinement for hydroelectric units
Authors:
Jialong Huang,
Junlin Song,
Penglong Lian,
Mengjie Gan,
Zhiheng Su,
Benhao Wang,
Wenji Zhu,
Xiaomin Pu,
Jianxiao Zou,
Shicai Fan
Abstract:
Due to the scarcity of fault samples and the complexity of non-linear and non-smooth characteristics data in hydroelectric units, most of the traditional hydroelectric unit fault localization methods are difficult to carry out accurate localization. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)- manifold-boosted deep learni…
▽ More
Due to the scarcity of fault samples and the complexity of non-linear and non-smooth characteristics data in hydroelectric units, most of the traditional hydroelectric unit fault localization methods are difficult to carry out accurate localization. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)- manifold-boosted deep learning (SG-WMBDL) based fault localization method for hydroelectric units is proposed. To overcome the data scarcity, a SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Considering the signals involving non-linear and non-smooth characteristics, the improved WNR which combining both soft and hard thresholding and local linear embedding (LLE) are utilized to the data preprocessing module in order to reduce the noise and effectively capture the local features. In addition, to seek higher performance, the novel Adaptive Boost (AdaBoost) combined with multi deep learning is proposed to achieve accurate fault localization. The experimental results show that the SG-WMBDL can locate faults for hydroelectric units under a small number of fault samples with non-linear and non-smooth characteristics on higher precision and accuracy compared to other frontier methods, which verifies the effectiveness and practicality of the proposed method.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation
Authors:
Jun Yu,
Gongpeng Zhao,
Yongqi Wang,
Zhihong Wei,
Yang Zheng,
Zerui Zhang,
Zhongpeng Cai,
Guochen Xie,
Jichao Zhu,
Wangyuan Zhu
Abstract:
This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we…
▽ More
This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability. Our method leverages a multimodal data fusion approach, integrating pre-trained audio and video backbones for feature extraction, followed by TCN-based spatiotemporal encoding and Transformer-based temporal information capture. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the AffWild2 dataset.
△ Less
Submitted 20 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
Authors:
Jun Yu,
Wangyuan Zhu,
Jichao Zhu
Abstract:
In this paper, we present the solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of 6th Affective Behavior Analysis in-the-wild (ABAW) Competition.The EMI Estimation challenge task aims to evaluate the emotional intensity of seed videos by assessing them from a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain"…
▽ More
In this paper, we present the solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of 6th Affective Behavior Analysis in-the-wild (ABAW) Competition.The EMI Estimation challenge task aims to evaluate the emotional intensity of seed videos by assessing them from a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain", "Excitement" and "Joy"). To tackle this challenge, we extracted rich dual-channel visual features based on ResNet18 and AUs for the video modality and effective single-channel features based on Wav2Vec2.0 for the audio modality. This allowed us to obtain comprehensive emotional features for the audiovisual modality. Additionally, leveraging a late fusion strategy, we averaged the predictions of the visual and acoustic models, resulting in a more accurate estimation of audiovisual emotional mimicry intensity. Experimental results validate the effectiveness of our approach, with the average Pearson's correlation Coefficient($ρ$) across the 6 emotion dimensionson the validation set achieving 0.3288.
△ Less
Submitted 19 March, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Terahertz User-Centric Clustering in the Presence of Beam Misalignment
Authors:
Khaled Humadi,
Imene Trigui,
Wei-Ping Zhu,
Wessam Ajib
Abstract:
Beam misalignment is one of the main challenges for the design of reliable wireless systems in terahertz (THz) bands. This paper investigates how to apply user-centric base station (BS) clustering as a valuable add-on in THz networks. In particular, to reduce the impact of beam misalignment, a user-centric BS clustering design that provides multi-connectivity via BS cooperation is investigated. Th…
▽ More
Beam misalignment is one of the main challenges for the design of reliable wireless systems in terahertz (THz) bands. This paper investigates how to apply user-centric base station (BS) clustering as a valuable add-on in THz networks. In particular, to reduce the impact of beam misalignment, a user-centric BS clustering design that provides multi-connectivity via BS cooperation is investigated. The coverage probability is derived by leveraging an accurate approximation of the aggregate interference distribution that captures the effect of beam misalignment and THz fading. The numerical results reveal the impact of beam misalignment with respect to crucial link parameters, such as the transmitter's beam width and the serving cluster size, demonstrating that user-centric BS clustering is a promising enabler of THz networks.
△ Less
Submitted 18 February, 2024;
originally announced February 2024.
-
Improvising Age Verification Technologies in Canada: Technical, Regulatory and Social Dynamics
Authors:
Azfar Adib,
Wei-Ping Zhu,
M. Omair Ahmad
Abstract:
Age verification, which is a mandatory legal requirement for delivering certain age-appropriate services or products, has recently been emphasized around the globe to ensure online safety for children. The rapid advancement of artificial intelligence has facilitated the recent development of some cutting-edge age-verification technologies, particularly using biometrics. However, successful deploym…
▽ More
Age verification, which is a mandatory legal requirement for delivering certain age-appropriate services or products, has recently been emphasized around the globe to ensure online safety for children. The rapid advancement of artificial intelligence has facilitated the recent development of some cutting-edge age-verification technologies, particularly using biometrics. However, successful deployment and mass acceptance of these technologies are significantly dependent on the corresponding socio-economic and regulatory context. This paper reviews such key dynamics for improvising age-verification technologies in Canada. It is particularly essential for such technologies to be inclusive, transparent, adaptable, privacy-preserving, and secure. Effective collaboration between academia, government, and industry entities can help to meet the growing demands for age-verification services in Canada while maintaining a user-centric approach.
△ Less
Submitted 15 February, 2024;
originally announced February 2024.
-
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
Authors:
Wentao Zhu
Abstract:
Audio and video are two most common modalities in the mainstream media platforms, e.g., YouTube. To learn from multimodal videos effectively, in this work, we propose a novel audio-video recognition approach termed audio video Transformer, AVT, leveraging the effective spatio-temporal representation by the video Transformer to improve action recognition accuracy. For multimodal fusion, simply conc…
▽ More
Audio and video are two most common modalities in the mainstream media platforms, e.g., YouTube. To learn from multimodal videos effectively, in this work, we propose a novel audio-video recognition approach termed audio video Transformer, AVT, leveraging the effective spatio-temporal representation by the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources, instead we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves the accuracy by 3.8% on Epic-Kitchens-100.
△ Less
Submitted 8 January, 2024;
originally announced January 2024.
-
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Authors:
Wentao Zhu
Abstract:
In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. Particularly, MMT is composed of a nove…
▽ More
In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. Particularly, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modality fusion, we further design multimodal supervised contrastive objectives called audio-video contrastive loss (AVC) and intra-modal contrastive loss (IMC) that robustly align the two modalities. MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% on Kinetics-Sounds and VGGSound in terms of the top-1 accuracy without external training data. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets, and is about 3% more efficient based on the number of FLOPs and 9.8% more efficient based on GPU memory usage.
△ Less
Submitted 8 January, 2024;
originally announced January 2024.
-
Deformable Audio Transformer for Audio Event Detection
Authors:
Wentao Zhu
Abstract:
Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity in self-attention computation has limited the applications, especially in low-resource settings and mobile or edge devices. Existing works have proposed to exploit hand-crafted attention patterns to reduce computation complexity. However, such hand-crafted patterns are data-agnostic and may not be…
▽ More
Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity in self-attention computation has limited the applications, especially in low-resource settings and mobile or edge devices. Existing works have proposed to exploit hand-crafted attention patterns to reduce computation complexity. However, such hand-crafted patterns are data-agnostic and may not be optimal. Hence, it is likely that relevant keys or values are being reduced, while less important ones are still preserved. Based on this key insight, we propose a novel deformable audio Transformer for audio recognition, named DATAR, where a deformable attention equipping with a pyramid transformer backbone is constructed and learnable. Such an architecture has been proven effective in prediction tasks,~\textit{e.g.}, event classification. Moreover, we identify that the deformable attention map computation may over-simplify the input feature, which can be further enhanced. Hence, we introduce a learnable input adaptor to alleviate this issue, and DATAR achieves state-of-the-art performance.
△ Less
Submitted 7 January, 2024; v1 submitted 24 December, 2023;
originally announced December 2023.
-
Deep Learning for Joint Design of Pilot, Channel Feedback, and Hybrid Beamforming in FDD Massive MIMO-OFDM Systems
Authors:
Junyi Yang,
Weifeng Zhu,
Shu Sun,
Xiaofeng Li,
Xingqin Lin,
Meixia Tao
Abstract:
This letter considers the transceiver design in frequency division duplex (FDD) massive multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems for high-quality data transmission. We propose a novel deep learning based framework where the procedures of pilot design, channel feedback, and hybrid beamforming are realized by carefully crafted deep neural networ…
▽ More
This letter considers the transceiver design in frequency division duplex (FDD) massive multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems for high-quality data transmission. We propose a novel deep learning based framework where the procedures of pilot design, channel feedback, and hybrid beamforming are realized by carefully crafted deep neural networks. All the considered modules are jointly learned in an end-to-end manner, and a graph neural network is adopted to effectively capture interactions between beamformers based on the built graphical representation. Numerical results validate the effectiveness of our method.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
Long-Term Rate-Fairness-Aware Beamforming Based Massive MIMO Systems
Authors:
W. Zhu,
H. D. Tuan,
E. Dutkiewicz,
Y. Fang,
H. V. Poor,
L. Hanzo
Abstract:
This is the first treatise on multi-user (MU) beamforming designed for achieving long-term rate-fairness in fulldimensional MU massive multi-input multi-output (m-MIMO) systems. Explicitly, based on the channel covariances, which can be assumed to be known beforehand, we address this problem by optimizing the following objective functions: the users' signal-toleakage-noise ratios (SLNRs) using SLN…
▽ More
This is the first treatise on multi-user (MU) beamforming designed for achieving long-term rate-fairness in fulldimensional MU massive multi-input multi-output (m-MIMO) systems. Explicitly, based on the channel covariances, which can be assumed to be known beforehand, we address this problem by optimizing the following objective functions: the users' signal-toleakage-noise ratios (SLNRs) using SLNR max-min optimization, geometric mean of SLNRs (GM-SLNR) based optimization, and SLNR soft max-min optimization. We develop a convex-solver based algorithm, which invokes a convex subproblem of cubic time-complexity at each iteration for solving the SLNR maxmin problem. We then develop closed-form expression based algorithms of scalable complexity for the solution of the GMSLNR and of the SLNR soft max-min problem. The simulations provided confirm the users' improved-fairness ergodic rate distributions.
△ Less
Submitted 9 December, 2023;
originally announced December 2023.
-
An ADMM-Based Geometric Configuration Optimization in RSSD-Based Source Localization By UAVs with Spread Angle Constraint
Authors:
Xin Cheng,
Guangjie Han,
Jinlin Peng,
Jinfang Jiang,
Yu He,
Weiqiang Zhu,
Feng Shu,
Jiangzhou Wang
Abstract:
Deploying multiple unmanned aerial vehicles (UAVs) to locate a signal-emitting source covers a wide range of military and civilian applications like rescue and target tracking. It is well known that the UAVs-source (sensors-target) geometry, namely geometric configuration, significantly affects the final localization accuracy. This paper focuses on the geometric configuration optimization for rece…
▽ More
Deploying multiple unmanned aerial vehicles (UAVs) to locate a signal-emitting source covers a wide range of military and civilian applications like rescue and target tracking. It is well known that the UAVs-source (sensors-target) geometry, namely geometric configuration, significantly affects the final localization accuracy. This paper focuses on the geometric configuration optimization for received signal strength difference (RSSD)-based passive source localization by drone swarm. Different from prior works, this paper considers a general measuring condition where the spread angle of drone swarm centered on the source is constrained. Subject to this constraint, a geometric configuration optimization problem with the aim of maximizing the determinant of Fisher information matrix (FIM) is formulated. After transforming this problem using matrix theory, an alternating direction method of multipliers (ADMM)-based optimization framework is proposed. To solve the subproblems in this framework, two global optimal solutions based on the Von Neumann matrix trace inequality theorem and majorize-minimize (MM) algorithm are proposed respectively. Finally, the effectiveness as well as the practicality of the proposed ADMM-based optimization algorithm are demonstrated by extensive simulations.
△ Less
Submitted 17 July, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
Max-min Rate Optimization of Low-Complexity Hybrid Multi-User Beamforming Maintaining Rate-Fairness
Authors:
W. Zhu,
H. D. Tuan,
E. Dutkiewicz,
H. V. Poor,
L. Hanzo
Abstract:
A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We…
▽ More
A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We develop a convexsolver based algorithm, which iteratively invokes a convex problem of the same beamformer size for its solution. We then introduce the soft max-min rate objective function and develop a scalable algorithm for its optimization. Our simulation results demonstrate the striking fact that soft max-min rate optimization not only approaches the minimum user rate obtained by max-min rate optimization but it also achieves a sum rate similar to that of sum-rate maximization. Thus, the soft max-min rate optimization based beamforming design conceived offers a new technique of simultaneously achieving a high individual quality-of-service for all users and a high total network throughput.
△ Less
Submitted 26 October, 2023;
originally announced October 2023.
-
A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images
Authors:
Yangfan Ni,
Duo Zhang,
Gege Ma,
Lijun Lu,
Zhongke Huang,
Wentao Zhu
Abstract:
Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of i…
▽ More
Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named as multi-scale spatial transformer UNet (MS-ST-UNet), that involves the multi-scale spatial transformer network (MSSTN) and multi-scale UNet (MSUNet) modules to perform simultaneous reorientation and segmentation of LV region from nuclear cardiac images. The proposed method is trained and tested using two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features with different scales. Our experimental results demonstrate that the proposed method significantly improves the reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between reorientation and segmentation tasks, leading to cutting edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Hierarchical Beam Alignment for Millimeter-Wave Communication Systems: A Deep Learning Approach
Authors:
Junyi Yang,
Weifeng Zhu,
Meixia Tao,
Shu Sun
Abstract:
Fast and precise beam alignment is crucial for high-quality data transmission in millimeter-wave (mmWave) communication systems, where large-scale antenna arrays are utilized to overcome the severe propagation loss. To tackle the challenging problem, we propose a novel deep learning-based hierarchical beam alignment method for both multiple-input single-output (MISO) and multiple-input multiple-ou…
▽ More
Fast and precise beam alignment is crucial for high-quality data transmission in millimeter-wave (mmWave) communication systems, where large-scale antenna arrays are utilized to overcome the severe propagation loss. To tackle the challenging problem, we propose a novel deep learning-based hierarchical beam alignment method for both multiple-input single-output (MISO) and multiple-input multiple-output (MIMO) systems, which learns two tiers of probing codebooks (PCs) and uses their measurements to predict the optimal beam in a coarse-to-fine search manner. Specifically, a hierarchical beam alignment network (HBAN) is developed for MISO systems, which first performs coarse channel measurement using a tier-1 PC, then selects a tier-2 PC for fine channel measurement, and finally predicts the optimal beam based on both coarse and fine measurements. The propounded HBAN is trained in two steps: the tier-1 PC and the tier-2 PC selector are first trained jointly, followed by the joint training of all the tier-2 PCs and beam predictors. Furthermore, an HBAN for MIMO systems is proposed to directly predict the optimal beam pair without performing beam alignment individually at the transmitter and receiver. Numerical results demonstrate that the proposed HBANs are superior to the state-of-art methods in both alignment accuracy and signaling overhead reduction.
△ Less
Submitted 23 August, 2023;
originally announced August 2023.
-
Classification of lung cancer subtypes on CT images with synthetic pathological priors
Authors:
Wentao Zhu,
Yuan Jin,
Gege Ma,
Geng Chen,
Jan Egger,
Shaoting Zhang,
Dimitris N. Metaxas
Abstract:
The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns betwe…
▽ More
The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns between the same case's CT images and its pathological images, we innovatively developed a pathological feature synthetic module (PFSM), which quantitatively maps cross-modality associations through deep neural networks, to derive the "gold standard" information contained in the corresponding pathological images from CT images. Additionally, we designed a radiological feature extraction module (RFEM) to directly acquire CT image information and integrated it with the pathological priors under an effective feature fusion framework, enabling the entire classification model to generate more indicative and specific pathologically related features and eventually output more accurate predictions. The superiority of the proposed model lies in its ability to self-generate hybrid features that contain multi-modality image information based on a single-modality input. To evaluate the effectiveness, adaptability, and generalization ability of our model, we performed extensive experiments on a large-scale multi-center dataset (i.e., 829 cases from three hospitals) to compare our model and a series of state-of-the-art (SOTA) classification models. The experimental results demonstrated the superiority of our model for lung cancer subtypes classification with significant accuracy improvements in terms of accuracy (ACC), area under the curve (AUC), and F1 score.
△ Less
Submitted 8 August, 2023;
originally announced August 2023.
-
Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction
Authors:
Aoqi Guo,
Junnan Wu,
Peng Gao,
Wenbo Zhu,
Qinwen Guo,
Dazhi Gao,
Yujun Wang
Abstract:
Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input fea…
▽ More
Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input features and improve the estimation accuracy of the speech pre-separation module by avoiding information loss caused by direct dimensionality reduction in other models. Furthermore, we introduce a multi-head cross-attention mechanism that enhances the neural beamformer's perception of spatial information by making full use of the spatial information received by the array. Experimental results demonstrate that our approach, which incorporates a more reasonable target mask estimation network and a spatial information-based cross-attention mechanism into the neural beamformer, effectively improves speech separation performance.
△ Less
Submitted 28 June, 2023;
originally announced June 2023.
-
PDS-MAR: a fine-grained Projection-Domain Segmentation-based Metal Artifact Reduction method for intraoperative CBCT images with guidewires
Authors:
Tianling Lyu,
Zhan Wu,
Gege Ma,
Chen Jiang,
Xinyun Zhong,
Yan Xi,
Yang Chen,
Wentao Zhu
Abstract:
Since the invention of modern CT systems, metal artifacts have been a persistent problem. Due to increased scattering, amplified noise, and insufficient data collection, it is more difficult to suppress metal artifacts in cone-beam CT, limiting its use in human- and robot-assisted spine surgeries where metallic guidewires and screws are commonly used. In this paper, we demonstrate that conventiona…
▽ More
Since the invention of modern CT systems, metal artifacts have been a persistent problem. Due to increased scattering, amplified noise, and insufficient data collection, it is more difficult to suppress metal artifacts in cone-beam CT, limiting its use in human- and robot-assisted spine surgeries where metallic guidewires and screws are commonly used. In this paper, we demonstrate that conventional image-domain segmentation-based MAR methods are unable to eliminate metal artifacts for intraoperative CBCT images with guidewires. To solve this problem, we present a fine-grained projection-domain segmentation-based MAR method termed PDS-MAR, in which metal traces are augmented and segmented in the projection domain before being inpainted using triangular interpolation. In addition, a metal reconstruction phase is proposed to restore metal areas in the image domain. The digital phantom study and real CBCT data study demonstrate that the proposed algorithm achieves significantly better artifact suppression than other comparing methods and has the potential to advance the use of intraoperative CBCT imaging in clinical spine surgeries.
△ Less
Submitted 20 June, 2023;
originally announced June 2023.
-
nnMobileNet: Rethinking CNN for Retinopathy Research
Authors:
Wenhui Zhu,
Peijie Qiu,
Xiwen Chen,
Xin Li,
Natasha Lepore,
Oana M. Dumitrascu,
Yalin Wang
Abstract:
Over the past few decades, convolutional neural networks (CNNs) have been at the forefront of the detection and tracking of various retinal diseases (RD). Despite their success, the emergence of vision transformers (ViT) in the 2020s has shifted the trajectory of RD model development. The leading-edge performance of ViT-based models in RD can be largely credited to their scalability-their ability…
▽ More
Over the past few decades, convolutional neural networks (CNNs) have been at the forefront of the detection and tracking of various retinal diseases (RD). Despite their success, the emergence of vision transformers (ViT) in the 2020s has shifted the trajectory of RD model development. The leading-edge performance of ViT-based models in RD can be largely credited to their scalability-their ability to improve as more parameters are added. As a result, ViT-based models tend to outshine traditional CNNs in RD applications, albeit at the cost of increased data and computational demands. ViTs also differ from CNNs in their approach to processing images, working with patches rather than local regions, which can complicate the precise localization of small, variably presented lesions in RD. In our study, we revisited and updated the architecture of a CNN model, specifically MobileNet, to enhance its utility in RD diagnostics. We found that an optimized MobileNet, through selective modifications, can surpass ViT-based models in various RD benchmarks, including diabetic retinopathy grading, detection of multiple fundus diseases, and classification of diabetic macular edema. The code is available at https://github.com/Retinal-Research/NN-MOBILENET
△ Less
Submitted 15 April, 2024; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Surface EMG-Based Inter-Session/Inter-Subject Gesture Recognition by Leveraging Lightweight All-ConvNet and Transfer Learning
Authors:
Md. Rabiul Islam,
Daniel Massicotte,
Philippe Y. Massicotte,
Wei-Ping Zhu
Abstract:
Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, the data variability between inter-session and inter-subject scenarios presents a great challenge. The existing approaches employed very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate th…
▽ More
Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, the data variability between inter-session and inter-subject scenarios presents a great challenge. The existing approaches employed very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate the distribution shift caused by these inter-session and inter-subject data variability. Hence, these methods also require learning over millions of training parameters and a large pre-trained and target domain dataset in both the pre-training and adaptation stages. As a result, it makes high-end resource-bounded and computationally very expensive for deployment in real-time applications. To overcome this problem, we propose a lightweight All-ConvNet+TL model that leverages lightweight All-ConvNet and transfer learning (TL) for the enhancement of inter-session and inter-subject gesture recognition performance. The All-ConvNet+TL model consists solely of convolutional layers, a simple yet efficient framework for learning invariant and discriminative representations to address the distribution shifts caused by inter-session and inter-subject data variability. Experiments on four datasets demonstrate that our proposed methods outperform the most complex existing approaches by a large margin and achieve state-of-the-art results on inter-session and inter-subject scenarios and perform on par or competitively on intra-session gesture recognition. These performance gaps increase even more when a tiny amount (e.g., a single trial) of data is available on the target domain for adaptation. These outstanding experimental results provide evidence that the current state-of-the-art models may be overparameterized for sEMG-based inter-session and inter-subject gesture recognition tasks.
△ Less
Submitted 19 February, 2024; v1 submitted 13 May, 2023;
originally announced May 2023.
-
Cooperative Multi-Cell Massive Access with Temporally Correlated Activity
Authors:
Weifeng Zhu,
Meixia Tao,
Xiaojun Yuan,
Fan Xu,
Yunfeng Guan
Abstract:
This paper investigates the problem of activity detection and channel estimation in cooperative multi-cell massive access systems with temporally correlated activity, where all access points (APs) are connected to a central unit via fronthaul links. We propose to perform user-centric AP cooperation for computation burden alleviation and introduce a generalized sliding-window detection strategy for…
▽ More
This paper investigates the problem of activity detection and channel estimation in cooperative multi-cell massive access systems with temporally correlated activity, where all access points (APs) are connected to a central unit via fronthaul links. We propose to perform user-centric AP cooperation for computation burden alleviation and introduce a generalized sliding-window detection strategy for fully exploiting the temporal correlation in activity. By establishing the probabilistic model associated with the factor graph representation, we propose a scalable Dynamic Compressed Sensing-based Multiple Measurement Vector Generalized Approximate Message Passing (DCS-MMV-GAMP) algorithm from the perspective of Bayesian inference. Therein, the activity likelihood is refined by performing standard message passing among the activities in the spatial-temporal domain and GAMP is employed for efficient channel estimation. Furthermore, we develop two schemes of quantize-and-forward (QF) and detect-and-forward (DF) based on DCS-MMV-GAMP for the finite-fronthaul-capacity scenario, which are extensively evaluated under various system limits. Numerical results verify the significant superiority of the proposed approach over the benchmarks. Moreover, it is revealed that QF can usually realize superior performance when the antenna number is small, whereas DF shifts to be preferable with limited fronthaul capacity if the large-scale antenna arrays are equipped.
△ Less
Submitted 19 April, 2023;
originally announced April 2023.
-
Multiscale Audio Spectrogram Transformer for Efficient Audio Classification
Authors:
Wentao Zhu,
Mohamed Omar
Abstract:
Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along…
▽ More
Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency domains) in different stages, and progressively reduces the number of tokens and increases the feature dimensions. MAST significantly outperforms AST~\cite{gong2021ast} by 22.2\%, 4.4\% and 4.7\% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of the top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20\% missing audios, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs) with 42\% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals.
△ Less
Submitted 19 March, 2023;
originally announced March 2023.
-
TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge
Authors:
Yukai Ju,
Jun Chen,
Shimin Zhang,
Shulin He,
Wei Rao,
Weixin Zhu,
Yannan Wang,
Tao Yu,
Shidong Shang
Abstract:
This paper introduces the Unbeatable Team's submission to the ICASSP 2023 Deep Noise Suppression (DNS) Challenge. We expand our previous work, TEA-PSE, to its upgraded version -- TEA-PSE 3.0. Specifically, TEA-PSE 3.0 incorporates a residual LSTM after squeezed temporal convolution network (S-TCN) to enhance sequence modeling capabilities. Additionally, the local-global representation (LGR) struct…
▽ More
This paper introduces the Unbeatable Team's submission to the ICASSP 2023 Deep Noise Suppression (DNS) Challenge. We expand our previous work, TEA-PSE, to its upgraded version -- TEA-PSE 3.0. Specifically, TEA-PSE 3.0 incorporates a residual LSTM after squeezed temporal convolution network (S-TCN) to enhance sequence modeling capabilities. Additionally, the local-global representation (LGR) structure is introduced to boost speaker information extraction, and multi-STFT resolution loss is used to effectively capture the time-frequency characteristics of the speech signals. Moreover, retraining methods are employed based on the freeze training strategy to fine-tune the system. According to the official results, TEA-PSE 3.0 ranks 1st in both ICASSP 2023 DNS-Challenge track 1 and track 2.
△ Less
Submitted 14 March, 2023;
originally announced March 2023.
-
Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning
Authors:
Zhaoxi Mu,
Xinyu Yang,
Wenjing Zhu
Abstract:
Transformer has shown advanced performance in speech separation, benefiting from its ability to capture global features. However, capturing local features and channel information of audio sequences in speech separation is equally important. In this paper, we present a novel approach named Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a new network…
▽ More
Transformer has shown advanced performance in speech separation, benefiting from its ability to capture global features. However, capturing local features and channel information of audio sequences in speech separation is equally important. In this paper, we present a novel approach named Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a new network SE-Conformer that can model audio sequences in multiple dimensions and scales, and apply it to the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation to improve the separation effect by selectively utilizing information from the intermediate blocks of the separation network. Meanwhile, we propose a speaker similarity discriminative loss to optimize the speech separation model to address the problem of poor performance when speakers have similar voices. Experimental results on the benchmark datasets WSJ0-2mix and WHAM! show that ISCIT can achieve state-of-the-art results.
△ Less
Submitted 7 March, 2023;
originally announced March 2023.
-
A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments
Authors:
Zhaoxi Mu,
Xinyu Yang,
Xiangyuan Yang,
Wenjing Zhu
Abstract:
In noisy and reverberant environments, the performance of deep learning-based speech separation methods drops dramatically because previous methods are not designed and optimized for such situations. To address this issue, we propose a multi-stage end-to-end learning method that decouples the difficult speech separation problem in noisy and reverberant environments into three sub-problems: speech…
▽ More
In noisy and reverberant environments, the performance of deep learning-based speech separation methods drops dramatically because previous methods are not designed and optimized for such situations. To address this issue, we propose a multi-stage end-to-end learning method that decouples the difficult speech separation problem in noisy and reverberant environments into three sub-problems: speech denoising, separation, and de-reverberation. The probability and speed of searching for the optimal solution of the speech separation model are improved by reducing the solution space. Moreover, since the channel information of the audio sequence in the time domain is crucial for speech separation, we propose a triple-path structure capable of modeling the channel dimension of audio sequences. Experimental results show that the proposed multi-stage triple-path method can improve the performance of speech separation models at the cost of little model parameter increment.
△ Less
Submitted 7 March, 2023;
originally announced March 2023.
-
Low-Complexity Pareto-Optimal 3D Beamforming for the Full-Dimensional Multi-User Massive MIMO Downlink
Authors:
W. Zhu,
H. D. Tuan,
E. Dutkiewicz,
Y. Fang,
L. Hanzo
Abstract:
Full-dimensional (FD) multi-user massive multiple input multiple output (m-MIMO) systems employ large two-dimensional (2D) rectangular antenna arrays to control both the azimuth and elevation angles of signal transmission. We introduce the sum of two outer products of the azimuth and elevation beamforming vectors having moderate dimensions as a new class of FD beamforming. We show that this low-co…
▽ More
Full-dimensional (FD) multi-user massive multiple input multiple output (m-MIMO) systems employ large two-dimensional (2D) rectangular antenna arrays to control both the azimuth and elevation angles of signal transmission. We introduce the sum of two outer products of the azimuth and elevation beamforming vectors having moderate dimensions as a new class of FD beamforming. We show that this low-complexity class is capable of outperforming 2D beamforming relying on the single outer product of the azimuth and elevation beamforming vectors. It is also capable of performing close to its FD counterpart of massive dimensions in terms of either the users minimum rate or their geometric mean rate (GM-rate), or sum rate (SR). Furthermore, we also show that even FD beamforming may be outperformed by our outer product-based improper Gaussian signaling solution. Explicitly, our design is based on low-complexity algorithms relying on convex problems of moderate dimensions for max-min rate optimization or on closed-form expressions for GM-rate and SR maximization.
△ Less
Submitted 18 February, 2023;
originally announced February 2023.
-
OTRE: Where Optimal Transport Guided Unpaired Image-to-Image Translation Meets Regularization by Enhancing
Authors:
Wenhui Zhu,
Peijie Qiu,
Oana M. Dumitrascu,
Jacob M. Sobczak,
Mohammad Farazi,
Zhangsihao Yang,
Keshav Nandakumar,
Yalin Wang
Abstract:
Non-mydriatic retinal color fundus photography (CFP) is widely available due to the advantage of not requiring pupillary dilation, however, is prone to poor quality due to operators, systemic imperfections, or patient-related causes. Optimal retinal image quality is mandated for accurate medical diagnoses and automated analyses. Herein, we leveraged the Optimal Transport (OT) theory to propose an…
▽ More
Non-mydriatic retinal color fundus photography (CFP) is widely available due to the advantage of not requiring pupillary dilation, however, is prone to poor quality due to operators, systemic imperfections, or patient-related causes. Optimal retinal image quality is mandated for accurate medical diagnoses and automated analyses. Herein, we leveraged the Optimal Transport (OT) theory to propose an unpaired image-to-image translation scheme for mapping low-quality retinal CFPs to high-quality counterparts. Furthermore, to improve the flexibility, robustness, and applicability of our image enhancement pipeline in the clinical practice, we generalized a state-of-the-art model-based image reconstruction method, regularization by denoising, by plugging in priors learned by our OT-guided image-to-image translation network. We named it as regularization by enhancing (RE). We validated the integrated framework, OTRE, on three publicly available retinal image datasets by assessing the quality after enhancement and their performance on various downstream tasks, including diabetic retinopathy grading, vessel segmentation, and diabetic lesion segmentation. The experimental results demonstrated the superiority of our proposed framework over some state-of-the-art unsupervised competitors and a state-of-the-art supervised method.
△ Less
Submitted 8 April, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Optimal Transport Guided Unsupervised Learning for Enhancing low-quality Retinal Images
Authors:
Wenhui Zhu,
Peijie Qiu,
Mohammad Farazi,
Keshav Nandakumar,
Oana M. Dumitrascu,
Yalin Wang
Abstract:
Real-world non-mydriatic retinal fundus photography is prone to artifacts, imperfections and low-quality when certain ocular or systemic co-morbidities exist. Artifacts may result in inaccuracy or ambiguity in clinical diagnoses. In this paper, we proposed a simple but effective end-to-end framework for enhancing poor-quality retinal fundus images. Leveraging the optimal transport theory, we propo…
▽ More
Real-world non-mydriatic retinal fundus photography is prone to artifacts, imperfections and low-quality when certain ocular or systemic co-morbidities exist. Artifacts may result in inaccuracy or ambiguity in clinical diagnoses. In this paper, we proposed a simple but effective end-to-end framework for enhancing poor-quality retinal fundus images. Leveraging the optimal transport theory, we proposed an unpaired image-to-image translation scheme for transporting low-quality images to their high-quality counterparts. We theoretically proved that a Generative Adversarial Networks (GAN) model with a generator and discriminator is sufficient for this task. Furthermore, to mitigate the inconsistency of information between the low-quality images and their enhancements, an information consistency mechanism was proposed to maximally maintain structural consistency (optical discs, blood vessels, lesions) between the source and enhanced domains. Extensive experiments were conducted on the EyeQ dataset to demonstrate the superiority of our proposed method perceptually and quantitatively.
△ Less
Submitted 6 February, 2023;
originally announced February 2023.
-
In-situ monitoring additive manufacturing process with AI edge computing
Authors:
Wenkang Zhu,
Hui Li,
Yikai Zhang,
Yuqing Hou,
Liwei Chen
Abstract:
In-situ monitoring system can be used to monitor the quality of additive manufacturing (AM) processes. In the case of digital image correlation (DIC) based in-situ monitoring systems, high-speed cameras were used to capture images of high resolutions. This paper proposed a novel in-situ monitoring system to accelerate the process of digital images using artificial intelligence (AI) edge computing…
▽ More
In-situ monitoring system can be used to monitor the quality of additive manufacturing (AM) processes. In the case of digital image correlation (DIC) based in-situ monitoring systems, high-speed cameras were used to capture images of high resolutions. This paper proposed a novel in-situ monitoring system to accelerate the process of digital images using artificial intelligence (AI) edge computing board. It built a visual transformer based video super resolution (ViTSR) network to reconstruct high resolution (HR) videos frames. Fully convolutional network (FCN) was used to simultaneously extract the geometric characteristics of molten pool and plasma arc during the AM processes. Compared with 6 state-of-the-art super resolution methods, ViTSR ranks first in terms of peak signal to noise ratio (PSNR). The PSNR of ViTSR for 4x super resolution reached 38.16 dB on test data with input size of 75 pixels x 75 pixels. Inference time of ViTSR and FCN was optimized to 50.97 ms and 67.86 ms on AI edge board after operator fusion and model pruning. The total inference time of the proposed system was 118.83 ms, which meets the requirement of real-time quality monitoring with low cost in-situ monitoring equipment during AM processes. The proposed system achieved an accuracy of 96.34% on the multi-objects extraction task and can be applied to different AM processes.
△ Less
Submitted 2 January, 2023;
originally announced January 2023.
-
An Adapter based Multi-label Pre-training for Speech Separation and Enhancement
Authors:
Tianrui Wang,
Xie Chen,
Zhuo Chen,
Shu Yu,
Weibin Zhu
Abstract:
In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the improvements from SSL representation in speech separation (SS) and enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improv…
▽ More
In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the improvements from SSL representation in speech separation (SS) and enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improving the SSL model for SS and SE. We first update HuBERT's masked speech prediction (MSP) objective by integrating the separation and denoising terms, resulting in a multiple pseudo label pre-training scheme, which significantly improves HuBERT's performance on SS and SE but degrades the performance on ASR. To maintain its performance gain on ASR, we further propose an adapter-based architecture for HuBERT's Transformer encoder, where only a few parameters of each layer are adjusted to the multiple pseudo label MSP while other parameters remain frozen as default HuBERT. Experimental results show that our proposed adapter-based multiple pseudo label HuBERT yield consistent and significant performance improvements on SE, SS, and ASR tasks, with a faster pre-training speed, at only marginal parameters increase.
△ Less
Submitted 11 November, 2022;
originally announced November 2022.
-
A Self-Supervised Approach to Reconstruction in Sparse X-Ray Computed Tomography
Authors:
Rey Mendoza,
Minh Nguyen,
Judith Weng Zhu,
Vincent Dumont,
Talita Perciano,
Juliane Mueller,
Vidya Ganapati
Abstract:
Computed tomography has propelled scientific advances in fields from biology to materials science. This technology allows for the elucidation of 3-dimensional internal structure by the attenuation of x-rays through an object at different rotations relative to the beam. By imaging 2-dimensional projections, a 3-dimensional object can be reconstructed through a computational algorithm. Imaging at a…
▽ More
Computed tomography has propelled scientific advances in fields from biology to materials science. This technology allows for the elucidation of 3-dimensional internal structure by the attenuation of x-rays through an object at different rotations relative to the beam. By imaging 2-dimensional projections, a 3-dimensional object can be reconstructed through a computational algorithm. Imaging at a greater number of rotation angles allows for improved reconstruction. However, taking more measurements increases the x-ray dose and may cause sample damage. Deep neural networks have been used to transform sparse 2-D projection measurements to a 3-D reconstruction by training on a dataset of known similar objects. However, obtaining high-quality object reconstructions for the training dataset requires high x-ray dose measurements that can destroy or alter the specimen before imaging is complete. This becomes a chicken-and-egg problem: high-quality reconstructions cannot be generated without deep learning, and the deep neural network cannot be learned without the reconstructions. This work develops and validates a self-supervised probabilistic deep learning technique, the physics-informed variational autoencoder, to solve this problem. A dataset consisting solely of sparse projection measurements from each object is used to jointly reconstruct all objects of the set. This approach has the potential to allow visualization of fragile samples with x-ray computed tomography. We release our code for reproducing our results at: https://github.com/vganapati/CT_PVAE .
△ Less
Submitted 29 October, 2022;
originally announced November 2022.
-
Message Passing-Based Joint User Activity Detection and Channel Estimation for Temporally-Correlated Massive Access
Authors:
Weifeng Zhu,
Meixia Tao,
Xiaojun Yuan,
Yunfeng Guan
Abstract:
This paper studies the user activity detection and channel estimation problem in a temporally-correlated massive access system where a very large number of users communicate with a base station sporadically and each user once activated can transmit with a large probability over multiple consecutive frames. We formulate the problem as a dynamic compressed sensing (DCS) problem to exploit both the s…
▽ More
This paper studies the user activity detection and channel estimation problem in a temporally-correlated massive access system where a very large number of users communicate with a base station sporadically and each user once activated can transmit with a large probability over multiple consecutive frames. We formulate the problem as a dynamic compressed sensing (DCS) problem to exploit both the sparsity and the temporal correlation of user activity. By leveraging the hybrid generalized approximate message passing (HyGAMP) framework, we design a computationally efficient algorithm, HyGAMP-DCS, to solve this problem. In contrast to only exploit the historical estimations, the proposed algorithm performs bidirectional message passing between the neighboring frames for activity likelihood update to fully exploit the temporally-correlated user activities. Furthermore, we develop an expectation maximization HyGAMP-DCS (EM-HyGAMP-DCS) algorithm to adaptively learn the hyperparameters during the estimation procedure when the system statistics are unknown. In particular, we propose to utilize the analysis tool of state evolution to find the appropriate hyperparameter initialization of EM-HyGAMP-DCS. Simulation results demonstrate that our proposed algorithms can significantly improve the user activity detection accuracy and reduce the channel estimation error.
△ Less
Submitted 26 January, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Speech Dereverberation with a Reverberation Time Shortening Target
Authors:
Rui Zhou,
Wenye Zhu,
Xiaofei Li
Abstract:
This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or optionally with some early reflections. This type of target suddenly truncates the reverberation, and thus it may not be suitable for network training. The proposed RTS target suppresses reverberation a…
▽ More
This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or optionally with some early reflections. This type of target suddenly truncates the reverberation, and thus it may not be suitable for network training. The proposed RTS target suppresses reverberation and meanwhile maintains the exponential decaying property of reverberation, which will ease the network training, and thus reduce signal distortion caused by the prediction error. Moreover, this work experimentally study to adapt our previously proposed FullSubNet speech denoising network to speech dereverberation. Experiments show that RTS is a more suitable learning target than direct-path speech and early reflections, in terms of better suppressing reverberation and signal distortion. FullSubNet is able to achieve outstanding dereverberation performance.
△ Less
Submitted 5 June, 2023; v1 submitted 20 October, 2022;
originally announced October 2022.
-
spatial-dccrn: dccrn equipped with frame-level angle feature and hybrid filtering for multi-channel speech enhancement
Authors:
Shubo Lv,
Yihui Fu,
Yukai Jv,
Lei Xie,
Weixin Zhu,
Wei Rao,
Yannan Wang
Abstract:
Recently, multi-channel speech enhancement has drawn much interest due to the use of spatial information to distinguish target speech from interfering signal. To make full use of spatial information and neural network based masking estimation, we propose a multi-channel denoising neural network -- Spatial DCCRN. Firstly, we extend S-DCCRN to multi-channel scenario, aiming at performing cascaded su…
▽ More
Recently, multi-channel speech enhancement has drawn much interest due to the use of spatial information to distinguish target speech from interfering signal. To make full use of spatial information and neural network based masking estimation, we propose a multi-channel denoising neural network -- Spatial DCCRN. Firstly, we extend S-DCCRN to multi-channel scenario, aiming at performing cascaded sub-channel and full-channel processing strategy, which can model different channels separately. Moreover, instead of only adopting multi-channel spectrum or concatenating first-channel's magnitude and IPD as the model's inputs, we apply an angle feature extraction module (AFE) to extract frame-level angle feature embeddings, which can help the model to apparently perceive spatial information. Finally, since the phenomenon of residual noise will be more serious when the noise and speech exist in the same time frequency (TF) bin, we particularly design a masking and mapping filtering method to substitute the traditional filter-and-sum operation, with the purpose of cascading coarsely denoising, dereverberation and residual noise suppression. The proposed model, Spatial-DCCRN, has surpassed EaBNet, FasNet as well as several competitive models on the L3DAS22 Challenge dataset. Not only the 3D scenario, Spatial-DCCRN outperforms state-of-the-art (SOTA) model MIMO-UNet by a large margin in multiple evaluation metrics on the multi-channel ConferencingSpeech2021 Challenge dataset. Ablation studies also demonstrate the effectiveness of different contributions.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
Self-Supervised Equivariant Regularization Reconciles Multiple Instance Learning: Joint Referable Diabetic Retinopathy Classification and Lesion Segmentation
Authors:
Wenhui Zhu,
Peijie Qiu,
Natasha Lepore,
Oana M. Dumitrascu,
Yalin Wang
Abstract:
Lesion appearance is a crucial clue for medical providers to distinguish referable diabetic retinopathy (rDR) from non-referable DR. Most existing large-scale DR datasets contain only image-level labels rather than pixel-based annotations. This motivates us to develop algorithms to classify rDR and segment lesions via image-level labels. This paper leverages self-supervised equivariant learning an…
▽ More
Lesion appearance is a crucial clue for medical providers to distinguish referable diabetic retinopathy (rDR) from non-referable DR. Most existing large-scale DR datasets contain only image-level labels rather than pixel-based annotations. This motivates us to develop algorithms to classify rDR and segment lesions via image-level labels. This paper leverages self-supervised equivariant learning and attention-based multi-instance learning (MIL) to tackle this problem. MIL is an effective strategy to differentiate positive and negative instances, helping us discard background regions (negative instances) while localizing lesion regions (positive ones). However, MIL only provides coarse lesion localization and cannot distinguish lesions located across adjacent patches. Conversely, a self-supervised equivariant attention mechanism (SEAM) generates a segmentation-level class activation map (CAM) that can guide patch extraction of lesions more accurately. Our work aims at integrating both methods to improve rDR classification accuracy. We conduct extensive validation experiments on the Eyepacs dataset, achieving an area under the receiver operating characteristic curve (AU ROC) of 0.958, outperforming current state-of-the-art algorithms.
△ Less
Submitted 12 October, 2022;
originally announced October 2022.
-
Deep Learning for Hierarchical Beam Alignment in mmWave Communication Systems
Authors:
Junyi Yang,
Weifeng Zhu,
Meixia Tao
Abstract:
Fast and precise beam alignment is crucial to support high-quality data transmission in millimeter wave (mmWave) communication systems. In this work, we propose a novel deep learning based hierarchical beam alignment method that learns two tiers of probing codebooks (PCs) and uses their measurements to predict the optimal beam in a coarse-to-fine searching manner. Specifically, the proposed method…
▽ More
Fast and precise beam alignment is crucial to support high-quality data transmission in millimeter wave (mmWave) communication systems. In this work, we propose a novel deep learning based hierarchical beam alignment method that learns two tiers of probing codebooks (PCs) and uses their measurements to predict the optimal beam in a coarse-to-fine searching manner. Specifically, the proposed method first performs coarse channel measurement using the tier-1 PC, then selects a tier-2 PC for fine channel measurement, and finally predicts the optimal beam based on both coarse and fine measurements. The proposed deep neural network (DNN) architecture is trained in two steps. First, the tier-1 PC and the tier-2 PC selector are trained jointly. After that, all the tier-2 PCs together with the optimal beam predictors are trained jointly. The learned hierarchical PCs can capture the features of propagation environment. Numerical results based on realistic ray-tracing datasets demonstrate that the proposed method is superior to the state-of-art beam alignment methods in both alignment accuracy and sweeping overhead.
△ Less
Submitted 8 September, 2022;
originally announced September 2022.
-
A No-Reference Deep Learning Quality Assessment Method for Super-resolution Images Based on Frequency Maps
Authors:
Zicheng Zhang,
Wei Sun,
Xiongkuo Min,
Wenhan Zhu,
Tao Wang,
Wei Lu,
Guangtao Zhai
Abstract:
To support the application scenarios where high-resolution (HR) images are urgently needed, various single image super-resolution (SISR) algorithms are developed. However, SISR is an ill-posed inverse problem, which may bring artifacts like texture shift, blur, etc. to the reconstructed images, thus it is necessary to evaluate the quality of super-resolution images (SRIs). Note that most existing…
▽ More
To support the application scenarios where high-resolution (HR) images are urgently needed, various single image super-resolution (SISR) algorithms are developed. However, SISR is an ill-posed inverse problem, which may bring artifacts like texture shift, blur, etc. to the reconstructed images, thus it is necessary to evaluate the quality of super-resolution images (SRIs). Note that most existing image quality assessment (IQA) methods were developed for synthetically distorted images, which may not work for SRIs since their distortions are more diverse and complicated. Therefore, in this paper, we propose a no-reference deep-learning image quality assessment method based on frequency maps because the artifacts caused by SISR algorithms are quite sensitive to frequency information. Specifically, we first obtain the high-frequency map (HM) and low-frequency map (LM) of SRI by using Sobel operator and piecewise smooth image approximation. Then, a two-stream network is employed to extract the quality-aware features of both frequency maps. Finally, the features are regressed into a single quality value using fully connected layers. The experimental results show that our method outperforms all compared IQA models on the selected three super-resolution quality assessment (SRQA) databases.
△ Less
Submitted 9 June, 2022;
originally announced June 2022.
-
Double-Sided Information Aided Temporal-Correlated Massive Access
Authors:
Weifeng Zhu,
Meixia Tao,
Yunfeng Guan
Abstract:
This letter considers temporal-correlated massive access, where each device, once activated, is likely to transmit continuously over several consecutive frames. Motivated by that the device activity at each frame is correlated to not only its previous frame but also its next frame, we propose a double-sided information (DSI) aided joint activity detection and channel estimation algorithm based on…
▽ More
This letter considers temporal-correlated massive access, where each device, once activated, is likely to transmit continuously over several consecutive frames. Motivated by that the device activity at each frame is correlated to not only its previous frame but also its next frame, we propose a double-sided information (DSI) aided joint activity detection and channel estimation algorithm based on the approximate message passing (AMP) framework. The DSI is extracted from the estimation results in a sliding window that contains the target detection frame and its previous and next frames. The proposed algorithm demonstrates superior performance over the state-of-the-art methods.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Speech Dereverberation with A Reverberation Time Shortening Target
Authors:
Rui Zhou,
Wenye Zhu,
Xiaofei Li
Abstract:
This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or optionally with some early reflections. This type of target suddenly truncates the reverberation, and thus it may not be suitable for network training. The proposed RTS target suppresses reverberation a…
▽ More
This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or optionally with some early reflections. This type of target suddenly truncates the reverberation, and thus it may not be suitable for network training. The proposed RTS target suppresses reverberation and meanwhile maintains the exponential decaying property of reverberation, which will ease the network training, and thus reduce signal distortion caused by the prediction error. Moreover, this work experimentally study to adapt our previously proposed FullSubNet speech denoising network to speech dereverberation. Experiments show that RTS is a more suitable learning target than direct-path speech and early reflections, in terms of better suppressing reverberation and signal distortion. FullSubNet is able to achieve outstanding dereverberation performance.
△ Less
Submitted 20 November, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Speech Emotion Recognition with Global-Aware Fusion on Multi-scale Feature Representation
Authors:
Wenjing Zhu,
Xiang Li
Abstract:
Speech Emotion Recognition (SER) is a fundamental task to predict the emotion label from speech data. Recent works mostly focus on using convolutional neural networks~(CNNs) to learn local attention map on fixed-scale feature representation by viewing time-varied spectral features as images. However, rich emotional feature at different scales and important global information are not able to be wel…
▽ More
Speech Emotion Recognition (SER) is a fundamental task to predict the emotion label from speech data. Recent works mostly focus on using convolutional neural networks~(CNNs) to learn local attention map on fixed-scale feature representation by viewing time-varied spectral features as images. However, rich emotional feature at different scales and important global information are not able to be well captured due to the limits of existing CNNs for SER. In this paper, we propose a novel GLobal-Aware Multi-scale (GLAM) neural network (The code is available at https://github.com/lixiangucas01/GLAM) to learn multi-scale feature representation with global-aware fusion module to attend emotional information. Specifically, GLAM iteratively utilizes multiple convolutional kernels with different scales to learn multiple feature representation. Then, instead of using attention-based methods, a simple but effective global-aware fusion module is applied to grab most important emotional information globally. Experiments on the benchmark corpus IEMOCAP over four emotions demonstrates the superiority of our proposed model with 2.5% to 4.5% improvements on four common metrics compared to previous state-of-the-art approaches.
△ Less
Submitted 12 April, 2022;
originally announced April 2022.
-
Multiple Confidence Gates For Joint Training Of SE And ASR
Authors:
Tianrui Wang,
Weibin Zhu,
Yingying Gao,
Junlan Feng,
Shilei Zhang
Abstract:
Joint training of speech enhancement model (SE) and speech recognition model (ASR) is a common solution for robust ASR in noisy environments. SE focuses on improving the auditory quality of speech, but the enhanced feature distribution is changed, which is uncertain and detrimental to the ASR. To tackle this challenge, an approach with multiple confidence gates for jointly training of SE and ASR i…
▽ More
Joint training of speech enhancement model (SE) and speech recognition model (ASR) is a common solution for robust ASR in noisy environments. SE focuses on improving the auditory quality of speech, but the enhanced feature distribution is changed, which is uncertain and detrimental to the ASR. To tackle this challenge, an approach with multiple confidence gates for jointly training of SE and ASR is proposed. A speech confidence gates prediction module is designed to replace the former SE module in joint training. The noisy speech is filtered by gates to obtain features that are easier to be fitting by the ASR network. The experimental results show that the proposed method has better performance than the traditional robust speech recognition system on test sets of clean speech, synthesized noisy speech, and real noisy speech.
△ Less
Submitted 1 April, 2022;
originally announced April 2022.
-
Intelligent resonance tracking of a microwave plasmonic resonator for compact wireless sensors
Authors:
Xuanru Zhang,
Jia Wen Zhu,
Tie Jun Cui
Abstract:
Plasmonic sensing has been in the spotlight for decades, the concept and applications of which have been generalized to spoof surface plasmons (SSPs) in the microwave band. Here, we report a compact and wireless sensor within a printed circuit board size of 18 mm * 12 mm, tracking the resonance frequency shift of a microwave plasmonic resonator via a software-defined scheme. The microwave plasmoni…
▽ More
Plasmonic sensing has been in the spotlight for decades, the concept and applications of which have been generalized to spoof surface plasmons (SSPs) in the microwave band. Here, we report a compact and wireless sensor within a printed circuit board size of 18 mm * 12 mm, tracking the resonance frequency shift of a microwave plasmonic resonator via a software-defined scheme. The microwave plasmonic resonator yields a deep-subwavelength size, enhanced sensitivity, and a good electromagnetic compatibility performance. The software-defined resonance tracking scheme minimalizes the hardware circuit and the consumed spectrum resources, and makes the detection intelligently adaptive to the target resonance, with a signal-to-noise ratio of 69 dB and a data rate of 2272 measuring points per second. The sensor has been validated via acetone vapor concentration sensing, while its applications can be widely extended by replacing the transducer materials. This approach provides compact, sensitive, accurate and intelligent solutions for resonant sensors in the Internet of things (IoT).
△ Less
Submitted 2 March, 2022;
originally announced March 2022.