-
A Comprehensive Survey on EEG-Based Emotion Recognition: A Graph-Based Perspective
Authors:
Chenyu Liu,
Xinliang Zhou,
Yihao Wu,
Yi Ding,
Liming Zhai,
Kun Wang,
Ziyu Jia,
Yang Liu
Abstract:
Compared to other modalities, electroencephalogram (EEG) based emotion recognition can intuitively reflect emotional patterns in the human brain and has therefore become one of the most actively studied tasks in affective computing. Emotions are, by nature, physiological and psychological state changes in response to brain region connectivity, so emotion recognition focuses more on the dependency between brain regions than on specific brain regions. A significant trend is the application of graphs to encapsulate such dependency as dynamic functional connections between nodes across temporal and spatial dimensions. Concurrently, the neuroscientific underpinnings behind this dependency endow the application of graphs in this field with a distinctive significance. However, there is neither a comprehensive review nor a tutorial for constructing emotion-relevant graphs in EEG-based emotion recognition. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of graph-related methods in this field from a methodological perspective. We propose a unified framework for graph applications in this field and categorize these methods on this basis. Finally, based on previous studies, we also present several open challenges and future directions in this field.
Submitted 12 August, 2024;
originally announced August 2024.
-
Beyond the Eye: A Relational Model for Early Dementia Detection Using Retinal OCTA Images
Authors:
Shouyue Liu,
Jinkui Hao,
Yonghuai Liu,
Huazhu Fu,
Xinyu Guo,
Shuting Zhang,
Yitian Zhao
Abstract:
Early detection of dementia, such as Alzheimer's disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryological origins and physiological characteristics of the retina and brain, retinal imaging is emerging as a potentially rapid and cost-effective alternative for the identification of individuals with or at high risk of AD. In this paper, we present a novel method, PolarNet+, that uses retinal optical coherence tomography angiography (OCTA) to discriminate early-onset AD (EOAD) and MCI subjects from controls. Our method first maps OCTA images from Cartesian coordinates to polar coordinates, allowing approximate sub-region calculation to implement the clinician-friendly Early Treatment Diabetic Retinopathy Study (ETDRS) grid analysis. We then introduce a multi-view module to serialize and analyze the images along three dimensions for comprehensive, clinically useful information extraction. Finally, we abstract the sequence embedding into a graph, transforming the detection task into a general graph classification problem. A regional relationship module is applied after the multi-view module to explore the relationships between the sub-regions. Such regional relationship analyses validate known eye-brain links and reveal new discriminative patterns.
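As a rough illustration of the Cartesian-to-polar mapping mentioned in this abstract, the sketch below resamples a small square image onto a (radius, angle) grid using nearest-neighbour lookup. The function name `to_polar`, the grid sizes, and the sampling rule are assumptions made for illustration only; they are not taken from PolarNet+ and may differ from its actual preprocessing.

```python
import math

def to_polar(image, n_radii, n_angles):
    """Resample a square 2D list onto a (radius, angle) grid.

    Hypothetical helper: nearest-neighbour sampling around the image centre.
    Row i of the result is a ring of radius r_i; column j is the angle theta_j.
    """
    size = len(image)
    cx = cy = (size - 1) / 2.0      # image centre in pixel coordinates
    max_r = (size - 1) / 2.0        # largest radius that stays inside the image
    polar = []
    for i in range(n_radii):
        r = max_r * i / max(n_radii - 1, 1)
        row = []
        for j in range(n_angles):
            theta = 2.0 * math.pi * j / n_angles
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            row.append(image[y][x])
        polar.append(row)
    return polar

# Toy 5x5 image whose pixel value encodes its position as 10 * row + column.
image = [[10 * y + x for x in range(5)] for y in range(5)]
polar = to_polar(image, n_radii=3, n_angles=4)
```

In such a layout, concentric rings and angular sectors of the original image become rectangular blocks of rows and columns, which is one plausible reason a polar representation is convenient for grid-style sub-region analysis.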
Submitted 9 August, 2024;
originally announced August 2024.
-
Secure Transmission for Movable Antennas Empowered Cell-Free Symbiotic Radio Communications
Authors:
Jiayu Guan,
Bin Lyu,
Yan Liu,
Feng Tian
Abstract:
In this paper, a novel movable antenna (MA) empowered secure transmission scheme is designed for cell-free symbiotic radio (SR) systems in the presence of an eavesdropper (Eve). Specifically, multiple distributed access points (APs) equipped with MAs collaboratively transmit confidential information to the primary user (PU), while the backscatter device (BD) transmits its own information to the secondary user (SU) by reflecting incident signals from the APs. The MAs deployed at the APs can adjust their positions flexibly to improve channel conditions between the APs and the PU/SU/BD and to suppress eavesdropping by Eve on the confidential information intended for the PU. Under this setup, we maximize the secrecy rate of the primary transmission by jointly optimizing the APs' transmit beamforming vectors and the positions of the MAs, while adhering to the quality of service constraints at the SU. To address the challenges caused by the non-convexity and find a near-optimal solution, an alternating optimization (AO) framework is proposed, utilizing the successive convex approximation method, the semi-definite relaxation technique, and a genetic algorithm modified particle swarm optimization (GA-PSO) algorithm. Numerical results demonstrate the secrecy rate enhancement provided by the MAs and show the impact of the GA-PSO algorithm on solution accuracy.
Submitted 8 August, 2024;
originally announced August 2024.
-
Near-Field Sensing Enabled Predictive Beamforming: From Estimation to Tracking
Authors:
Hao Jiang,
Zhaolin Wang,
Yuanwei Liu
Abstract:
A near-field sensing (NISE) enabled predictive beamforming framework is proposed to facilitate wireless communications with high-mobility channels. Unlike conventional far-field sensing, which only captures the angle and the radial velocity of the user, NISE enables the estimation of the full motion state, including additional distance and transverse velocity information. Two full-motion state sensing approaches are proposed based on the concepts of estimation and tracking, respectively. 1) AGD-AO approach: The full motion state of the user is estimated within a single coherent processing interval (CPI). In particular, gradient descent is adopted to estimate the transverse and radial velocities of the user based on the maximum likelihood criterion, while the distance and the angle are calculated by the kinematic model. In this process, moment estimations are leveraged to adaptively tune the step size, thereby leading to a smoother and faster gradient descent. 2) EKF approach: The full motion state of the user is tracked across multiple CPIs. Based on the noisy measurements in multiple CPIs, the EKF iteratively predicts and updates the current motion state to achieve a low tracking error. Based on the obtained full motion state, beam prediction and Doppler frequency compensation can be carried out with minimum pilot overhead. Numerical results are provided to validate the effectiveness and efficiency of the proposed approaches compared to the conventional far-field predictive beamforming and feedback-based approaches. It is also revealed that: 1) the proposed AGD-AO can achieve stable descent with small gradients, thereby accelerating convergence; 2) compared to far-field predictive beamforming and feedback-based schemes, both of the proposed methods exhibit superior performance; and 3) by incorporating multiple CPIs, the EKF method exhibits greater robustness in low SNR regions.
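The predict/update cycle that an extended Kalman filter (EKF) iterates can be sketched on a toy problem. The example below is only a generic illustration under stated assumptions: the state is a hypothetical 1D `[position, velocity]` pair with constant-velocity dynamics, and the single measurement is the range to a sensor offset by `h0` from the track. The function names, the measurement model, and the noise parameters `q` and `r` are all invented for this sketch and are not the near-field motion model used in the paper.

```python
import math

def ekf_predict(x, P, dt, q):
    """Propagate state and covariance one step (constant-velocity model).

    Assumed dynamics: F = [[1, dt], [0, 1]], process noise Q = q * I.
    """
    p, v = x
    (a, b), (c, d) = P
    x_pred = [p + dt * v, v]
    # P_pred = F P F^T + Q, written out for the 2x2 case.
    P_pred = [
        [a + dt * (b + c) + dt * dt * d + q, b + dt * d],
        [c + dt * d, d + q],
    ]
    return x_pred, P_pred

def ekf_update(x, P, z, h0, r):
    """Correct the prediction with one nonlinear range measurement z.

    Assumed measurement: h(x) = sqrt(p^2 + h0^2), with h0 > 0 and noise variance r.
    """
    p, v = x
    rng = math.sqrt(p * p + h0 * h0)   # predicted measurement h(x)
    hp = p / rng                       # Jacobian entry dh/dp (dh/dv is 0)
    s = hp * P[0][0] * hp + r          # innovation variance
    k0 = P[0][0] * hp / s              # Kalman gain, position component
    k1 = P[1][0] * hp / s              # Kalman gain, velocity component
    y = z - rng                        # innovation
    x_new = [p + k0 * y, v + k1 * y]
    # P_new = (I - K H) P with H = [hp, 0].
    P_new = [
        [(1 - k0 * hp) * P[0][0], (1 - k0 * hp) * P[0][1]],
        [P[1][0] - k1 * hp * P[0][0], P[1][1] - k1 * hp * P[0][1]],
    ]
    return x_new, P_new

# One cycle: predict half a second ahead, then fold in a range reading.
x, P = ekf_predict([0.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], dt=0.5, q=0.0)
x, P = ekf_update(x, P, z=4.2, h0=4.0, r=0.1)
```

The nonlinearity enters only through the Jacobian `hp`, which linearizes the range measurement around the current estimate; with a linear measurement the same two functions reduce to an ordinary Kalman filter.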
Submitted 4 August, 2024;
originally announced August 2024.
-
Multibeam Hybrid Transmitarray Based on Polarization Rotating Metasurface With Reconfigurable Bidirectional Radiation
Authors:
Fan Qin,
Yifei Liu,
Chao Gu,
Linfeng Zeng,
Wenchi Cheng,
Hailin Zhang,
Steven Gao
Abstract:
This paper proposes a bidirectional multibeam hybrid transmitarray (HTA) employing a transmission polarization-rotating metasurface (TPRM). A novel configuration is introduced to facilitate bidirectional beam scanning by combining the transmitarray (TA) and folded-transmitarray (FTA). To accomplish the reconfiguration of both unidirectional and bidirectional radiation states in the +z, -z, and +/-z directions, a polarization switchable multi-feed array (MFA) is placed at the focal plane between the TA and FTA, radiating x-polarization, y-polarization, and 45-degree oblique polarization waves, respectively. Meanwhile, the proposed antenna can achieve multibeam radiation in the three aforementioned states by switching the polarization of the MFA. To demonstrate the operating principle, a prototype has been designed, simulated, and fabricated. The measured results agree well with the simulated results. The simulated and measured results indicate that the proposed design can generate reconfigurable multibeam in both forward and backward directions, either separately or simultaneously. In the unidirectional states, forward and backward beam scanning is achieved within an angular range of +/-30° and +/-22°, respectively, with peak gains of 23.6 dBi and 23.1 dBi. A simultaneous forward and backward beam scanning of +/-40° and +/-22° is achieved in the hybrid radiation state, with peak gains of 19.4 dBi and 19.3 dBi, respectively. The proposed antenna array design offers several advantages, including bidirectional low-loss beam scanning, a simple structure, low power consumption, and a low profile.
Submitted 2 August, 2024;
originally announced August 2024.
-
Augmenting Channel Simulator and Semi-Supervised Learning for Efficient Indoor Positioning
Authors:
Yupeng Li,
Xinyu Ning,
Shijian Gao,
Yitong Liu,
Zhi Sun,
Qixing Wang,
Jiangzhou Wang
Abstract:
This work aims to tackle the labor-intensive and resource-consuming task of indoor positioning by proposing an efficient approach. The proposed approach involves the introduction of a semi-supervised learning (SSL) with a biased teacher (SSLB) algorithm, which effectively utilizes both labeled and unlabeled channel data. To reduce measurement expenses, unlabeled data is generated using an updated channel simulator (UCHS), and then weighted by adaptive confidence values to simplify the tuning of hyperparameters. Simulation results demonstrate that the proposed strategy achieves superior performance while minimizing measurement overhead and training expense compared to existing benchmarks, offering a valuable and practical solution for indoor positioning.
Submitted 1 August, 2024;
originally announced August 2024.
-
Joint Vehicle Connection and Beamforming Optimization in Digital Twin Assisted Integrated Sensing and Communication Vehicular Networks
Authors:
Weihang Ding,
Zhaohui Yang,
Mingzhe Chen,
Yuchen Liu,
Mohammad Shikh-Bahaei
Abstract:
This paper introduces an approach to harness digital twin (DT) technology for integrated sensing and communications (ISAC) in sixth-generation (6G) Internet-of-everything (IoE) applications. We consider moving targets in a vehicular network and use the DT to track and predict the motion of the vehicles. After predicting the location of each vehicle at the next time slot, the DT designs the assignment and beamforming for each vehicle. The real-time sensing information is then utilized to update and refine the DT, enabling further processing and decision-making. This model incorporates a dynamic Kalman gain, which is updated at each time slot based on the received echo signals. The state representation encompasses both vehicle motion information and the error matrix, with the posterior Cramér-Rao bound (PCRB) employed to assess sensing accuracy. We consider a network with two roadside units (RSUs), and each vehicle needs to be allocated to one of them. To optimize the overall transmission rate while maintaining an acceptable sensing accuracy, an optimization problem is formulated. Since the original problem is generally hard to solve, Lagrange multipliers and fractional programming are employed to simplify it. To solve the simplified problem, this paper introduces both greedy and heuristic algorithms that optimize both vehicle assignments and predictive beamforming. The optimized results are then transferred back to the real space for ISAC applications. Recognizing the computational complexity of the greedy and heuristic algorithms, a bidirectional long short-term memory (LSTM)-based recurrent neural network (RNN) is proposed for efficient beamforming design within the DT. Simulation results demonstrate the effectiveness of the DT-based ISAC network.
Submitted 31 July, 2024;
originally announced August 2024.
-
An Efficient Convex-Hull Relaxation Based Algorithm for Multi-User Discrete Passive Beamforming
Authors:
Wenhai Lai,
Zheyu Wu,
Yi Feng,
Kaiming Shen,
Ya-Feng Liu
Abstract:
Intelligent reflecting surface (IRS) is an emerging technology to enhance spatial multiplexing in wireless networks. This letter considers the discrete passive beamforming design for IRS in order to maximize the minimum signal-to-interference-plus-noise ratio (SINR) among multiple users in an IRS-assisted downlink network. The main design difficulty lies in the discrete phase-shift constraint. Differing from most existing works, this letter advocates a convex-hull relaxation of the discrete constraints which leads to a continuous reformulated problem equivalent to the original discrete problem. This letter further proposes an efficient alternating projection/proximal gradient descent and ascent algorithm for solving the reformulated problem. Simulation results show that the proposed algorithm outperforms the state-of-the-art methods significantly.
Submitted 30 July, 2024;
originally announced July 2024.
-
Suppressing Beam Squint Effect For Near-Field Wideband Communication Through Movable Antennas
Authors:
Yanze Zhu,
Qingqing Wu,
Yang Liu,
Qingjiang Shi,
Wen Chen
Abstract:
In this correspondence, we study deploying a movable antenna (MA) array in a wideband multiple-input-single-output (MISO) communication system, where a near-field (NF) channel model is considered. To alleviate the beam squint effect, we propose to maximize the minimum analog beamforming gain across the entire wideband spectrum by appropriately adjusting the MAs' positions, which is a highly challenging task. By introducing a slack variable and adopting the cutting-edge smoothed-gradient-descent-ascent (SGDA) method, we develop algorithms to resolve the aforementioned challenge. Numerical results verify the effectiveness of our proposed algorithms and demonstrate the benefit of utilizing an MA array to mitigate the beam squint effect in NF wideband systems.
Submitted 28 July, 2024;
originally announced July 2024.
-
Integrating Posture Control in Speech Motor Models: A Parallel-Structured Simulation Approach
Authors:
Yadong Liu,
Sidney Fels,
Arian Shamei,
Najeeb Khan,
Bryan Gick
Abstract:
Posture is an essential aspect of motor behavior, necessitating continuous muscle activation to counteract gravity. It remains stable under perturbation, aiding in maintaining bodily balance and enabling movement execution. Similarities have been observed between gross body postures and speech postures, such as those involving the jaw, tongue, and lips, which also exhibit resilience to perturbations and assist in equilibrium and movement. Although postural control is a recognized element of human movement and balance, particularly in broader motor skills, it has not been adequately incorporated into existing speech motor control models, which typically concentrate on the gestures or motor commands associated with specific speech movements, overlooking the influence of postural control and gravity. Here we introduce a model that aligns speech posture and movement, using simulations to explore whether speech posture within this framework mirrors the principles of bodily postural control. Our findings indicate that, akin to body posture, speech posture is also robust to perturbation and plays a significant role in maintaining local segment balance and enhancing speech production.
Submitted 26 July, 2024;
originally announced July 2024.
-
Speech Editing -- a Summary
Authors:
Tobias Kässmann,
Yining Liu,
Danni Liu
Abstract:
With the rise of video production and social media, speech editing has become crucial for creators to address issues like mispronunciations, missing words, or stuttering in audio recordings. This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing. These approaches ensure edited audio is indistinguishable from the original by altering the mel-spectrogram. Recent advancements, such as context-aware prosody correction and advanced attention mechanisms, have improved speech editing quality. This paper reviews state-of-the-art methods, compares key metrics, and examines widely used datasets. The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
Submitted 24 July, 2024;
originally announced July 2024.
-
APS-USCT: Ultrasound Computed Tomography on Sparse Data via AI-Physic Synergy
Authors:
Yi Sheng,
Hanchen Wang,
Yipei Liu,
Junhuan Yang,
Weiwen Jiang,
Youzuo Lin,
Lei Yang
Abstract:
Ultrasound computed tomography (USCT) is a promising technique that achieves superior medical imaging reconstruction resolution by fully leveraging waveform information, outperforming conventional ultrasound methods. Despite its advantages, high-quality USCT reconstruction relies on extensive data acquisition by a large number of transducers, leading to increased costs, computational demands, extended patient scanning times, and manufacturing complexities. To mitigate these issues, we propose a new USCT method called APS-USCT, which facilitates imaging with sparse data, substantially reducing dependence on high-cost dense data acquisition. Our APS-USCT method consists of two primary components: APS-wave and APS-FWI. The APS-wave component, an encoder-decoder system, preprocesses the waveform data, converting sparse data into dense waveforms to augment sample density prior to reconstruction. The APS-FWI component, utilizing the InversionNet, directly reconstructs the speed of sound (SOS) from the ultrasound waveform data. We further improve the model's performance by incorporating Squeeze-and-Excitation (SE) Blocks and source encoding techniques. Testing our method on a breast cancer dataset yielded promising results. It demonstrated outstanding performance with an average Structural Similarity Index (SSIM) of 0.8431. Notably, over 82% of samples achieved an SSIM above 0.8, with nearly 61% exceeding 0.85, highlighting the significant potential of our approach in improving USCT image reconstruction by efficiently utilizing sparse data.
Submitted 18 July, 2024;
originally announced July 2024.
-
Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
Authors:
Lukuan Dong,
Donghong Qin,
Fengbo Bai,
Fanhua Song,
Yan Liu,
Chen Xu,
Zhijian Ou
Abstract:
Mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that the annotated speech is very limited. With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares the three approaches for Iu Mien speech recognition. Our experiments are based on three recently released backbone models pretrained over the 10 languages from the CommonVoice dataset (CV-Lang10), which correspond to the three approaches for low-resourced ASR. It is found that phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. In particular, the Whistle models, i.e., those obtained by weakly-supervised phoneme-based multilingual pre-training, obtain the most competitive results.
Submitted 18 July, 2024;
originally announced July 2024.
-
Disturbance Observer for Estimating Coupled Disturbances
Authors:
Jindou Jia,
Yuhang Liu,
Kexin Guo,
Xiang Yu,
Lihua Xie,
Lei Guo
Abstract:
High-precision control for nonlinear systems is impeded by low-fidelity dynamical models and external disturbances. In particular, the intricate coupling between internal uncertainty and external disturbance is usually difficult to model explicitly. Here we show an effective and convergent algorithm enabling accurate estimation of the coupled disturbance by combining control and learning philosophies. Specifically, by resorting to Chebyshev series expansion, the coupled disturbance is first decomposed into an unknown parameter matrix and two known structures depending on the system state and the external disturbance, respectively. A Regularized Least Squares (RLS) algorithm is subsequently formalized to learn the parameter matrix using historical time-series data. Finally, a higher-order disturbance observer (HODO) is developed to achieve a high-precision estimation of the coupled disturbance by utilizing the learned portion. The efficiency of the proposed algorithm is evaluated through extensive simulations. We believe this work can offer a new option to merge learning schemes into the control framework for addressing existing intractable control problems.
Submitted 18 July, 2024;
originally announced July 2024.
-
MEDIC: Zero-shot Music Editing with Disentangled Inversion Control
Authors:
Huadai Liu,
Jialei Wang,
Rongjie Huang,
Yang Liu,
Jiayang Xu,
Zhou Zhao
Abstract:
Text-guided diffusion models catalyze a paradigm shift in audio generation, facilitating the adaptability of source audio to conform to specific textual prompts. Recent advancements introduce inversion techniques, like DDIM inversion, to zero-shot editing, exploiting pre-trained diffusion models for audio modification. Nonetheless, our investigation reveals that DDIM inversion suffers from an accumulation of errors across each diffusion step, undermining its efficacy. In addition, the lack of attention control hinders fine-grained manipulation of music. To counteract these limitations, we introduce the Disentangled Inversion technique, which is designed to disentangle the diffusion process into three branches, thereby magnifying their individual capabilities for both precise editing and preservation. Furthermore, we propose the Harmonized Attention Control framework, which unifies mutual self-attention and cross-attention with an additional Harmonic Branch to achieve the desired composition and structural information in the target music. Collectively, these innovations comprise the Disentangled Inversion Control (DIC) framework, enabling accurate music editing whilst safeguarding structural integrity. To benchmark audio editing efficacy, we introduce ZoME-Bench, a comprehensive music editing benchmark hosting 1,100 samples spread across 10 distinct editing categories, which facilitates both zero-shot and instruction-based music editing tasks. Our method demonstrates unparalleled performance in edit fidelity and essential content preservation, outperforming contemporary state-of-the-art inversion techniques.
Submitted 18 July, 2024;
originally announced July 2024.
-
ICAGC 2024: Inspirational and Convincing Audio Generation Challenge 2024
Authors:
Ruibo Fu,
Rui Liu,
Chunyu Qiang,
Yingming Gao,
Yi Lu,
Shuchen Shi,
Tao Wang,
Ya Li,
Zhengqi Wen,
Chen Zhang,
Hui Bu,
Yukun Liu,
Xin Qi,
Guanjun Li
Abstract:
The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions and controlled detail content remains limited. This constraint leads to a discrepancy between the generated audio and human subjective perception in practical applications like companion robots for children and marketing bots. The core issue lies in the inconsistency between high-quality audio generation and the ultimate human subjective experience. Therefore, this challenge aims to enhance the persuasiveness and acceptability of synthesized audio, focusing on human-aligned, convincing, and inspirational audio generation. A total of 19 teams have registered for the challenge, and the competition and its results are described in this paper.
Submitted 31 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
One-Bit MIMO Detection: From Global Maximum-Likelihood Detector to Amplitude Retrieval Approach
Authors:
Mingjie Shao,
Wei-Kun Chen,
Cheng-Yang Yu,
Ya-Feng Liu,
Wing-Kin Ma
Abstract:
As communication systems advance towards the future 6G era, the incorporation of large-scale antenna arrays in base stations (BSs) presents challenges such as increased hardware costs and energy consumption. To address these issues, the use of one-bit analog-to-digital converters (ADCs)/digital-to-analog converters (DACs) has gained significant attention. This paper focuses on one-bit multiple-input multiple-output (MIMO) detection in an uplink multiuser transmission scenario where the BS employs one-bit ADCs. One-bit quantization retains only the sign information and loses the amplitude information, which poses a unique challenge in the corresponding detection problem. The maximum-likelihood (ML) formulation of one-bit MIMO detection has a challenging likelihood function that hinders the application of many high-performance detectors developed for classic MIMO detection (under high-resolution ADCs). While many approximate methods for the ML detection problem have been studied, an efficient global algorithm is still lacking. This paper fills this gap by proposing an efficient branch-and-bound algorithm, which is guaranteed to find the global solution of the one-bit ML MIMO detection problem. Additionally, a new amplitude retrieval (AR) detection approach is developed, incorporating explicit amplitude variables into the problem formulation. The AR approach yields simpler objective functions that enable the development of efficient algorithms offering both global and approximate solutions. The paper also contributes to the computational complexity analysis of both ML and AR detection problems. Extensive simulations are conducted to demonstrate the effectiveness and efficiency of the proposed formulations and algorithms.
△ Less
Submitted 16 July, 2024; v1 submitted 13 July, 2024;
originally announced July 2024.
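The statement in the abstract above that one-bit quantization keeps only sign information can be illustrated with a minimal sketch. The function name `one_bit_quantize` and the sample values are assumptions made for illustration; they are not taken from the paper.

```python
def one_bit_quantize(samples):
    """Keep only the signs of the real and imaginary parts of each sample."""
    def sign(value):
        # Map zero to +1 so the output alphabet is always {+1, -1}.
        return 1.0 if value >= 0 else -1.0

    return [complex(sign(s.real), sign(s.imag)) for s in samples]


# Hypothetical received samples (illustrative values only).
received = [0.3 - 2.0j, -5.0 + 0.1j]
quantized = one_bit_quantize(received)

# Scaling the input by a positive constant leaves the output unchanged,
# which is why the amplitude cannot be read back from the quantized data.
scaled = one_bit_quantize([10 * s for s in received])
```

Because `scaled` equals `quantized`, a detector that sees only the quantizer output has no direct access to amplitude, which is the difficulty the abstract describes.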
-
Automated high-resolution backscattered-electron imaging at macroscopic scale
Authors:
Zhiyuan Lang,
Zunshuai Zhang,
Lei Wang,
Yuhan Liu,
Weixiong Qian,
Shenghua Zhou,
Ying Jiang,
Tongyi Zhang,
Jiong Yang
Abstract:
Scanning electron microscopy (SEM) has been widely utilized in the field of materials science due to its significant advantages, such as large depth of field, wide field of view, and excellent stereoscopic imaging. However, at high magnification, the limited imaging range in SEM cannot cover all the possible inhomogeneous microstructures. In this research, we propose a novel approach for generating high-resolution SEM images across multiple scales, enabling a single image to capture physical dimensions at the centimeter level while preserving submicron-level details. We use SEM imaging of the AlCoCrFeNi2.1 eutectic high entropy alloy (EHEA) as an example. SEM videos and image stitching are combined to fulfill this goal, and the video-extracted low-definition (LD) images are clarified by a well-trained denoising model. Furthermore, we segment the macroscopic image of the EHEA, and the areas of various microstructures are distinguished. Combining the segmentation results and hardness experiments, we found that the hardness is positively correlated with the content of body-centered cubic (BCC) phase and negatively correlated with the lamella width, while its relationship with the proportion of lamellar structures was not significant. Our work provides a feasible solution to generate macroscopic images based on SEMs for further analysis of the correlations between the microstructures and spatial distribution, and can be widely applied to other types of microscopes.
Submitted 15 July, 2024;
originally announced July 2024.
-
Gravity Balanced Arm Exoskeleton for Basketball Shooting Training
Authors:
Yunfei Liu,
Zhanghao Yang
Abstract:
This paper proposes a gravity balanced arm exoskeleton design for basketball shooting training. The potential energy equation of the mechanism is derived. A simulation of the arm going through the basketball shooting motion is done on the mechanism. Throughout the motion the total potential energy is constant. Thus, the proposed arm exoskeleton is indeed gravity balanced with the use of two springs.
Submitted 13 July, 2024;
originally announced July 2024.
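The idea in the abstract above, that a mechanism is gravity balanced when its total potential energy stays constant over the motion, can be sketched with a textbook single-link case. This is not the paper's two-spring arm mechanism: it assumes one rigid link, one zero-free-length spring anchored a distance `a` above the pivot and attached a distance `b` along the link, and illustrative parameter values.

```python
import math


def total_potential_energy(theta, m, g, l, k, a, b):
    """Gravitational plus spring energy of a single balanced link.

    theta is measured from the upward vertical; the spring is assumed to
    have zero free length (an idealisation).
    """
    gravity = m * g * l * math.cos(theta)
    spring = 0.5 * k * (a * a + b * b - 2 * a * b * math.cos(theta))
    return gravity + spring


def balancing_stiffness(m, g, l, a, b):
    """Stiffness that cancels the cos(theta) term: k * a * b = m * g * l."""
    return m * g * l / (a * b)


# Hypothetical parameters: mass (kg), gravity (m/s^2), lengths (m).
m, g, l, a, b = 2.0, 9.81, 0.3, 0.1, 0.15
k = balancing_stiffness(m, g, l, a, b)

energies = [
    total_potential_energy(math.radians(deg), m, g, l, k, a, b)
    for deg in range(0, 181, 15)
]
spread = max(energies) - min(energies)  # close to zero when balanced
```

With the balancing stiffness, the angle-dependent terms cancel and `spread` is zero up to floating-point rounding, which is the same constancy check the abstract reports for its simulated shooting motion.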
-
Human Leg Training Machine Based on The Multi-linkage System
Authors:
Yunfei Liu,
Zhanghao Yang
Abstract:
In real life, many people have leg defects. The goal of our work is to design a mechanism that helps them walk along a specific trajectory and eventually achieve flexible walking. In this paper, we use a motor to drive a multi-link leg mechanism. The major issues addressed in this paper are as follows: (i) designing a human leg training mechanism based on the multi-link mechanism; (ii) simulating the leg movement trajectory of the multi-link mechanism based on the walking process; (iii) using torque control of a single motor to control the trajectory and velocity of the mechanism.
Submitted 13 July, 2024;
originally announced July 2024.
-
Temperature Secret in Bathtub: A Model of Temperature Distribution of Bathtub Based on Heat Conduction Equation
Authors:
Yunfei Liu
Abstract:
We use the multidimensional heat conduction and heat transfer equations to model the temperature distribution of water in a bathtub by solving partial differential equations. We address optimal water addition and bathtub design. First, we establish a water surface cooling model using Newton's law of cooling to simulate heat exchange between air and water. Without new heat sources, the water temperature reaches a minimum in 40 minutes. We then simulate adding hot water with a one-dimensional heat conduction model, including air cooling effects. We determine that the optimal heat input is 80 Joules and the optimal water velocity is 0.042 m/s to maintain temperature and save water. The ideal bathtub dimensions are 1.5m length, 0.6m width, 0.42m depth, with rounded corners. Using finite difference methods and MATLAB's Pdetool, we solve the heat conduction equation and verify numerical stability, discussing the model's pros and cons and suggesting improvements.
Submitted 12 July, 2024;
originally announced July 2024.
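The finite-difference solution and stability check mentioned in the abstract above can be sketched for the one-dimensional heat conduction equation. This is a generic explicit (forward-time, centred-space) scheme with fixed boundary temperatures, not the paper's model; the function name, grid, and temperatures are assumptions, and the diffusivity is an approximate value for water.

```python
def ftcs_step(temps, alpha, dx, dt):
    """Advance a 1D temperature profile by one explicit time step.

    The end points are held fixed (Dirichlet boundaries). The explicit
    scheme is stable only when r = alpha * dt / dx**2 is at most 0.5.
    """
    r = alpha * dt / (dx * dx)
    if r > 0.5:
        raise ValueError("unstable: alpha*dt/dx^2 must not exceed 0.5")
    updated = list(temps)
    for i in range(1, len(temps) - 1):
        updated[i] = temps[i] + r * (temps[i + 1] - 2 * temps[i] + temps[i - 1])
    return updated


# Hypothetical setup: hot inlet (45 C) at one end of 30 C water.
profile = [45.0] + [30.0] * 10
for _ in range(200):
    # alpha is roughly the thermal diffusivity of water in m^2/s (approximate).
    profile = ftcs_step(profile, alpha=1.4e-7, dx=0.01, dt=300.0)
```

Here r is 0.42, so each update is a weighted average of neighbouring values and the profile stays between the boundary temperatures; doubling `dt` pushes r past 0.5 and the function refuses to step.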
-
Perceived Time To Collision as Public Space Users' Discomfort Metric
Authors:
Alireza Jafari,
Yen-Chen Liu
Abstract:
Micro-mobility transport vehicles such as e-scooters are joining current sidewalk users and affect the safety and comfort of pedestrians as primary sidewalk users. The lack of agreed-upon metrics to quantify people's discomfort hinders shared public space safety research. We introduce perceived Time To Collision (TTC) as a potential metric of user discomfort performing controlled experiments using an e-scooter and a pedestrian moving in a hallway. The results strongly correlate the participant's reported discomfort and the perceived TTC. Therefore, TTC is a potential metric for public space users' discomfort. Since the metric only uses relative velocity and position information, it is a viable candidate for neighboring people's discomfort estimation in advanced driver assistance systems for e-scooters and PMVs. Our ongoing research extends the results to mobile robots.
Submitted 12 July, 2024;
originally announced July 2024.
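The abstract above notes that the metric needs only relative position and velocity. A minimal sketch of one common time-to-collision definition (range divided by closing speed, for point agents in a plane) is given below; whether the paper's perceived TTC uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def time_to_collision(rel_pos, rel_vel):
    """Range divided by closing speed for two point agents in 2D.

    rel_pos and rel_vel are (x, y) tuples of the other agent relative to
    the observer. Returns math.inf when the agents are not approaching.
    """
    # Negative dot product means the separation is shrinking.
    closing = -(rel_pos[0] * rel_vel[0] + rel_pos[1] * rel_vel[1])
    if closing <= 0:
        return math.inf
    dist_sq = rel_pos[0] ** 2 + rel_pos[1] ** 2
    return dist_sq / closing


# An e-scooter 10 m ahead, approaching at 2 m/s along the same line.
approaching = time_to_collision((10.0, 0.0), (-2.0, 0.0))
# The same scooter moving away instead.
receding = time_to_collision((10.0, 0.0), (2.0, 0.0))
```

In the head-on case the result is 5 seconds, and a receding agent yields infinity, so smaller values correspond to more imminent encounters.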
-
Dynamic Modeling and Stability Analysis of Balancing in Riderless Electric Scooters
Authors:
Yun-Hao Lin,
Alireza Jafari,
Yen-Chen Liu
Abstract:
Today, the electric scooter is a trendy personal mobility vehicle. The rising demand and opportunities attract ride-share services. A common problem of such services is abandoned e-scooters. An autonomous e-scooter capable of moving to the charging station is a solution. This paper focuses on maintaining balance for these riderless e-scooters. The paper presents a nonlinear model for an e-scooter moving with simultaneously varying speed and steering. A PD and a feedback-linearized PD controller stabilize the model. The stability analysis shows that the controllers are ultimately bounded even with parameter uncertainties and measurement inaccuracy. Simulations on a realistic e-scooter with a general demanding path to follow verify the ultimate boundedness of the controllers. In addition, the feedback-linearized PD controller outperforms the PD controller because it has narrower ultimate bounds. Future work focuses on experiments using a self-balancing mechanism installed on an e-scooter.
Submitted 12 July, 2024;
originally announced July 2024.
-
Physical encryption and decryption for secure data transmission in optical networks leveraging the temporal Talbot effect and microwave photonics
Authors:
Chulun Lin,
Taixia Shi,
Yiqing Liu,
Yang Chen
Abstract:
A novel microwave photonic scheme for secure data transmission in optical networks is proposed. The security of the scheme is guaranteed by physical encryption and decryption via the temporal Talbot effect in dispersive mediums. First, the original data is randomized in the digital domain by performing an exclusive OR operation using a random matrix. Subsequently, a time-varying multi-tone electrical signal, which represents the randomized data matrix, is modulated onto an optical carrier. The optical signal after modulation is then phase-modulated by a temporal Talbot array illuminator (TAI) signal, and the optical signal after discrete quadratic phase modulation will lose its original appearance in the frequency domain and be further dispersed in the first dispersive medium. Due to the dispersion that does not match the TAI signal exactly, the waveform after the first dispersive medium is a noise-like signal. Hence, the physical encryption of the original data is successfully achieved. As the optical signal passes a second dispersive medium that makes the total dispersion match the TAI signal, the temporal waveform of the noise-like signal after photodetection is transformed into pulses. "1" and "0" in the randomized data matrix are represented through the presence and absence of pulses, and the physical decryption is achieved. By further processing the recovered data matrix using the random matrix, the original data can be recovered. The physical layer security of the proposed scheme and its fiber transmission capability are demonstrated. 8-Gbit/s data is transmitted, encrypted, and decrypted using two dispersive mediums and an optical fiber of 10 to 200 km, and error-free transmission is achieved. Many factors that affect the encryption, decryption, and transmission performance of the system have been analyzed.
Submitted 12 July, 2024;
originally announced July 2024.
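The digital-domain step described in the abstract above, randomizing the data with an exclusive OR against a random matrix and undoing it with the same matrix, can be sketched as follows. The matrix sizes, seed, and helper name `xor_matrix` are illustrative assumptions; the optical encryption stages are not modelled.

```python
import random


def xor_matrix(data, key):
    """Element-wise XOR of two equally sized bit matrices."""
    return [[d ^ k for d, k in zip(drow, krow)] for drow, krow in zip(data, key)]


# Hypothetical 2x4 data matrix and a random key of the same shape.
rng = random.Random(0)
data = [[1, 0, 1, 1], [0, 0, 1, 0]]
key = [[rng.randint(0, 1) for _ in row] for row in data]

scrambled = xor_matrix(data, key)       # randomized data matrix
recovered = xor_matrix(scrambled, key)  # XOR with the same key undoes it
```

XOR is its own inverse, so applying the same random matrix a second time returns the original bits, which matches the final recovery step in the abstract.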
-
Improvement of Sensitivity of Capacitive Micromachined Ultrasound Transducer
Authors:
Yifan Wang,
Yunfei Liu
Abstract:
Capacitive Micromachined Ultrasonic Transducers (CMUTs) have a wide range of applications in medical detection and imaging. However, operating under the self-generating-self-receiving (SGSR) method usually results in poor sensitivity, and the sensitivity cannot be improved simply by increasing the resonant frequency, since the frequency of a specific kind of CMUT is designed for a specific usage. In this paper, based on one specific type of CMUT, a mechanical model is built and a simulation analysis is presented. A new method, one-generating-multiple-receiving (OGMR), is introduced, and a special circuit model is designed to improve the signal-to-thermal-noise ratio. By increasing the number of receiving capacitors from 1 to 8, we increased the signal-to-noise ratio by a factor of 2.83.
Submitted 12 July, 2024;
originally announced July 2024.
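The factor of 2.83 reported in the abstract above is numerically close to the square root of 8. That is the gain one would expect if the signals from the receivers add coherently while their thermal noise is uncorrelated; this interpretation is an assumption made here for illustration, not a statement of the paper's circuit model.

```python
import math


def snr_gain_from_averaging(num_receivers):
    """Amplitude SNR gain when combining receivers with uncorrelated noise.

    Signal amplitude grows with N while noise grows with sqrt(N), so the
    ratio improves by sqrt(N) under this idealised assumption.
    """
    return math.sqrt(num_receivers)


gain = snr_gain_from_averaging(8)
```

Rounded to two decimals, `gain` is 2.83, matching the figure quoted for eight receiving capacitors.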
-
Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning
Authors:
Chuang Zhang,
Geng Sun,
Jiahui Li,
Qingqing Wu,
Jiacheng Wang,
Dusit Niyato,
Yuanwei Liu
Abstract:
Owing to their flexibility and low cost, unmanned aerial vehicles (UAVs) are increasingly crucial for enhancing coverage and functionality of wireless networks. However, incorporating UAVs into next-generation wireless communication systems poses significant challenges, particularly in sustaining high-rate and long-range secure communications against eavesdropping attacks. In this work, we consider a UAV swarm-enabled secure surveillance network system, where a UAV swarm forms a virtual antenna array to transmit sensitive surveillance data to a remote base station (RBS) via collaborative beamforming (CB) so as to resist mobile eavesdroppers. Specifically, we formulate an aerial secure communication and energy efficiency multi-objective optimization problem (ASCEE-MOP) to maximize the secrecy rate of the system and to minimize the flight energy consumption of the UAV swarm. To address the non-convex, NP-hard and dynamic ASCEE-MOP, we propose a generative diffusion model-enabled twin delayed deep deterministic policy gradient (GDMTD3) method. Specifically, GDMTD3 leverages an innovative application of diffusion models to determine optimal excitation current weights and position decisions of UAVs. The diffusion models can better capture the complex dynamics and the trade-off of the ASCEE-MOP, thereby yielding promising solutions. Simulation results highlight the superior performance of the proposed approach compared with traditional deployment strategies and some other deep reinforcement learning (DRL) benchmarks. Moreover, performance analysis under various parameter settings of GDMTD3 and different numbers of UAVs verifies the robustness of the proposed approach.
Submitted 11 July, 2024;
originally announced July 2024.
-
Autoregressive Speech Synthesis without Vector Quantization
Authors:
Lingwei Meng,
Long Zhou,
Shujie Liu,
Sanyuan Chen,
Bing Han,
Shujie Hu,
Yanqing Liu,
Jinyu Li,
Sheng Zhao,
Xixin Wu,
Helen Meng,
Furu Wei
Abstract:
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
Submitted 11 July, 2024;
originally announced July 2024.
-
SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis
Authors:
Zihao Wang,
Le Ma,
Yan Liu,
Kejun Zhang
Abstract:
Singing voice conversion (SVC) aims to convert a singer's voice in a given music piece to another singer while keeping the original content. We propose an end-to-end feature disentanglement-based model, which we named SaMoye, to enable zero-shot many-to-many singing voice conversion. SaMoye disentangles the features of the singing voice into content features, timbre features, and pitch features respectively. The content features are enhanced using a GPT-based model to perform cross-prediction with the phoneme of the lyrics. SaMoye can generate the music with converted voice by replacing the timbre features with the target singer. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance. The dataset consists of 1500k pure singing vocal clips containing at least 10,000 singers.
Submitted 10 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Accelerating Mobile Edge Generation (MEG) by Constrained Learning
Authors:
Xiaoxia Xu,
Yuanwei Liu,
Xidong Mu,
Hong Xing,
Arumugam Nallanathan
Abstract:
A novel accelerated mobile edge generation (MEG) framework is proposed for generating high-resolution images on mobile devices. Exploiting a large-scale latent diffusion model (LDM) distributed across edge server (ES) and user equipment (UE), cost-efficient artificial intelligence generated content (AIGC) is achieved by transmitting low-dimensional features between ES and UE. To reduce overheads of both distributed computations and transmissions, a dynamic diffusion and feature merging scheme is conceived. By jointly optimizing the denoising steps and feature merging ratio, the image generation quality is maximized subject to latency and energy consumption constraints. To address this problem and tailor LDM sub-models, a low-complexity MEG acceleration protocol is developed. Particularly, a backbone meta-architecture is trained via offline distillation. Then, dynamic diffusion and feature merging are determined in online channel environment, which can be viewed as a constrained Markov Decision Process (MDP). A constrained variational policy optimization (CVPO) based MEG algorithm is further proposed for constraint-guaranteed learning, namely MEG-CVPO. Numerical results verify that: 1) The proposed framework can generate 1024$\times$1024 high-quality images over noisy channels while reducing over $40\%$ latency compared to conventional generation schemes. 2) The developed MEG-CVPO effectively mitigates constraint violations, thus flexibly controlling the trade-off between image distortion and generation costs.
Submitted 6 August, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
RespEar: Earable-Based Robust Respiratory Rate Monitoring
Authors:
Yang Liu,
Kayla-Jade Butkow,
Jake Stuchbury-Wass,
Adam Pullin,
Dong Ma,
Cecilia Mascolo
Abstract:
Respiratory rate (RR) monitoring is integral to understanding physical and mental health and tracking fitness. Existing studies have demonstrated the feasibility of RR monitoring under specific user conditions (e.g., while remaining still, or while breathing heavily). Yet, performing accurate, continuous and non-obtrusive RR monitoring across diverse daily routines and activities remains challenging. In this work, we present RespEar, an earable-based system for robust RR monitoring. By leveraging the unique properties of in-ear microphones in earbuds, RespEar enables the use of Respiratory Sinus Arrhythmia (RSA) and Locomotor Respiratory Coupling (LRC), physiological couplings between cardiovascular activity, gait and respiration, to indirectly determine RR. This effectively addresses the challenges posed by the almost imperceptible breathing signals under daily activities. We further propose a suite of meticulously crafted signal processing schemes to improve RR estimation accuracy and robustness. With data collected from 18 subjects over 8 activities, RespEar measures RR with a mean absolute error (MAE) of 1.48 breaths per minute (BPM) and a mean absolute percent error (MAPE) of 9.12% in sedentary conditions, and an MAE of 2.28 BPM and a MAPE of 11.04% in active conditions, which is unprecedented for a method capable of generalizing across conditions with a single modality.
Submitted 9 July, 2024;
originally announced July 2024.
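The two error metrics quoted in the abstract above, MAE and MAPE, follow standard definitions that can be sketched in a few lines. The function names and the sample respiratory rates are illustrative assumptions, not data from the study.

```python
def mean_absolute_error(reference, estimate):
    """Average absolute difference, in the same unit as the inputs."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


def mean_absolute_percentage_error(reference, estimate):
    """Average absolute difference relative to the reference, in percent."""
    ratios = [abs(r - e) / abs(r) for r, e in zip(reference, estimate)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical ground-truth and estimated rates in breaths per minute.
reference = [10.0, 20.0]
estimate = [11.0, 18.0]

mae = mean_absolute_error(reference, estimate)
mape = mean_absolute_percentage_error(reference, estimate)
```

MAE reports the error in breaths per minute, while MAPE normalizes by the true rate, so the two can rank conditions differently when the underlying rates differ.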
-
Implicit Regression in Subspace for High-Sensitivity CEST Imaging
Authors:
Chu Chen,
Yang Liu,
Se Weon Park,
Jizhou Li,
Kannie W. Y. Chan,
Raymond H. F. Chan
Abstract:
Chemical Exchange Saturation Transfer (CEST) MRI demonstrates its capability in significantly enhancing the detection of proteins and metabolites with low concentrations through exchangeable protons. The clinical application of CEST, however, is constrained by its low contrast and low signal-to-noise ratio (SNR) in the acquired data. Denoising, as one of the post-processing stages for CEST data, can effectively improve the accuracy of CEST quantification. In this work, by modeling spatially variant Z-spectra in a low-dimensional subspace, we introduce Implicit Regression in Subspace (IRIS), which is an unsupervised denoising algorithm utilizing the excellent property of implicit neural representation for continuous mapping. Experiments conducted on both synthetic and in-vivo data demonstrate that our proposed method surpasses other CEST denoising methods regarding both qualitative and quantitative performance.
Submitted 9 July, 2024;
originally announced July 2024.
-
Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports
Authors:
Yutong Zhang,
Yi Pan,
Tianyang Zhong,
Peixin Dong,
Kangni Xie,
Yuxiao Liu,
Hanqi Jiang,
Zhengliang Liu,
Shijie Zhao,
Tuo Zhang,
Xi Jiang,
Dinggang Shen,
Tianming Liu,
Xin Zhang
Abstract:
Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence (AGI) for computer vision, showcasing their potential in the biomedical domain. In this study, we comprehensively evaluated the performance of Gemini, GPT-4, and 4 popular large models across 14 medical imaging datasets, including 5 medical imaging categories (dermatology, radiology, dentistry, ophthalmology, and endoscopy), and 3 radiology report datasets. The investigated tasks encompass disease classification, lesion segmentation, anatomical localization, disease diagnosis, report generation, and lesion detection. Our experimental results demonstrated that Gemini-series models excelled in report generation and lesion detection but faced challenges in disease classification and anatomical localization. Conversely, GPT-series models exhibited proficiency in lesion segmentation and anatomical localization but encountered difficulties in disease diagnosis and lesion detection. Additionally, both the Gemini series and GPT series contain models that have demonstrated commendable generation efficiency. While both models hold promise in reducing physician workload, alleviating pressure on limited healthcare resources, and fostering collaboration between clinical practitioners and artificial intelligence technologies, substantial enhancements and comprehensive validations remain imperative before clinical deployment.
Submitted 8 July, 2024;
originally announced July 2024.
-
AIRA: A Low-cost IR-based Approach Towards Autonomous Precision Drone Landing and NLOS Indoor Navigation
Authors:
Yanchen Liu,
Minghui Zhao,
Kaiyuan Hou,
Junxi Xia,
Charlie Carver,
Stephen Xia,
Xia Zhou,
Xiaofan Jiang
Abstract:
Automatic drone landing is an important step for achieving fully autonomous drones. Although there are many works that leverage GPS, video, wireless signals, and active acoustic sensing to perform precise landing, autonomous drone landing remains an unsolved challenge for palm-sized microdrones that may not be able to support the high computational requirements of vision, wireless, or active audio sensing. We propose AIRA, a low-cost infrared light-based platform that targets precise and efficient landing of low-resource microdrones. AIRA consists of an infrared light bulb at the landing station along with an energy efficient hardware photodiode (PD) sensing platform at the bottom of the drone. AIRA costs under 83 USD, while achieving comparable performance to existing vision-based methods at a fraction of the energy cost. AIRA requires only three PDs without any complex pattern recognition models to accurately land the drone, under $10$cm of error, from up to $11.1$ meters away, compared to camera-based methods that require recognizing complex markers using high resolution images with a range of only up to $1.2$ meters from the same height. Moreover, we demonstrate that AIRA can accurately guide drones in low light and partial non line of sight scenarios, which are difficult for traditional vision-based approaches.
Submitted 8 July, 2024;
originally announced July 2024.
-
LINEAR: Learning Implicit Neural Representation With Explicit Physical Priors for Accelerated Quantitative T1rho Mapping
Authors:
Yuanyuan Liu,
Jinwen Xie,
Zhuo-Xu Cui,
Qingyong Zhu,
Jing Cheng,
Dong Liang,
Yanjie Zhu
Abstract:
Quantitative T1rho mapping has shown promise in clinical and research studies. However, it suffers from long scan times. Deep learning-based techniques have been successfully applied in accelerated quantitative MR parameter mapping. However, most methods require fully sampled training datasets, which is impractical in the clinic. In this study, a novel subject-specific unsupervised method based on the implicit neural representation is proposed to reconstruct T1rho-weighted images from highly undersampled k-space data, which only takes spatiotemporal coordinates as the input. Specifically, the proposed method learns an implicit neural representation of the MR images driven by two explicit priors from the physical model of T1rho mapping, including the signal relaxation prior and self-consistency of k-t space data prior. The proposed method was verified using both retrospective and prospective undersampled k-space data. Experimental results demonstrate that LINEAR achieves a high acceleration factor up to 14, and outperforms the state-of-the-art methods in terms of suppressing artifacts and achieving the lowest error.
Submitted 23 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation
Authors:
Ruibo Fu,
Xin Qi,
Zhengqi Wen,
Jianhua Tao,
Tao Wang,
Chunyu Qiang,
Zhiyong Wang,
Yi Lu,
Xiaopeng Wang,
Shuchen Shi,
Yukun Liu,
Xuefei Liu,
Shuai Zhang
Abstract:
Speaker adaptation, which involves cloning voices from unseen speakers in the Text-to-Speech task, has garnered significant interest due to its numerous applications in multi-media fields. Despite recent advancements, existing methods often struggle with inadequate speaker representation accuracy and overfitting, particularly in limited reference speeches scenarios. To address these challenges, we propose an Agile Speaker Representation Reinforcement Learning strategy to enhance speaker similarity in speaker adaptation tasks. ASRRL is the first work to apply reinforcement learning to improve the modeling accuracy of speaker embeddings in speaker adaptation, addressing the challenge of decoupling voice content and timbre. Our approach introduces two action strategies tailored to different reference speeches scenarios. In the single-sentence scenario, a knowledge-oriented optimal routine searching RL method is employed to expedite the exploration and retrieval of refinement information on the fringe of speaker representations. In the few-sentence scenario, we utilize a dynamic RL method to adaptively fuse reference speeches, enhancing the robustness and accuracy of speaker modeling. To achieve optimal results in the target domain, a multi-scale fusion scoring mechanism based reward model that evaluates speaker similarity, speech quality, and intelligibility across three dimensions is proposed, ensuring that improvements in speaker similarity do not compromise speech quality or intelligibility. The experimental results on the LibriTTS and VCTK datasets within mainstream TTS frameworks demonstrate the extensibility and generalization capabilities of the proposed ASRRL method. The results indicate that the ASRRL method significantly outperforms traditional fine-tuning approaches, achieving higher speaker similarity and better overall speech quality with limited reference speeches.
Submitted 7 July, 2024;
originally announced July 2024.
-
MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation
Authors:
Zihao Wang,
Haoxuan Liu,
Jiaxing Yu,
Tao Zhang,
Yan Liu,
Kejun Zhang
Abstract:
Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets pre-set with expert annotations or auto-generated ones with inherent biases, CaiMD caters more sufficiently to our purpose of aligning AI-generated music with widespread user-desired results. Moreover, we propose an innovative single-stage framework called MuDiT/MuSiT for enabling effective human-machine alignment in song creation. This framework not only achieves cross-modal comprehension between colloquial language and auditory music perceptions but also ensures generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components like melody, harmony, rhythm, vocals, and instrumentation. The approach ensures harmonious sonic cohesiveness amongst all generated musical components, facilitating better resonance with human auditory expectations.
Submitted 10 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction
Authors:
Jiaxin Guo,
Jiangliu Wang,
Di Kang,
Wenzhen Dong,
Wenting Wang,
Yun-hui Liu
Abstract:
Real-time 3D reconstruction of surgical scenes plays a vital role in computer-assisted surgery, holding a promise to enhance surgeons' visibility. Recent advancements in 3D Gaussian Splatting (3DGS) have shown great potential for real-time novel view synthesis of general scenes, which relies on accurate poses and point clouds generated by Structure-from-Motion (SfM) for initialization. However, 3DGS with SfM fails to recover accurate camera poses and geometry in surgical scenes due to the challenges of minimal textures and photometric inconsistencies. To tackle this problem, in this paper, we propose the first SfM-free 3DGS-based method for surgical scene reconstruction by jointly optimizing the camera poses and scene representation. Based on the video continuity, the key of our method is to exploit the immediate optical flow priors to guide the projection flow derived from 3D Gaussians. Unlike most previous methods relying on photometric loss only, we formulate the pose estimation problem as minimizing the flow loss between the projection flow and optical flow. A consistency check is further introduced to filter the flow outliers by detecting the rigid and reliable points that satisfy the epipolar geometry. During 3D Gaussian optimization, we randomly sample frames to optimize the scene representations to grow the 3D Gaussian progressively. Experiments on the SCARED dataset demonstrate our superior performance over existing methods in novel view synthesis and pose estimation with high efficiency. Code is available at https://github.com/wrld/Free-SurGS.
Submitted 3 July, 2024;
originally announced July 2024.
-
Mobile Edge Generation-Enabled Digital Twin: Architecture Design and Research Opportunities
Authors:
Xiaoxia Xu,
Ruikang Zhong,
Xidong Mu,
Yuanwei Liu,
Kaibin Huang
Abstract:
A novel paradigm of mobile edge generation (MEG)-enabled digital twin (DT) is proposed, which enables distributed on-device generation at mobile edge networks for real-time DT applications. First, an MEG-DT architecture is put forward to decentralize generative artificial intelligence (GAI) models onto edge servers (ESs) and user equipments (UEs), which has the advantages of low latency, privacy preservation, and individual-level customization. Then, various single-user and multi-user generation mechanisms are conceived for MEG-DT, which strike trade-offs between generation latency, hardware costs, and device coordination. Furthermore, to perform efficient distributed generation, two operating protocols are explored for transmitting interpretable and latent features between ESs and UEs, namely sketch-based generation and seed-based generation, respectively. Based on the proposed protocols, the convergence between MEG and DT is highlighted. Considering the seed-based image generation scenario, numerical case studies are provided to reveal the superiority of MEG-DT over centralized generation. Finally, promising applications and research opportunities are identified.
Submitted 6 August, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation
Authors:
Guanghao Zhu,
Jing Zhang,
Juanxiu Liu,
Xiaohui Du,
Ruqian Hao,
Yong Liu,
Lin Liu
Abstract:
Semi-supervised learning (SSL) has shown considerable potential in medical image segmentation, primarily leveraging consistency regularization and pseudo-labeling. However, many SSL approaches only pay attention to low-level consistency and overlook the significance of pseudo-label reliability. Therefore, in this work, we propose an adversarial self-training consistency framework (AstMatch). First, we design an adversarial consistency regularization (ACR) approach to enhance knowledge transfer and strengthen prediction consistency under varying perturbation intensities. Second, we apply a feature matching loss for adversarial training to incorporate high-level consistency regularization. Additionally, we present the pyramid channel attention (PCA) and efficient channel and spatial attention (ECSA) modules to improve the discriminator's performance. Finally, we propose an adaptive self-training (AST) approach to ensure the pseudo-labels' quality. The proposed AstMatch has been extensively evaluated against cutting-edge SSL methods on three publicly available datasets. The experimental results under different labeled ratios indicate that AstMatch outperforms other existing methods, achieving new state-of-the-art performance. Our code will be available at https://github.com/GuanghaoZhu663/AstMatch.
Submitted 28 June, 2024;
originally announced June 2024.
-
Practical Power System Inertia Monitoring Based on Pumped Storage Hydropower Operation Signature
Authors:
Hongyu Li,
Chang Chen,
Mark Baldwin,
Shutang You,
Wenpeng Yu,
Lin Zhu,
Yilu Liu
Abstract:
This paper proposes a practical method to monitor power system inertia using Pumped Storage Hydropower (PSH) switching-off events. This approach offers real-time system-level inertia estimation with minimal expenses, no disruption, and the inclusion of behind-the-meter inertia. First, accurate inertia estimation is achieved through improved RoCoF calculation that accounts for pre-event RoCoF, reducing common random frequency fluctuations in practice. Second, PSH field data is analyzed, highlighting the benefits of using switching-off events for grid inertia estimation. Third, an event detection trigger is designed to capture pump switching-off events based on local and system features. Fourth, the method is validated on the U.S. Eastern Interconnection model with over 60,000 buses, demonstrating very high accuracy (3%-5% error rate). Finally, it is applied to the U.S. Western Interconnection, with field validation showing a 9.9% average absolute error rate. Despite challenges in practical power system inertia estimation, this method enhances decision-making for power grid reliability and efficiency, addressing challenges posed by renewable energy integration.
Submitted 1 July, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Authors:
Sefik Emre Eskimez,
Xiaofei Wang,
Manthan Thakker,
Canrun Li,
Chung-Hsien Tsai,
Zhen Xiao,
Hemin Yang,
Zirun Zhu,
Min Tang,
Xu Tan,
Yanqing Liu,
Sheng Zhao,
Naoyuki Kanda
Abstract:
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Transformer-based segmentation of adnexal lesions and ovarian implants in CT images
Authors:
Aneesh Rangnekar,
Kevin M. Boehm,
Emily A. Aherne,
Ines Nikolovski,
Natalie Gangai,
Ying Liu,
Dimitry Zamarin,
Kara L. Roche,
Sohrab P. Shah,
Yulia Lakhman,
Harini Veeraraghavan
Abstract:
Two self-supervised pretrained transformer-based segmentation models (SMIT and Swin UNETR) fine-tuned on a dataset of ovarian cancer CT images provided reasonably accurate delineations of the tumors in an independent test dataset. Tumors in the adnexa were segmented more accurately by both transformers (SMIT and Swin UNETR) than the omental implants. AI-assisted labeling performed on 72 out of 245 omental implants resulted in a smaller manual editing effort of 39.55 mm, compared to 106.49 mm for full manual correction of partial labels, and improved overall accuracy. Neither SMIT nor Swin UNETR generated any false detections of omental metastases in the urinary bladder, and both produced relatively few false detections in the small bowel, with 2.16 cc on average for SMIT and 7.37 cc for Swin UNETR.
Submitted 25 June, 2024;
originally announced June 2024.
-
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking
Authors:
Yuwei Zhang,
Tong Xia,
Jing Han,
Yu Wu,
Georgios Rizos,
Yang Liu,
Mohammed Mosuily,
Jagmohan Chauhan,
Cecilia Mascolo
Abstract:
Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets (~136K samples, 440 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health. The system is accessible from https://github.com/evelyn0414/OPERA.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Continuous Aperture Array (CAPA)-Based Wireless Communications: Capacity Characterization
Authors:
Boqun Zhao,
Chongjun Ouyang,
Xingqi Zhang,
Yuanwei Liu
Abstract:
The capacity limits of continuous-aperture array (CAPA)-based wireless communications are characterized. To this end, an analytically tractable transmission framework is established for both uplink and downlink CAPA systems. Based on this framework, closed-form expressions for the single-user channel capacity are derived. The results are further extended to a multiuser case by characterizing the capacity limits of a two-user channel and proposing the associated capacity-achieving decoding and encoding schemes. 1) For the uplink case, the sum-rate capacity and capacity region, as well as the capacity-achieving detectors, are derived. 2) For the downlink case, the uplink-downlink duality is established by deriving the uplink-to-downlink and downlink-to-uplink transformations under the same power constraint, based on which the optimal power allocation policy and the achieved sum-rate capacity and capacity region are characterized. To gain further insights, several case studies are presented by specializing the derived results into various array structures, including the planar CAPA, linear CAPA, and planar spatially discrete array (SPDA). Numerical results are provided to reveal that: i) the channel capacity achieved by CAPAs converges towards a finite upper bound as the aperture size increases; and ii) CAPAs offer significant capacity gains over the conventional SPDAs.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Hybrid Beamforming Design for Near-Field ISAC with Modular XL-MIMO
Authors:
Chunwei Meng,
Dingyou Ma,
Zhaolin Wang,
Yuanwei Liu,
Zhiqing Wei,
Zhiyong Feng
Abstract:
A novel modular extremely large-scale multiple-input-multiple-output (XL-MIMO) integrated sensing and communication (ISAC) framework is proposed in this paper. We consider a downlink ISAC scenario and exploit the modular array architecture to enhance the communication spectral efficiency and sensing resolution while reducing the channel modeling complexity by employing the hybrid spherical and planar wavefront model. Considering the hybrid digital-analog structure inherent to modular arrays, we formulate a joint analog-digital beamforming design problem based on the communication spectral efficiency and sensing signal-to-clutter-plus-noise ratio (SCNR). By exploring the structural similarity of the communication and sensing channels, it is proved that the optimal transmit covariance matrix lies in the subspace spanned by the subarray response vectors, yielding a closed-form solution for the optimal analog beamformer. Consequently, the joint design problem is transformed into a low-dimensional rank-constrained digital beamformer optimization. We first propose a manifold optimization method that directly optimizes the digital beamformer on the rank-constrained Stiefel manifold. Additionally, we develop a semidefinite relaxation (SDR)-based approach that relaxes the rank constraint and employs the randomization technique to obtain a near-optimal solution. Simulation results demonstrate the effectiveness of the proposed modular XL-MIMO ISAC framework and algorithms.
Submitted 18 June, 2024;
originally announced June 2024.
-
IR2QSM: Quantitative Susceptibility Mapping via Deep Neural Networks with Iterative Reverse Concatenations and Recurrent Modules
Authors:
Min Li,
Chen Chen,
Zhuang Xiong,
Ying Liu,
Pengfei Rong,
Shanshan Shan,
Feng Liu,
Hongfu Sun,
Yang Gao
Abstract:
Quantitative susceptibility mapping (QSM) is an MRI phase-based post-processing technique to extract the distribution of tissue susceptibilities, demonstrating significant potential in studying neurological diseases. However, the ill-conditioned nature of dipole inversion makes QSM reconstruction from the tissue field prone to noise and artifacts. In this work, we propose a novel deep learning-based IR2QSM method for QSM reconstruction. It iterates, four times, a U-net enhanced with reverse concatenations and middle recurrent modules, which could dramatically improve the efficiency of latent feature utilization. Simulated and in vivo experiments were conducted to compare IR2QSM with several traditional algorithms (MEDI and iLSQR) and state-of-the-art deep learning methods (U-net, xQSM, and LPCNN). The results indicated that IR2QSM was able to obtain QSM images with significantly increased accuracy and mitigated artifacts over other methods. In particular, IR2QSM demonstrated on average the best NRMSE (27.59%) in simulated experiments, which is 15.48%, 7.86%, 17.24%, 9.26%, and 29.13% lower than iLSQR, MEDI, U-net, xQSM, and LPCNN, respectively, and led to improved QSM results with fewer artifacts for the in vivo data.
Submitted 18 June, 2024;
originally announced June 2024.
-
Balancing Performance and Cost for Two-Hop Cooperative Communications: Stackelberg Game and Distributed Multi-Agent Reinforcement Learning
Authors:
Yuanzhe Geng,
Erwu Liu,
Wei Ni,
Rui Wang,
Yan Liu,
Hao Xu,
Chen Cai,
Abbas Jamalipour
Abstract:
This paper aims to balance performance and cost in a two-hop wireless cooperative communication network where the source and relays have contradictory optimization goals and make decisions in a distributed manner. This differs from most existing works that have typically assumed that source and relay nodes follow a schedule created implicitly by a central controller. We propose that the relays form an alliance in an attempt to maximize the benefit of relaying while the source aims to increase the channel capacity cost-effectively. To this end, we establish the trade problem as a Stackelberg game, and prove the existence of its equilibrium. Another important aspect is that we use multi-agent reinforcement learning (MARL) to approach the equilibrium in a situation where the instantaneous channel state information (CSI) is unavailable, and the source and relays do not have knowledge of each other's goal. A multi-agent deep deterministic policy gradient-based framework is designed, where the relay alliance and the source act as agents. Experiments demonstrate that the proposed method can obtain an acceptable performance that is close to the game-theoretic equilibrium for all players under time-invariant environments, which considerably outperforms its potential alternatives and is only about 2.9% away from the optimal solution.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Near-Field Localization and Sensing with Large-Aperture Arrays: From Signal Modeling to Processing
Authors:
Zhaolin Wang,
Parisa Ramezani,
Yuanwei Liu,
Emil Björnson
Abstract:
The signal processing community is currently witnessing a growing interest in near-field signal processing, driven by the trend towards the use of large aperture arrays with high spatial resolution in the fields of communication, localization, sensing, imaging, etc. From the perspective of localization and sensing, this trend breaks the basic far-field assumptions that have dominated the array signal processing research in the past, presenting new challenges and promising opportunities.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
Authors:
Ruibo Fu,
Shuchen Shi,
Hongming Guo,
Tao Wang,
Chunyu Qiang,
Zhengqi Wen,
Jianhua Tao,
Xin Qi,
Yi Lu,
Xiaopeng Wang,
Zhiyong Wang,
Yukun Liu,
Xuefei Liu,
Shuai Zhang,
Guanjun Li
Abstract:
Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio (TTA) technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing. In addition, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audio Content Planning, Generation, and Alignment (CPGA) framework is proposed, which includes a content planning module leveraging large language models for complex multi-modal prompt comprehension. Additionally, the training process is optimized using Proximal Policy Optimization based reinforcement learning, significantly improving the alignment and auditory realism of generated foley audio. Experimental results demonstrate that our approach significantly advances the field of foley audio dubbing, providing robust solutions for the challenges of multi-modal dubbing. Even when utilizing the relatively lightweight GPT-2 model, our framework outperforms open-source multimodal large models such as LLaVA, DeepSeek-VL, and Moondream2. The dataset is available at https://github.com/borisfrb/MINT.
Submitted 15 June, 2024;
originally announced June 2024.
-
Connected Speech-Based Cognitive Assessment in Chinese and English
Authors:
Saturnino Luz,
Sofia De La Fuente Garcia,
Fasih Haider,
Davida Fromm,
Brian MacWhinney,
Alyssa Lanzi,
Ya-Ning Chang,
Chia-Ju Chou,
Yi-Chien Liu
Abstract:
We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment as well as individuals with normal cognition. These data have been carefully matched by age and sex by propensity score analysis to ensure balance and representativity in model training. The prediction tasks encompass mild cognitive impairment diagnosis and cognitive test score prediction. This framework was designed to encourage the development of approaches to speech-based cognitive assessment which generalise across languages. We illustrate it by presenting baseline prediction models that employ language-agnostic and comparable features for diagnosis and cognitive test score prediction. The models achieved an unweighted average recall of 59.2% in diagnosis and a root mean squared error of 2.89 in score prediction.
Submitted 18 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.