-
Massive MIMO-OTFS-Based Random Access for Cooperative LEO Satellite Constellations
Authors:
Boxiao Shen,
Yongpeng Wu,
Shiqi Gong,
Heng Liu,
Björn Ottersten,
Wenjun Zhang
Abstract:
This paper investigates joint device identification, channel estimation, and symbol detection for cooperative multi-satellite-enhanced random access, where orthogonal time-frequency space modulation with a large antenna array is utilized to combat the dynamics of the terrestrial-satellite links (TSLs). We introduce the generalized complex exponential basis expansion model to parameterize TSLs, thereby reducing the pilot overhead. By exploiting the block sparsity of the TSLs in the angular domain, a message passing algorithm is designed for initial channel estimation. Subsequently, we examine two cooperative modes to leverage the spatial diversity within satellite constellations: the centralized mode, where computations are performed at a high-power central server, and the distributed mode, where computations are offloaded to edge satellites with minimal signaling overhead. Specifically, in the centralized mode, device identification is achieved by aggregating backhaul information from edge satellites, and channel estimation and symbol detection are jointly enhanced through a structured approximate expectation propagation (AEP) algorithm. In the distributed mode, edge satellites share channel information and exchange soft information about data symbols, leading to a distributed version of AEP. The introduced basis expansion model for TSLs enables the efficient implementation of both centralized and distributed algorithms via fast Fourier transform. Simulation results demonstrate that the proposed schemes significantly outperform conventional algorithms in terms of the activity error rate, the normalized mean squared error, and the symbol error rate. Notably, the distributed mode achieves performance comparable to the centralized mode with only two exchanges of soft information about data symbols within the constellation.
Submitted 5 August, 2024;
originally announced August 2024.
-
Crystals with Transformers on Graphs, for Prediction of Unconventional Crystal Material Properties and the Benchmark
Authors:
Hongyi Wang,
Ji Sun,
Jinzhe Liang,
Li Zhai,
Zitian Tang,
Zijian Li,
Wei Zhai,
Xusheng Wang,
Weihao Gao,
Sheng Gong,
Bolong Huang,
Hua Zhang
Abstract:
The ionic bonding across the lattice and ordered microscopic structures endow crystals with unique symmetry and determine their macroscopic properties. Unconventional crystals, in particular, exhibit non-traditional lattice structures or possess exotic physical properties, making them intriguing subjects for investigation. Therefore, to accurately predict the physical and chemical properties of crystals, it is crucial to consider long-range orders. While GNNs excel at capturing the local environment of atoms in crystals, they often face challenges in effectively capturing longer-ranged interactions due to their limited depth. In this paper, we propose CrysToGraph ($\textbf{Crys}$tals with $\textbf{T}$ransformers $\textbf{o}$n $\textbf{Graph}$s), a novel transformer-based geometric graph network designed specifically for unconventional crystalline systems, and UnconvBench, a comprehensive benchmark to evaluate models' predictive performance on unconventional crystal materials such as defected crystals, low-dimension crystals and MOFs. CrysToGraph effectively captures short-range interactions with transformer-based graph convolution blocks as well as long-range interactions with graph-wise transformer blocks. CrysToGraph proves its effectiveness in modelling unconventional crystal materials in multiple tasks, and moreover, it outperforms most existing methods, achieving new state-of-the-art results on the benchmarks of both unconventional crystals and traditional crystals.
Submitted 22 July, 2024;
originally announced July 2024.
-
Fast Iterative Graph Computing with Updated Neighbor States
Authors:
Yijie Zhou,
Shufeng Gong,
Feng Yao,
Hanzhang Chen,
Song Yu,
Pengxi Liu,
Yanfeng Zhang,
Ge Yu,
Jeffrey Xu Yu
Abstract:
Enhancing the efficiency of iterative computation on graphs has garnered considerable attention in both industry and academia. Nonetheless, the majority of efforts focus on expediting iterative computation by minimizing the running time per iteration step, ignoring the optimization of the number of iteration rounds, which is a crucial aspect of iterative computation. We experimentally verified the correlation between the vertex processing order and the number of iterative rounds, thus making it possible to reduce the number of execution rounds for iterative computation. In this paper, we propose a graph reordering method, GoGraph, which can construct a well-formed vertex processing order effectively reducing the number of iteration rounds and, consequently, accelerating iterative computation. Before delving into GoGraph, a metric function is introduced to quantify the efficiency of vertex processing order in accelerating iterative computation. This metric reflects the quality of the processing order by counting the number of edges whose source precedes the destination. GoGraph employs a divide-and-conquer mindset to establish the vertex processing order by maximizing the value of the metric function. Our experimental results show that GoGraph outperforms current state-of-the-art reordering algorithms by 1.83x on average (up to 3.34x) in runtime.
Submitted 16 July, 2024;
originally announced July 2024.
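The abstract above describes the ordering metric only in words: it counts the edges whose source precedes the destination in the processing order. A minimal Python sketch of that count follows. The function name `forward_edge_count` and the edge-list representation are assumptions made for illustration; they are not taken from the GoGraph paper or its code, and the sketch does not implement the reordering algorithm itself.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def forward_edge_count(
    order: Sequence[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> int:
    """Count edges (src, dst) whose source comes before the destination in `order`.

    Hypothetical helper: it mirrors the metric as worded in the abstract,
    not the authors' actual implementation.
    """
    # Map each vertex to its index in the processing order.
    position = {vertex: index for index, vertex in enumerate(order)}
    # An edge counts when its source is processed earlier than its destination.
    return sum(1 for src, dst in edges if position[src] < position[dst])


# A directed 3-cycle: no order can make all three edges point forward.
cycle_edges = [("a", "b"), ("b", "c"), ("c", "a")]
good_order_score = forward_edge_count(["a", "b", "c"], cycle_edges)
bad_order_score = forward_edge_count(["c", "b", "a"], cycle_edges)
```

On this toy cycle the order a, b, c scores higher than c, b, a, which is the kind of comparison a reordering method would use when it tries to maximize the metric.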
-
Matching-Driven Deep Reinforcement Learning for Energy-Efficient Transmission Parameter Allocation in Multi-Gateway LoRa Networks
Authors:
Ziqi Lin,
Xu Zhang,
Shimin Gong,
Lanhua Li,
Zhou Su,
Bo Gu
Abstract:
Long-range (LoRa) communication technology, distinguished by its low power consumption and long communication range, is widely used in the Internet of Things. Nevertheless, the LoRa MAC layer adopts pure ALOHA for medium access control, which may suffer from severe packet collisions as the network scale expands, consequently reducing the system energy efficiency (EE). To address this issue, it is critical to carefully allocate transmission parameters such as the channel (CH), transmission power (TP) and spreading factor (SF) to each end device (ED). Owing to the low duty cycle and sporadic traffic of LoRa networks, evaluating the system EE under various parameter settings proves to be time-consuming. Consequently, we propose an analytical model aimed at calculating the system EE while fully considering the impact of multiple gateways, duty cycling, quasi-orthogonal SFs and capture effects. On this basis, we investigate a joint CH, SF and TP allocation problem, with the objective of optimizing the system EE for uplink transmissions. Due to the NP-hard complexity of the problem, the optimization problem is decomposed into two subproblems: CH assignment and SF/TP assignment. First, a matching-based algorithm is introduced to address the CH assignment subproblem. Then, an attention-based multiagent reinforcement learning technique is employed to address the SF/TP assignment subproblem for EDs allocated to the same CH, which reduces the number of learning agents to achieve fast convergence. The simulation outcomes indicate that the proposed approach converges quickly under various parameter settings and obtains significantly better system EE than baseline algorithms.
Submitted 17 July, 2024;
originally announced July 2024.
-
Surprising Performances of Students with Autism in Classroom with NAO Robot
Authors:
Qin Yang,
Huan Lu,
Dandan Liang,
Shengrong Gong,
Huanghao Feng
Abstract:
Autism is a developmental disorder that manifests in early childhood and persists throughout life, profoundly affecting social behavior and hindering the acquisition of learning and social skills in those diagnosed. As technological advancements progress, an increasing array of technologies is being utilized to support the education of students with Autism Spectrum Disorder (ASD), aiming to improve their educational outcomes and social capabilities. Numerous studies on autism intervention have highlighted the effectiveness of social robots in behavioral treatments. However, research on the integration of social robots into classroom settings for children with autism remains sparse. This paper describes the design and implementation of a group experiment in a collective classroom setting mediated by the NAO robot. The experiment involved special education teachers and the NAO robot collaboratively conducting classroom activities, aiming to foster a dynamic learning environment through interactions among teachers, the robot, and students. Conducted in a special education school, this experiment served as a foundational study in anticipation of extended robot-assisted classroom sessions. Data from the experiment suggest that ASD students in classrooms equipped with the NAO robot exhibited notably better performance compared to those in regular classrooms. The humanoid features and body language of the NAO robot captivated the students' attention, particularly during talent shows and command tasks, where students demonstrated heightened engagement and a decrease in stereotypical repetitive behaviors and irrelevant minor movements commonly observed in regular settings. Our preliminary findings indicate that the NAO robot significantly enhances focus and classroom engagement among students with ASD, potentially improving educational performance and fostering better social behaviors.
Submitted 26 June, 2024;
originally announced July 2024.
-
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
Authors:
Jinghua Hou,
Tong Wang,
Xiaoqing Ye,
Zhe Liu,
Shi Gong,
Xiao Tan,
Errui Ding,
Jingdong Wang,
Xiang Bai
Abstract:
Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection. Despite the success of some existing multi-view 3D detectors utilizing pixel-wise depth supervision, they overlook two significant phenomena: 1) the depth supervision obtained from LiDAR points is usually distributed on the surface of the object, which is poorly suited to existing DETR-based 3D detectors because it lacks the depth of the 3D object center; 2) for distant objects, fine-grained depth estimation of the whole object is more challenging. Therefore, we argue that the object-wise depth (or 3D center of the object) is essential for accurate detection. In this paper, we propose a new multi-view 3D object detector named OPEN, whose main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding. Specifically, we first employ an object-wise depth encoder, which takes the pixel-wise depth map as a prior, to accurately estimate the object-wise depth. Then, we utilize the proposed object-wise position embedding to encode the object-wise depth information into the transformer decoder, thereby producing 3D object-aware features for final detection. Extensive experiments verify the effectiveness of our proposed method. Furthermore, OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
Submitted 15 July, 2024;
originally announced July 2024.
-
Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion
Authors:
Yu Cao,
Shaogang Gong
Abstract:
In the field of Few-Shot Image Generation (FSIG) using Deep Generative Models (DGMs), accurately estimating the distribution of the target domain with minimal samples poses a significant challenge. This requires a method that can capture both the broad diversity and the true characteristics of the target domain distribution. We present Conditional Relaxing Diffusion Inversion (CRDI), an innovative 'training-free' approach designed to enhance distribution diversity in synthetic image generation. Distinct from conventional methods, CRDI does not rely on fine-tuning based on only a few samples. Instead, it focuses on reconstructing each target image instance and expanding diversity through few-shot learning. The approach begins by identifying a Sample-wise Guidance Embedding (SGE) for the diffusion model, which serves a purpose analogous to the explicit latent codes in certain Generative Adversarial Network (GAN) models. Subsequently, the method involves a scheduler that progressively introduces perturbations to the SGE, thereby augmenting diversity. Comprehensive experiments demonstrate that our method surpasses GAN-based reconstruction techniques and equals state-of-the-art (SOTA) FSIG methods in performance. Additionally, it effectively mitigates overfitting and catastrophic forgetting, common drawbacks of fine-tuning approaches.
Submitted 9 July, 2024;
originally announced July 2024.
-
BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space
Authors:
Yumeng Zhang,
Shi Gong,
Kaixin Xiong,
Xiaoqing Ye,
Xiao Tan,
Fan Wang,
Jizhou Huang,
Hua Wu,
Haifeng Wang
Abstract:
World models are receiving increasing attention in autonomous driving for their ability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence diffusion model. The multi-modal tokenizer first encodes multi-modality information and the decoder is able to reconstruct the latent BEV tokens into LiDAR and image observations by ray-casting rendering in a self-supervised manner. Then the latent BEV sequence diffusion model predicts future scenarios given action tokens as conditions. Experiments demonstrate the effectiveness of BEVWorld in autonomous driving tasks, showcasing its capability in generating future scenes and benefiting downstream tasks such as perception and motion prediction. Code will be available at https://github.com/zympsyche/BevWorld.
Submitted 18 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
Authors:
Zixu Cheng,
Yujiang Pu,
Shaogang Gong,
Parisa Kordjamshidi,
Yu Kong
Abstract:
Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.
Submitted 15 July, 2024; v1 submitted 6 July, 2024;
originally announced July 2024.
-
Multi-Time Scale Service Caching and Pricing in MEC Systems with Dynamic Program Popularity
Authors:
Yiming Chen,
Xingyuan Hu,
Bo Gu,
Shimin Gong,
Zhou Su
Abstract:
In mobile edge computing systems, base stations (BSs) equipped with edge servers can provide computing services to users to reduce their task execution time. However, there is always a conflict of interest between the BS and users. The BS prices the service programs based on user demand to maximize its own profit, while the users determine their offloading strategies based on the prices to minimize their costs. Moreover, service programs need to be pre-cached to meet immediate computing needs. Due to the limited caching capacity and variations in service program popularity, the BS must dynamically select which service programs to cache. Since service caching and pricing have different needs for adjustment time granularities, we propose a two-time scale framework to jointly optimize service caching, pricing and task offloading. For the large time scale, we propose a game-nested deep reinforcement learning algorithm to dynamically adjust service caching according to the estimated popularity information. For the small time scale, by modeling the interaction between the BS and users as a two-stage game, we prove the existence of the equilibrium under incomplete information and then derive the optimal pricing and offloading strategies. Extensive simulations based on a real-world dataset demonstrate the efficiency of the proposed approach.
Submitted 4 July, 2024;
originally announced July 2024.
-
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval
Authors:
Weitong Cai,
Jiabo Huang,
Shaogang Gong,
Hailin Jin,
Yang Liu
Abstract:
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent video content in the foreground, with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLMs), in this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensitivity for localization, we obtain text narratives for each video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextually cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.
Submitted 25 June, 2024;
originally announced June 2024.
-
GC-Bench: A Benchmark Framework for Graph Condensation with New Insights
Authors:
Shengbo Gong,
Juntong Ni,
Noveen Sachdeva,
Carl Yang,
Wei Jin
Abstract:
Graph condensation (GC) is an emerging technique designed to learn a significantly smaller graph that retains the essential information of the original graph. This condensed graph has shown promise in accelerating graph neural networks while preserving performance comparable to that achieved with the original, larger graphs. Additionally, this technique facilitates downstream applications such as neural architecture search and enhances our understanding of redundancy in large graphs. Despite the rapid development of GC methods, a systematic evaluation framework remains absent, which is necessary to clarify the critical designs for particular evaluative aspects. Furthermore, several meaningful questions have not been investigated, such as whether GC inherently preserves certain graph properties and offers robustness even without targeted design efforts. In this paper, we introduce GC-Bench, a comprehensive framework to evaluate recent GC methods across multiple dimensions and to generate new insights. Our experimental findings provide deeper insights into the GC process and the characteristics of condensed graphs, guiding future efforts in enhancing performance and exploring new applications. Our code is available at https://github.com/Emory-Melody/GraphSlim/tree/main/benchmark.
Submitted 24 June, 2024;
originally announced June 2024.
-
A Single-Step Non-Autoregressive Automatic Speech Recognition Architecture with High Accuracy and Inference Speed
Authors:
Ziyang Zhuang,
Chenfeng Miao,
Kun Zou,
Shuai Gong,
Ming Fang,
Tao Wei,
Zijian Li,
Wei Hu,
Shaojun Wang,
Jing Xiao
Abstract:
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. To further narrow the gap between the NAR and AR models, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EfficientASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EfficientASR achieves competitive results on the AISHELL-1 and AISHELL-2 benchmarks compared to the state-of-the-art (SOTA) models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the SOTA AR Conformer with about 30x inference speedup.
Submitted 13 June, 2024;
originally announced June 2024.
-
Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels
Authors:
Weitong Cai,
Jiabo Huang,
Shaogang Gong
Abstract:
Video moment retrieval (VMR) is to search for a visual temporal moment in an untrimmed raw video by a given text query description (sentence). Existing studies either start from collecting exhaustive frame-wise annotations on the temporal boundary of target moments (fully-supervised), or learn with only the video-level video-text pairing labels (weakly-supervised). The former is poor in generalisation to unknown concepts and/or novel scenes due to restricted dataset scale and diversity under expensive annotation costs; the latter is subject to visual-textual mis-correlations from incomplete labels. In this work, we introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer through adapting the video-text matching relationships learned from a fully-supervised source domain to a weakly-labelled target domain when they do not share a common label space. Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain. Specifically, we introduce a multiplE branch Video-text Alignment model (EVA) that performs cross-modal (visual-textual) matching information sharing and multi-modal feature alignment to optimise domain-invariant visual and textual features as well as per-task discriminative joint video-text representations. Experiments show EVA's effectiveness in exploring temporal segment annotations in a source domain to help learn video moment retrieval without temporal labels in a target domain.
Submitted 3 June, 2024;
originally announced June 2024.
-
Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer
Authors:
Zengqun Zhao,
Yu Cao,
Shaogang Gong,
Ioannis Patras
Abstract:
Current facial expression recognition (FER) models are often designed in a supervised learning manner and thus are constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often fail to generalize well, performing poorly on unseen images in inference. Vision-language-based zero-shot models demonstrate a promising potential for addressing such challenges. However, these models lack task-specific knowledge and therefore are not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, to enhance zero-shot FER by transferring the task knowledge from large language models (LLMs). Specifically, based on the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot predictions, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, and the text instruction-based strategy is employed to customize the LLM knowledge. Given unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves superior zero-shot results to the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets.
Submitted 18 June, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Can We Enhance the Quality of Mobile Crowdsensing Data Without Ground Truth?
Authors:
Jiajie Li,
Bo Gu,
Shimin Gong,
Zhou Su,
Mohsen Guizani
Abstract:
Mobile crowdsensing (MCS) has emerged as a prominent trend across various domains. However, ensuring the quality of the sensing data submitted by mobile users (MUs) remains a complex and challenging problem. To address this challenge, an advanced method is required to detect low-quality sensing data and identify malicious MUs that may disrupt the normal operations of an MCS system. Therefore, this article proposes a prediction- and reputation-based truth discovery (PRBTD) framework, which can separate low-quality data from high-quality data in sensing tasks. First, we apply a correlation-focused spatial-temporal transformer network to predict the ground truth of the input sensing data. Then, we extract the sensing errors of the data as features based on the prediction results to calculate the implications among the data. Finally, we design a reputation-based truth discovery (TD) module for identifying low-quality data with their implications. Given sensing data submitted by MUs, PRBTD can eliminate the data with heavy noise and identify malicious MUs with high accuracy. Extensive experimental results demonstrate that PRBTD outperforms the existing methods in terms of identification accuracy and data quality enhancement.
Submitted 28 May, 2024;
originally announced May 2024.
-
Facilitating Feature and Topology Lightweighting: An Ethereum Transaction Graph Compression Method for Malicious Account Detection
Authors:
Jiajun Zhou,
Xuanze Chen,
Shengbo Gong,
Chenkai Hu,
Chengxiang Jin,
Shanqing Yu,
Qi Xuan
Abstract:
Ethereum has become one of the primary global platforms for cryptocurrency, playing an important role in promoting the diversification of the financial ecosystem. However, the relative lag in regulation has led to a proliferation of malicious activities in Ethereum, posing a serious threat to fund security. Existing regulatory methods usually detect malicious accounts through feature engineering or large-scale transaction graph mining. However, due to the immense scale of transaction data and malicious attacks, these methods suffer from inefficiency and low robustness during data processing and anomaly detection. In this regard, we propose an Ethereum Transaction Graph Compression method named TGC4Eth, which assists malicious account detection by lightweighting both features and topology of the transaction graph. At the feature level, we select transaction features based on their low importance to improve the robustness of the subsequent detection models against feature evasion attacks; at the topology level, we employ focusing and coarsening processes to compress the structure of the transaction graph, thereby improving both data processing and inference efficiency of detection models. Extensive experiments demonstrate that TGC4Eth significantly improves the computational efficiency of existing detection models while preserving the connectivity of the transaction graph. Furthermore, TGC4Eth enables existing detection models to maintain stable performance and exhibit high robustness against feature evasion attacks.
Submitted 1 July, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
AoI-aware Sensing Scheduling and Trajectory Optimization for Multi-UAV-assisted Wireless Backscatter Networks
Authors:
Yusi Long,
Songhan Zhao,
Shimin Gong,
Bo Gu,
Dusit Niyato,
Xuemin Shen
Abstract:
This paper considers multiple unmanned aerial vehicles (UAVs) to assist sensing data transmissions from the ground users (GUs) to a remote base station (BS). Each UAV collects sensing data from the GUs and then forwards the sensing data to the remote BS. The GUs first backscatter their data to the UAVs and then all UAVs forward data to the BS by the nonorthogonal multiple access (NOMA) transmissions. We formulate a multi-stage stochastic optimization problem to minimize the long-term time-averaged age-of-information (AoI) by jointly optimizing the GUs' access control, the UAVs' beamforming, and trajectory planning strategies. To solve this problem, we first model the dynamics of the GUs' AoI statuses by virtual queueing systems, and then propose the AoI-aware sensing scheduling and trajectory optimization (AoI-STO) algorithm. This allows us to transform the multi-stage AoI minimization problem into a series of per-slot control problems by using the Lyapunov optimization framework. In each time slot, the GUs' access control, the UAVs' beamforming, and mobility control strategies are updated by using the block coordinate descent (BCD) method according to the instant GUs' AoI statuses. Simulation results reveal that the proposed AoI-STO algorithm can reduce the overall AoI by more than 50%. The GUs' scheduling fairness is also improved greatly by adapting the GUs' access control compared with typical baseline schemes.
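As a rough illustration of the per-slot AoI bookkeeping that such a scheduler tracks, the sketch below advances each user's age by one slot and resets it when the user is served. The reset value of 1, the function name `step_aoi`, and the toy schedule are assumptions made for illustration; this is not the AoI-STO algorithm or its virtual-queue model.

```python
def step_aoi(ages, served):
    """Advance each user's age by one slot; a served user's age resets to 1 (assumed)."""
    return [1 if s else a + 1 for a, s in zip(ages, served)]

# Hypothetical three-user schedule over three slots (one user served per slot).
ages = [1, 1, 1]
schedule = [
    [True, False, False],
    [False, True, False],
    [True, False, False],
]
history = []
for served in schedule:
    ages = step_aoi(ages, served)
    history.append(ages)
```

In this toy schedule the third user is never served, so its age grows every slot, which is the kind of unfairness an AoI-aware access policy would be expected to penalize.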
Submitted 30 April, 2024;
originally announced April 2024.
-
BAMBOO: a predictive and transferable machine learning force field framework for liquid electrolyte development
Authors:
Sheng Gong,
Yumin Zhang,
Zhenliang Mu,
Zhichen Pu,
Hongyi Wang,
Zhiao Yu,
Mengyi Chen,
Tianze Zheng,
Zhi Wang,
Lifei Chen,
Xiaojie Wu,
Shaochen Shi,
Weihao Gao,
Wen Yan,
Liang Xiang
Abstract:
Despite the widespread applications of machine learning force field (MLFF) on solids and small molecules, there is a notable gap in applying MLFF to complex liquid electrolytes. In this work, we introduce BAMBOO (ByteDance AI Molecular Simulation Booster), a novel framework for molecular dynamics (MD) simulations, with a demonstration of its capabilities in the context of liquid electrolytes for lithium batteries. We design a physics-inspired graph equivariant transformer architecture as the backbone of BAMBOO to learn from quantum mechanical simulations. Additionally, we pioneer an ensemble knowledge distillation approach and apply it on MLFFs to improve the stability of MD simulations. Finally, we propose the density alignment algorithm to align BAMBOO with experimental measurements. BAMBOO demonstrates state-of-the-art accuracy in predicting key electrolyte properties such as density, viscosity, and ionic conductivity across various solvents and salt combinations. Our current model, trained on more than 15 chemical species, achieves the average density error of 0.01 g/cm$^3$ on various compositions compared with experimental data. Moreover, our model demonstrates transferability to molecules not included in the quantum mechanical dataset. We envision this work as paving the way to a "universal MLFF" capable of simulating properties of common organic liquids.
Submitted 22 April, 2024; v1 submitted 10 April, 2024;
originally announced April 2024.
-
ATFNet: Adaptive Time-Frequency Ensembled Network for Long-term Time Series Forecasting
Authors:
Hengyu Ye,
Jiadong Chen,
Shijin Gong,
Fuxin Jiang,
Tieying Zhang,
Jianjun Chen,
Xiaofeng Gao
Abstract:
The intricate nature of time series data analysis benefits greatly from the distinct advantages offered by time and frequency domain representations. While the time domain is superior in representing local dependencies, particularly in non-periodic series, the frequency domain excels in capturing global dependencies, making it ideal for series with evident periodic patterns. To capitalize on both of these strengths, we propose ATFNet, an innovative framework that combines a time domain module and a frequency domain module to concurrently capture local and global dependencies in time series data. Specifically, we introduce Dominant Harmonic Series Energy Weighting, a novel mechanism for dynamically adjusting the weights between the two modules based on the periodicity of the input time series. In the frequency domain module, we enhance the traditional Discrete Fourier Transform (DFT) with our Extended DFT, designed to address the challenge of discrete frequency misalignment. Additionally, our Complex-valued Spectrum Attention mechanism offers a novel approach to discern the intricate relationships between different frequency combinations. Extensive experiments across multiple real-world datasets demonstrate that our ATFNet framework outperforms current state-of-the-art methods in long-term time series forecasting.
Submitted 8 April, 2024;
originally announced April 2024.
-
Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training
Authors:
Shizhan Gong,
Qi Dou,
Farzan Farnia
Abstract:
Gradient-based saliency maps have been widely used to explain the decisions of deep neural network classifiers. However, standard gradient-based interpretation maps, including the simple gradient and integrated gradient algorithms, often lack desired structures such as sparsity and connectedness in their application to real-world computer vision models. A frequently used approach to inducing sparsity structures into gradient-based saliency maps is to alter the simple gradient scheme using sparsification or norm-based regularization. A drawback with such post-processing methods is their frequently-observed significant loss in fidelity to the original simple gradient map. In this work, we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. We show a duality relation between the regularized norms of the adversarial perturbations and gradient-based maps, based on which we design adversarial training loss functions promoting sparsity and group-sparsity properties in simple gradient maps. We present several numerical results to show the influence of our proposed norm-based adversarial training methods on the standard gradient-based maps of standard neural network architectures on benchmark image datasets.
Submitted 6 April, 2024;
originally announced April 2024.
-
Robust Beamforming Design and Antenna Selection for Dynamic HRIS-aided Massive MIMO Systems
Authors:
Jintao Wang,
Binggui Zhou,
Chengzhi Ma,
Shiqi Gong,
Guanghua Yang,
Shaodan Ma
Abstract:
In this paper, a dynamic hybrid active-passive reconfigurable intelligent surface (HRIS) is proposed to further enhance the massive multiple-input-multiple-output (MIMO) system, since it supports the dynamic placement of active and passive elements. Specifically, considering the impact of the hardware impairments (HWIs), we investigate the channel-aware configuration of the receive antennas at the base station (BS) and the active/passive elements at the HRIS to improve the reliability of the system. To this end, we investigate the average mean-square-error (MSE) minimization problem for the HRIS-aided massive MIMO system by jointly optimizing the BS receive antenna selection matrix, the reflection phase coefficients, the reflection amplitude matrix, and the mode selection matrix of the HRIS under the power budget of the HRIS. To tackle the non-convexity and intractability of this problem, we first transform the binary and discrete variables into continuous ones, and then propose a penalty-based exact block coordinate descent (BCD) algorithm to solve these subproblems alternately. Numerical simulations demonstrate the great superiority of the proposed scheme over the conventional benchmark schemes.
Submitted 31 March, 2024;
originally announced April 2024.
-
Primary Rate Maximization in Movable Antennas Empowered Symbiotic Radio Communications
Authors:
Bin Lyu,
Hao Liu,
Wenqing Hong,
Shimin Gong,
Feng Tian
Abstract:
In this paper, we propose a movable antenna (MA) empowered scheme for symbiotic radio (SR) communication systems. Specifically, multiple antennas at the primary transmitter (PT) can be flexibly moved to favorable locations to boost the channel conditions of the primary and secondary transmissions. The primary transmission is achieved by the active transmission from the PT to the primary user (PU), while the backscatter device (BD) rides on the incident signal from the PT to passively send the secondary signal to the PU. Under this setup, we consider a primary rate maximization problem by jointly optimizing the transmit beamforming and the positions of MAs at the PT under a practical bit error rate constraint on the secondary transmission. Then, an alternating optimization framework that uses successive convex approximation, semidefinite programming, and simulated annealing (SA) modified particle swarm optimization (SA-PSO) methods is proposed to find the solution of the transmit beamforming and MAs' positions. Finally, numerical results are provided to demonstrate the performance improvement provided by the proposed MA empowered scheme and the proposed algorithm.
Submitted 22 March, 2024;
originally announced March 2024.
-
Edge-based Parametric Digital Twins for Intelligent Building Indoor Climate Modeling
Authors:
Zhongjun Ni,
Chi Zhang,
Magnus Karlsson,
Shaofang Gong
Abstract:
Digital transformation in the built environment generates vast data for developing data-driven models to optimize building operations. This study presents an integrated solution utilizing edge computing, digital twins, and deep learning to enhance the understanding of climate in buildings. Parametric digital twins, created using an ontology, ensure consistent data representation across diverse service systems equipped by different buildings. Based on created digital twins and collected data, deep learning methods are employed to develop predictive models for identifying patterns in indoor climate and providing insights. Both the parametric digital twin and deep learning models are deployed on edge for low latency and privacy compliance. As a demonstration, a case study was conducted in a historic building in Östergötland, Sweden, to compare the performance of five deep learning architectures. The results indicate that the time-series dense encoder model exhibited strong competitiveness in performing multi-horizon forecasts of indoor temperature and relative humidity with low computational costs.
Submitted 7 March, 2024;
originally announced March 2024.
-
Training-Free Long-Context Scaling of Large Language Models
Authors:
Chenxin An,
Fei Huang,
Jun Zhang,
Shansan Gong,
Xipeng Qiu,
Chang Zhou,
Lingpeng Kong
Abstract:
The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), as well as integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
Submitted 29 May, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
The Umeyama algorithm for matching correlated Gaussian geometric models in the low-dimensional regime
Authors:
Shuyang Gong,
Zhangsong Li
Abstract:
Motivated by the problem of matching two correlated random geometric graphs, we study the problem of matching two Gaussian geometric models correlated through a latent node permutation. Specifically, given an unknown permutation $π^*$ on $\{1,\ldots,n\}$ and given $n$ i.i.d. pairs of correlated Gaussian vectors $\{X_{π^*(i)},Y_i\}$ in $\mathbb{R}^d$ with noise parameter $σ$, we consider two types of (correlated) weighted complete graphs with edge weights given by $A_{i,j}=\langle X_i,X_j \rangle$, $B_{i,j}=\langle Y_i,Y_j \rangle$. The goal is to recover the hidden vertex correspondence $π^*$ based on the observed matrices $A$ and $B$. For the low-dimensional regime where $d=O(\log n)$, Wang, Wu, Xu, and Yolou [WWXY22+] established the information thresholds for exact and almost exact recovery in matching correlated Gaussian geometric models. They also conducted numerical experiments for the classical Umeyama algorithm. In our work, we prove that this algorithm achieves exact recovery of $π^*$ when the noise parameter $σ=o(d^{-3}n^{-2/d})$, and almost exact recovery when $σ=o(d^{-3}n^{-1/d})$. Our results approach the information thresholds up to a $\operatorname{poly}(d)$ factor in the low-dimensional regime.
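A minimal sketch of a Umeyama-style spectral matching step may help make the setting concrete. It assumes a noise-free toy instance ($σ=0$), uses only the top $d$ eigenvectors of each Gram matrix, and solves the resulting linear assignment with SciPy; the function name `umeyama_match` and all sizes are illustrative assumptions, not the exact procedure analyzed in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def umeyama_match(A, B, d):
    """Estimate perm with B[i, j] ~ A[perm[i], perm[j]] from top-d eigenvectors."""
    # eigh returns eigenvalues in ascending order, so the last d columns are the top d.
    _, U = np.linalg.eigh(A)
    _, V = np.linalg.eigh(B)
    # Absolute values remove the sign ambiguity of each eigenvector.
    U_top = np.abs(U[:, -d:])
    V_top = np.abs(V[:, -d:])
    # score[i, k] measures how well vertex i of B matches vertex k of A.
    score = V_top @ U_top.T
    _, cols = linear_sum_assignment(score, maximize=True)
    return cols

# Toy instance (assumed sizes): Y_i = X_{perm[i]} with no noise.
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))
perm = rng.permutation(n)
Y = X[perm]
A = X @ X.T  # A[i, j] = <X_i, X_j>
B = Y @ Y.T  # B[i, j] = <Y_i, Y_j>
estimate = umeyama_match(A, B, d)
```

Restricting to the top $d$ eigenvectors is a design choice for this sketch: the Gram matrices have rank at most $d$, so the remaining eigenvectors span a degenerate null space and carry no matching information.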
Submitted 22 February, 2024;
originally announced February 2024.
-
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
Authors:
Xueliang Zhao,
Xinting Huang,
Tingchen Fu,
Qintong Li,
Shansan Gong,
Lemao Liu,
Wei Bi,
Lingpeng Kong
Abstract:
Multimodal reasoning stands as a pivotal capability for large vision-language models (LVLMs). The integration with Domain-Specific Languages (DSL), offering precise visual representations, equips these models with the opportunity to execute more accurate reasoning in complex and professional domains. However, the vanilla Chain-of-Thought (CoT) prompting method faces challenges in effectively leveraging the unique strengths of visual and DSL representations, primarily due to their differing reasoning mechanisms. Additionally, it often falls short in addressing critical steps in multi-step reasoning tasks. To mitigate these challenges, we introduce the Bi-Modal Behavioral Alignment (BBA) prompting method, designed to maximize the potential of DSL in augmenting complex multi-modal reasoning tasks. This method begins by guiding LVLMs to create separate reasoning chains for visual and DSL representations. Subsequently, it aligns these chains by addressing any inconsistencies, thus achieving a cohesive integration of behaviors from different modalities. Our experiments demonstrate that BBA substantially improves the performance of GPT-4V(ision) on geometry problem solving ($28.34\% \to 34.22\%$), chess positional advantage prediction ($42.08\% \to 46.99\%$) and molecular property prediction ($77.47\% \to 83.52\%$).
Submitted 21 February, 2024;
originally announced February 2024.
-
Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models
Authors:
Jiacheng Ye,
Shansan Gong,
Liheng Chen,
Lin Zheng,
Jiahui Gao,
Han Shi,
Chuan Wu,
Xin Jiang,
Zhenguo Li,
Wei Bi,
Lingpeng Kong
Abstract:
Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading-off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems, with a small diffusion model outperforming a much larger autoregressive model in both efficiency and accuracy. In addition to that, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.
Submitted 15 July, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
A Comprehensive Survey on Graph Reduction: Sparsification, Coarsening, and Condensation
Authors:
Mohammad Hashemi,
Shengbo Gong,
Juntong Ni,
Wenqi Fan,
B. Aditya Prakash,
Wei Jin
Abstract:
Many real-world datasets can be naturally represented as graphs, spanning a wide range of domains. However, the increasing complexity and size of graph datasets present significant challenges for analysis and computation. In response, graph reduction, or graph summarization, has gained prominence for simplifying large graphs while preserving essential properties. In this survey, we aim to provide a comprehensive understanding of graph reduction methods, including graph sparsification, graph coarsening, and graph condensation. Specifically, we establish a unified definition for these methods and introduce a hierarchical taxonomy to categorize the challenges they address. Our survey then systematically reviews the technical details of these methods and emphasizes their practical applications across diverse scenarios. Furthermore, we outline critical research directions to ensure the continued effectiveness of graph reduction techniques, as well as provide a comprehensive paper list at https://github.com/Emory-Melody/awesome-graph-reduction. We hope this survey will bridge literature gaps and propel the advancement of this promising field.
Submitted 29 June, 2024; v1 submitted 28 January, 2024;
originally announced February 2024.
-
Semantic Entropy Can Simultaneously Benefit Transmission Efficiency and Channel Security of Wireless Semantic Communications
Authors:
Yankai Rong,
Guoshun Nan,
Minwei Zhang,
Sihan Chen,
Songtao Wang,
Xuefei Zhang,
Nan Ma,
Shixun Gong,
Zhaohui Yang,
Qimei Cui,
Xiaofeng Tao,
Tony Q. S. Quek
Abstract:
Recently proliferated deep learning-based semantic communications (DLSC) focus on how transmitted symbols efficiently convey a desired meaning to the destination. However, the sensitivity of neural models and the openness of wireless channels cause the DLSC system to be extremely fragile to various malicious attacks. This inspires us to ask a question: "Can we further exploit the advantages of transmission efficiency in wireless semantic communications while also alleviating its security disadvantages?" Keeping this in mind, we propose SemEntropy, a novel method that answers the above question by exploring the semantics of data for both adaptive transmission and physical layer encryption. Specifically, we first introduce semantic entropy, which indicates the expectation of various semantic scores regarding the transmission goal of the DLSC. Equipped with such semantic entropy, we can dynamically assign informative semantics to Orthogonal Frequency Division Multiplexing (OFDM) subcarriers with better channel conditions in a fine-grained manner. We also use the entropy to guide semantic key generation to safeguard communications over open wireless channels. By doing so, both transmission efficiency and channel security can be simultaneously improved. Extensive experiments over various benchmarks show the effectiveness of the proposed SemEntropy. We discuss the reason why our proposed method benefits secure transmission of DLSC, and also give some interesting findings, e.g., SemEntropy can keep the semantic accuracy at 95% with 60% less transmission.
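The idea of placing more informative semantics on better subcarriers can be sketched as a simple rank-matching step. The per-feature scores, the channel gains, and the helper name `assign_by_rank` below are hypothetical placeholders; the sketch does not reproduce the paper's definition of semantic entropy or its actual allocation rule.

```python
import numpy as np

def assign_by_rank(scores, gains):
    """Map feature i to a subcarrier so that higher scores get higher channel gains."""
    feat_order = np.argsort(-scores, kind="stable")  # features, most informative first
    sub_order = np.argsort(-gains, kind="stable")    # subcarriers, best channel first
    mapping = np.empty(len(scores), dtype=int)
    mapping[feat_order] = sub_order
    return mapping

# Hypothetical importance scores for four semantic features and four subcarrier gains.
scores = np.array([0.2, 0.9, 0.5, 0.1])
gains = np.array([1.5, 0.3, 2.0, 0.8])
mapping = assign_by_rank(scores, gains)  # mapping[i] is the subcarrier index for feature i
```

Here feature 1 (score 0.9) is placed on subcarrier 2 (gain 2.0), and the least informative feature 3 ends up on the weakest subcarrier 1.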
Submitted 6 February, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
A Unified Framework of Multi-Stage Multi-Winner Voting: An Axiomatic Exploration
Authors:
Shengjie Gong,
Lingxiao Huang,
Shuangping Huang,
Yuyi Wang,
Zhiqi Wang,
Tao Xiao,
Xiang Yan,
Chunxue Yang
Abstract:
Multi-winner voting plays a crucial role in selecting representative committees based on voter preferences. Previous research has predominantly focused on single-stage voting rules, which are susceptible to manipulation during preference collection. In order to mitigate manipulation and increase the cost associated with it, we propose the introduction of multiple stages in the voting procedure, leading to the development of a unified framework of multi-stage multi-winner voting rules. To shed light on this framework of voting methods, we conduct an axiomatic study, establishing provable conditions for achieving desired axioms within our model. Our theoretical findings can serve as a guide for the selection of appropriate multi-stage multi-winner voting rules.
Submitted 4 February, 2024;
originally announced February 2024.
-
Exploiting Low-level Representations for Ultra-Fast Road Segmentation
Authors:
Huan Zhou,
Feng Xue,
Yucong Li,
Shi Gong,
Yiqun Li,
Yu Zhou
Abstract:
Achieving both real-time speed and accuracy on embedded platforms has always been the goal of road segmentation methods. To this end, many lightweight networks have been proposed. However, these ignore the fact that roads are "stuff" (background or environmental elements) rather than "things" (specific identifiable objects), which inspires us to explore the feasibility of representing roads with low-level instead of high-level features. Surprisingly, we find that the primary stage of mainstream network models is sufficient to represent most pixels of the road for segmentation. Motivated by this, we propose a Low-level Feature Dominated Road Segmentation network (LFD-RoadSeg). Specifically, LFD-RoadSeg employs a bilateral structure. The spatial detail branch is first designed to extract low-level feature representation for the road by the first stage of ResNet-18. To suppress texture-less regions mistaken as the road in the low-level feature, the context semantic branch is then designed to extract the context feature in a fast manner. To this end, in the second branch, we asymmetrically downsample the input image and design an aggregation module to achieve comparable receptive fields to the third stage of ResNet-18 but with less time consumption. Finally, to segment the road from the low-level feature, a selective fusion module is proposed to calculate pixel-wise attention between the low-level representation and context feature, and suppress the non-road low-level response by this attention. On KITTI-Road, LFD-RoadSeg achieves a maximum F1-measure (MaxF) of 95.21% and an average precision of 93.71%, while reaching 238 FPS on a single TITAN Xp and 54 FPS on a Jetson TX2, all with a compact model size of just 936k parameters. The source code is available at https://github.com/zhouhuan-hust/LFD-RoadSeg.
Submitted 6 February, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval
Authors:
Dezhao Luo,
Shaogang Gong,
Jiabo Huang,
Hailin Jin,
Yang Liu
Abstract:
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Due to the lack of a diverse and generalisable VMR dataset to facilitate learning scalable moment-text associations, existing methods resort to joint training on both source and target domain videos for cross-domain applications. Meanwhile, recent developments in vision-language multimodal models pre-trained on large-scale image-text and/or video-text pairs are only based on coarse associations (weakly labelled). They are inadequate to provide fine-grained moment-text correlations required for cross-domain VMR. In this work, we solve the problem of unseen cross-domain VMR, where certain visual and textual concepts do not overlap across domains, by only utilising target domain sentences (text prompts) without accessing their videos. To that end, we explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences, enabling us to simulate target domain videos. We address two problems in video editing for optimising unseen domain VMR: (1) generation of high-quality simulation videos of different moments with subtle distinctions, (2) selection of simulation videos that complement existing source training videos without introducing harmful noise or unnecessary repetitions. On the first problem, we formulate a two-stage video diffusion generation controlled simultaneously by (1) the original video structure of a source video, (2) subject specifics, and (3) a target sentence prompt. This ensures fine-grained variations between video moments. On the second problem, we introduce a hybrid selection mechanism that combines two quantitative metrics for noise filtering and one qualitative metric for leveraging VMR prediction on simulation video selection.
Submitted 29 January, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Joint Beamforming Optimization and Mode Selection for RDARS-aided MIMO Systems
Authors:
Jintao Wang,
Chengzhi Ma,
Shiqi Gong,
Xi Yang,
Shaodan Ma
Abstract:
Considering the appealing distribution gains of distributed antenna systems (DAS) and passive gains of reconfigurable intelligent surface (RIS), a flexible reconfigurable architecture called reconfigurable distributed antenna and reflecting surface (RDARS) is proposed. RDARS encompasses DAS and RIS as two special cases and maintains the advantages of distributed antennas while reducing the hardware cost by replacing some active antennas with low-cost passive reflecting surfaces. In this paper, we present a RDARS-aided uplink multi-user communication system and investigate the system transmission reliability with the newly proposed architecture. Specifically, in addition to the distribution gain and the reflection gain provided by the connection and reflection modes, respectively, we also consider the dynamic mode switching of each element which introduces an additional degree of freedom (DoF) and thus results in a selection gain. As such, we aim to minimize the total sum mean-square-error (MSE) of all data streams by jointly optimizing the receive beamforming matrix, the reflection phase shifts and the channel-aware placement of elements in the connection mode. To tackle this nonconvex problem with intractable binary and cardinality constraints, we propose an inexact block coordinate descent (BCD) based penalty dual decomposition (PDD) algorithm with the guaranteed convergence. Since the PDD algorithm usually suffers from high computational complexity, a low-complexity greedy-search-based alternating optimization (AO) algorithm is developed to yield a semi-closed-form solution with acceptable performance. Numerical results demonstrate the superiority of the proposed architecture compared to the conventional fully passive RIS or DAS. Furthermore, some insights about the practical implementation of RDARS are provided.
Submitted 20 January, 2024;
originally announced January 2024.
-
Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
Authors:
Shitong Sun,
Fanghua Ye,
Shaogang Gong
Abstract:
Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text. It has recently attracted attention due to the collaboration of information-rich images and concise language to precisely express the requirements of target images. Most current composed image retrieval methods follow a supervised learning approach, training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image. To avoid difficult-to-obtain labeled triplet training data, zero-shot composed image retrieval (ZS-CIR) has been introduced, which aims to retrieve the target image by learning from image-text pairs (self-supervised triplets), without the need for human-labeled triplets. However, this self-supervised triplet learning approach is computationally less effective and less understandable, as it assumes the interaction between image and text is conducted with implicit query embedding without explicit semantic interpretation. In this work, we present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text. This helps improve model learning efficiency to enhance the generalization capacity of foundation models. Further, we introduce a Local Concept Re-ranking (LCR) mechanism to focus on discriminative local information extracted from the modified instructions. Extensive experiments on four ZS-CIR benchmarks show that our method achieves performance comparable to that of state-of-the-art triplet-training-based methods, and significantly outperforms other training-free methods on the open domain datasets (CIRR, CIRCO and COCO), as well as the fashion domain dataset (FashionIQ).
Submitted 24 March, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects
Authors:
Jian Hu,
Jiayi Lin,
Weitong Cai,
Shaogang Gong
Abstract:
Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation effort, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompting is not always feasible, as prompts may not be accessible in real-world applications. Additionally, a manual prompt provides only localization information rather than semantic information, which can intrinsically cause ambiguity in interpreting the targets. In this work, we aim to eliminate the need for manual prompts. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. To that end, we introduce a test-time per-instance adaptation mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts from the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to adapt the visual prompts at test time, we further propose Progressive Mask Generation (PMG) to iteratively reweight the input image, guiding the model to focus on the targets in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches and achieves results comparable to scribble supervision ones, relying solely on general task descriptions as prompts. Our code is available at https://lwpyh.github.io/GenSAM/.
Submitted 18 December, 2023; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Benchmarking Robustness of Text-Image Composed Retrieval
Authors:
Shitong Sun,
Jindong Gu,
Shaogang Gong
Abstract:
Text-image composed retrieval aims to retrieve the target image through the composed query, which is specified in the form of an image plus some text that describes desired modifications to the input image. It has recently attracted attention due to its ability to leverage both information-rich images and concise language to precisely express the requirements for target images. However, the robustness of these approaches against real-world corruptions or further text understanding has never been studied. In this paper, we perform the first robustness study and establish three new diversified benchmarks for systematic analysis of text-image composed retrieval against natural corruptions in both vision and text, and further probe textual understanding. For natural corruption analysis, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C, for testing in the open domain and fashion domain respectively, both of which apply 15 visual corruptions and 7 textual corruptions. For textual understanding analysis, we introduce a new diagnostic dataset, CIRR-D, by expanding the original raw data with synthetic data, which contains modified text to better probe textual understanding ability including numerical variation, attribute variation, object removal, background variation, and fine-grained evaluation. The code and benchmark datasets are available at https://github.com/SunTongtongtong/Benchmark-Robustness-Text-Image-Compose-Retrieval.
Submitted 30 November, 2023; v1 submitted 24 November, 2023;
originally announced November 2023.
-
A Framework on Complex Matrix Derivatives with Special Structure Constraints for Wireless Systems
Authors:
Xin Ju,
Shiqi Gong,
Nan Zhao,
Chengwen Xing,
Arumugam Nallanathan,
Dusit Niyato
Abstract:
Matrix-variate optimization plays a central role in advanced wireless system designs. In this paper, we aim to explore optimal solutions of matrix variables under two special structure constraints using complex matrix derivatives, including diagonal structure constraints and constant modulus constraints, both of which are closely related to the state-of-the-art wireless applications. Specifically, for diagonal structure constraints mostly considered in the uplink multi-user single-input multiple-output (MU-SIMO) system and the amplitude-adjustable intelligent reflecting surface (IRS)-aided multiple-input multiple-output (MIMO) system, the capacity maximization problem, the mean-squared error (MSE) minimization problem and their variants are rigorously investigated. By leveraging complex matrix derivatives, the optimal solutions of these problems are directly obtained in closed forms. Nevertheless, for constant modulus constraints with the intrinsic nature of element-wise decomposability, which are often seen in the hybrid analog-digital MIMO system and the fully-passive IRS-aided MIMO system, we firstly explore inherent structures of the element-wise phase derivatives associated with different optimization problems. Then, we propose a novel alternating optimization (AO) algorithm with the aid of several arbitrary feasible solutions, which avoids the complicated matrix inversion and matrix factorization involved in conventional element-wise iterative algorithms. Numerical simulations reveal that the proposed algorithm can dramatically reduce the computational complexity without loss of system performance.
Submitted 20 November, 2023;
originally announced November 2023.
-
Mitigate Domain Shift by Primary-Auxiliary Objectives Association for Generalizing Person ReID
Authors:
Qilei Li,
Shaogang Gong
Abstract:
While deep learning has significantly improved ReID model accuracy under the independent and identically distributed (IID) assumption, it has also become clear that such models degrade notably when applied to an unseen novel domain due to unpredictable/unknown domain shift. Contemporary domain generalization (DG) ReID models struggle to learn domain-invariant representations solely through training on an instance classification objective. We consider that a deep learning model is heavily influenced by, and therefore biased towards, domain-specific characteristics, e.g., background clutter, scale and viewpoint variations, limiting the generalizability of the learned model, and hypothesize that pedestrians are domain invariant owing to their shared structural characteristics. To enable the ReID model to be less domain-specific from these pure pedestrians, we introduce a method that guides model learning of the primary ReID instance classification objective by a concurrent auxiliary learning objective on weakly labeled pedestrian saliency detection. To solve the problem of conflicting optimization criteria in the model parameter space between the two learning objectives, we introduce a Primary-Auxiliary Objectives Association (PAOA) mechanism to calibrate the loss gradients of the auxiliary task towards the primary learning task gradients. Benefiting from the harmonious multitask learning design, our model can be extended with the recent test-time paradigm to form PAOA+, which performs on-the-fly optimization against the auxiliary objective in order to maximize the model's generative capacity in the test target domain. Experiments demonstrate the superiority of the proposed PAOA model.
Submitted 24 October, 2023;
originally announced October 2023.
-
DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models
Authors:
Shansan Gong,
Mukai Li,
Jiangtao Feng,
Zhiyong Wu,
Lingpeng Kong
Abstract:
Diffusion models have gained prominence in generating high-quality sequences of text. Nevertheless, current approaches predominantly represent discrete text within a continuous diffusion space, which incurs substantial computational overhead during training and results in slower sampling speeds. In this paper, we introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space, thereby enhancing its capacity to recover conditional signals. During the sampling phase, we employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process. Comprehensive experimental evaluations reveal that our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster, rendering it significantly closer to practical application. The code is released at https://github.com/Shark-NLP/DiffuSeq.
Submitted 16 October, 2023; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Multi-triplet Feature Augmentation for Ponzi Scheme Detection in Ethereum
Authors:
Chengxiang Jin,
Jiajun Zhou,
Shengbo Gong,
Chenxuan Xie,
Qi Xuan
Abstract:
Blockchain technology revolutionizes the Internet, but also poses increasing risks, particularly in cryptocurrency finance. On the Ethereum platform, Ponzi schemes, phishing scams, and a variety of other frauds emerge. Existing Ponzi scheme detection approaches based on heterogeneous transaction graph modeling leverage semantic information between node (account) pairs to establish connections, overlooking the semantic attributes inherent to the edges (interactions). To overcome this, we construct heterogeneous Ethereum interaction graphs with multiple triplet interaction patterns to better depict the real Ethereum environment. Based on this, we design a new framework named multi-triplet augmented heterogeneous graph neural network (MAHGNN) for Ponzi scheme detection. We introduce the Conditional Variational Auto Encoder (CVAE) to capture the semantic information of different triplet interaction patterns, which facilitates the characterization of account features. Extensive experiments demonstrate that MAHGNN is capable of addressing the problem of multi-edge interactions in heterogeneous Ethereum interaction graphs and achieving state-of-the-art performance in Ponzi scheme detection.
Submitted 1 October, 2023;
originally announced October 2023.
-
Multiagent Reinforcement Learning with an Attention Mechanism for Improving Energy Efficiency in LoRa Networks
Authors:
Xu Zhang,
Ziqi Lin,
Shimin Gong,
Bo Gu,
Dusit Niyato
Abstract:
Long Range (LoRa) wireless technology, characterized by low power consumption and a long communication range, is regarded as one of the enabling technologies for the Industrial Internet of Things (IIoT). However, as the network scale increases, the energy efficiency (EE) of LoRa networks decreases sharply due to severe packet collisions. To address this issue, it is essential to appropriately assign transmission parameters such as the spreading factor and transmission power for each end device (ED). However, due to the sporadic traffic and low duty cycle of LoRa networks, evaluating the system EE performance under different parameter settings is time-consuming. Therefore, we first formulate an analytical model to calculate the system EE. On this basis, we propose a transmission parameter allocation algorithm based on multiagent reinforcement learning (MALoRa) with the aim of maximizing the system EE of LoRa networks. Notably, MALoRa employs an attention mechanism to guide each ED to better learn how much "attention" should be given to the parameter assignments for relevant EDs when seeking to improve the system EE. Simulation results demonstrate that MALoRa significantly improves the system EE compared with baseline algorithms with an acceptable degradation in packet delivery rate (PDR).
Submitted 16 September, 2023;
originally announced September 2023.
-
Multimodal machine learning for materials science: composition-structure bimodal learning for experimentally measured properties
Authors:
Sheng Gong,
Shuo Wang,
Taishan Zhu,
Yang Shao-Horn,
Jeffrey C. Grossman
Abstract:
The widespread application of multimodal machine learning models like GPT-4 has revolutionized various research fields including computer vision and natural language processing. However, its implementation in materials informatics remains underexplored, despite the presence of materials data across diverse modalities, such as composition and structure. The effectiveness of machine learning models trained on large calculated datasets depends on the accuracy of calculations, while experimental datasets often have limited data availability and incomplete information. This paper introduces a novel approach to multimodal machine learning in materials science via composition-structure bimodal learning. The proposed COmposition-Structure Bimodal Network (COSNet) is designed to enhance learning and predictions of experimentally measured materials properties that have incomplete structure information. Bimodal learning significantly reduces prediction errors across distinct materials properties including Li conductivity in solid electrolyte, band gap, refractive index, dielectric constant, energy, and magnetic moment, surpassing composition-only learning methods. Furthermore, we identified that data augmentation based on modal availability plays a pivotal role in the success of bimodal learning.
Submitted 3 August, 2023;
originally announced September 2023.
-
Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models
Authors:
Dezhao Luo,
Jiabo Huang,
Shaogang Gong,
Hailin Jin,
Yang Liu
Abstract:
Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text data which is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable when only the video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, the vision-language models (VLM) demonstrate a new transfer learning paradigm to benefit different vision tasks through the universal visual-textual correlations derived from large-scale vision-language pairwise web data, which has also shown benefits to VMR by fine-tuning in the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment, without the need for accessing the VMR data. To this end, we devise a conditional feature refinement module to generate boundary-aware visual features conditioned on text queries to enable better moment boundary understanding. Additionally, we design a bottom-up proposal generation strategy that mitigates the impact of domain discrepancies and breaks down complex-query retrieval tasks into individual action retrievals, thereby maximizing the benefits of VLM. Extensive experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm, especially in the novel-word and novel-location out-of-distribution setups.
Submitted 1 September, 2023;
originally announced September 2023.
-
Federated Learning Robust to Byzantine Attacks: Achieving Zero Optimality Gap
Authors:
Shiyuan Zuo,
Rongfei Fan,
Han Hu,
Ning Zhang,
Shimin Gong
Abstract:
In this paper, we propose a robust aggregation method for federated learning (FL) that can effectively tackle malicious Byzantine attacks. At each user, the model parameters are first updated over multiple steps, the number of which is adjustable over iterations, and then pushed to the aggregation center directly. This decreases the number of interactions between the aggregation center and users, allows each user to set its training parameters in a flexible way, and reduces the computation burden compared with existing works that need to combine multiple historical model parameters. At the aggregation center, the geometric median is leveraged to combine the received model parameters from each user. Rigorous proof shows that zero optimality gap is achieved by our proposed method with linear convergence, as long as the fraction of Byzantine attackers is below half. Numerical results verify the effectiveness of our proposed method.
Submitted 20 August, 2023;
originally announced August 2023.
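The geometric-median aggregation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' method: it uses the standard Weiszfeld iteration as a stand-in, and the function name `geometric_median`, the tolerance values, and the toy updates (seven identical honest vectors, three Byzantine ones) are all assumptions made for illustration.

```python
import numpy as np


def geometric_median(points, max_iter=200, tol=1e-8):
    """Approximate the geometric median of the rows of `points`
    with the Weiszfeld iteration (hypothetical helper, illustration only)."""
    points = np.asarray(points, dtype=float)
    estimate = points.mean(axis=0)  # start from the plain average
    for _ in range(max_iter):
        distances = np.linalg.norm(points - estimate, axis=1)
        # Clamp tiny distances so the inverse-distance weights stay finite.
        distances = np.maximum(distances, 1e-12)
        weights = 1.0 / distances
        new_estimate = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate


# Toy model updates: 7 honest users agree, 3 Byzantine users send outliers.
honest = np.tile([1.0, 2.0], (7, 1))
byzantine = np.tile([100.0, -100.0], (3, 1))
updates = np.vstack([honest, byzantine])

aggregate = geometric_median(updates)   # stays near the honest update
mean_aggregate = updates.mean(axis=0)   # pulled far away by the outliers
```

In this toy setting the Byzantine fraction is below one half, so the geometric median stays at the honest update while the plain mean does not, which is the intuition behind using it as a robust aggregator.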
-
Countering Eavesdroppers with Meta-learning-based Cooperative Ambient Backscatter Communications
Authors:
Nam H. Chu,
Nguyen Van Huynh,
Diep N. Nguyen,
Dinh Thai Hoang,
Shimin Gong,
Tao Shu,
Eryk Dutkiewicz,
Khoa T. Phan
Abstract:
This article introduces a novel lightweight framework using ambient backscattering communications to counter eavesdroppers. In particular, our framework divides an original message into two parts: (i) the active-transmit message transmitted by the transmitter using conventional RF signals and (ii) the backscatter message transmitted by an ambient backscatter tag that backscatters upon the active signals emitted by the transmitter. Notably, the backscatter tag does not generate its own signal, making it difficult for an eavesdropper to detect the backscattered signals unless they have prior knowledge of the system. Here, we assume that without decoding/knowing the backscatter message, the eavesdropper is unable to decode the original message. Even in scenarios where the eavesdropper can capture both messages, reconstructing the original message is a complex task without understanding the intricacies of the message-splitting mechanism. A challenge in our proposed framework is to effectively decode the backscattered signals at the receiver, often accomplished using the maximum likelihood (MLK) approach. However, such a method may require a complex mathematical model together with perfect channel state information (CSI). To address this issue, we develop a novel deep meta-learning-based signal detector that can not only effectively decode the weak backscattered signals without requiring perfect CSI but also quickly adapt to a new wireless environment with very little knowledge. Simulation results show that our proposed learning approach, without requiring perfect CSI and complex mathematical model, can achieve a bit error ratio close to that of the MLK-based approach. They also clearly show the efficiency of the proposed approach in dealing with eavesdropping attacks and the lack of training data for deep learning models in practical scenarios.
Submitted 4 August, 2023;
originally announced August 2023.
-
Fake News Detection Through Graph-based Neural Networks: A Survey
Authors:
Shuzhi Gong,
Richard O. Sinnott,
Jianzhong Qi,
Cecile Paris
Abstract:
The popularity of online social networks has enabled rapid dissemination of information. People now can share and consume information much more rapidly than ever before. However, low-quality and/or accidentally/deliberately fake information can also spread rapidly. This can lead to considerable and negative impacts on society. Identifying, labelling and debunking online misinformation as early as possible has become an increasingly urgent problem. Many methods have been proposed to detect fake news including many deep learning and graph-based approaches. In recent years, graph-based methods have yielded strong results, as they can closely model the social context and propagation process of online news. In this paper, we present a systematic review of fake news detection studies based on graph-based and deep learning-based techniques. We classify existing graph-based methods into knowledge-driven methods, propagation-based methods, and heterogeneous social context-based methods, depending on how a graph structure is constructed to model news related information flows. We further discuss the challenges and open problems in graph-based fake news detection and identify future research directions.
Submitted 24 July, 2023;
originally announced July 2023.
-
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Authors:
Chenxin An,
Shansan Gong,
Ming Zhong,
Xingjian Zhao,
Mukai Li,
Jun Zhang,
Lingpeng Kong,
Xipeng Qiu
Abstract:
Recently, there has been growing interest in extending the context length of large language models (LLMs), aiming to effectively process long inputs of one turn or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve the reasoning ability in an extended context, open-source models are still progressing through the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k$\sim$200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally do not correlate well with human judgment, and thus we strongly advocate for length-instruction-enhanced (LIE) evaluation and employing LLM judges. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of more principled evaluation of these models.
Submitted 4 October, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Dual-Functional MIMO Beamforming Optimization for RIS-Aided Integrated Sensing and Communication
Authors:
Xin Zhao,
Heng Liu,
Shiqi Gong,
Xin Ju,
Chengwen Xing,
Nan Zhao
Abstract:
Aiming at providing wireless communication systems with environment-perceptive capacity, emerging integrated sensing and communication (ISAC) technologies face multiple difficulties, especially in balancing the performance trade-off between the communication and radar functions. In this paper, we introduce a reconfigurable intelligent surface (RIS) to assist both data transmission and target detection in a dual-functional ISAC system. To formulate a general optimization framework, diverse communication performance metrics are taken into account, including capacity maximization and mean-squared error (MSE) minimization. The target detection process, in turn, is modeled as a generalized likelihood ratio test (GLRT) due to practical limitations, and the monotonicity of the corresponding detection probability is proved. For the single-user and single-target (SUST) scenario, the minimum transmit power of the ISAC transceiver is revealed. By exploiting the optimality conditions of the base station (BS) design, we validate that the BS is able to realize the maximum power allocation scheme and derive the optimal BS precoder in a semi-closed form. Moreover, an alternating direction method of multipliers (ADMM) based RIS design is proposed to address the optimization of unit-modulus RIS phase shifts. To further enhance computational efficiency, we also develop a low-complexity RIS design based on Riemannian gradient descent. Furthermore, the ISAC transceiver design for the multiple-users and multiple-targets (MUMT) scenario is also investigated, where a zero-forcing (ZF) radar receiver is adopted to cancel the interference. The optimal BS precoder is then derived under the maximum power allocation scheme, and the RIS phase shifts can be optimized by extending the proposed ADMM-based RIS design. Numerical simulation results verify the performance of our proposed transceiver designs.
Submitted 17 July, 2023;
originally announced July 2023.
-
Towards Optimal Neural Networks: the Role of Sample Splitting in Hyperparameter Selection
Authors:
Shijin Gong,
Xinyu Zhang
Abstract:
While artificial neural networks have demonstrated exceptional practical success in a variety of domains, investigations into their theoretical characteristics, such as their approximation power, statistical properties, and generalization performance, have concurrently made significant strides. In this paper, we construct a novel theory for understanding the effectiveness of neural networks, which offers a perspective distinct from prior research. Specifically, we explore the rationale underlying a common practice during the construction of neural network models: sample splitting. Our findings indicate that the optimal hyperparameters derived from sample splitting can yield a neural network model that asymptotically minimizes the prediction risk. We conduct extensive experiments across different application scenarios and network architectures, and the results demonstrate our theory's effectiveness.
Submitted 5 October, 2023; v1 submitted 15 July, 2023;
originally announced July 2023.