Search | arXiv e-print repository

Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks

Authors: Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang

Abstract: Retrieval-augmented Generation (RAG) has markedly enhanced the capabilities of Large Language Models (LLMs) in tackling knowledge-intensive tasks. The increasing demands of application scenarios have driven the evolution of RAG, leading to the integration of advanced retrievers, LLMs and other complementary technologies, which in turn has amplified the intricacy of RAG systems. However, the rapid advancements are outpacing the foundational RAG paradigm, with many methods struggling to be unified under the process of "retrieve-then-generate". In this context, this paper examines the limitations of the existing RAG paradigm and introduces the modular RAG framework. By decomposing complex RAG systems into independent modules and specialized operators, it facilitates a highly reconfigurable framework. Modular RAG transcends the traditional linear architecture, embracing a more advanced design that integrates routing, scheduling, and fusion mechanisms. Drawing on extensive research, this paper further identifies prevalent RAG patterns (linear, conditional, branching, and looping) and offers a comprehensive analysis of their respective implementation nuances. Modular RAG presents innovative opportunities for the conceptualization and deployment of RAG systems. Finally, the paper explores the potential emergence of new operators and paradigms, establishing a solid theoretical foundation and a practical roadmap for the continued evolution and practical deployment of RAG technologies.

Submitted 25 July, 2024; originally announced July 2024.
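
The four flow patterns named in the Modular RAG abstract above (linear, conditional, branching, and looping) can be illustrated with a few composable functions. This is a minimal sketch and not the paper's implementation: the names `linear`, `conditional`, `branching`, `looping`, the dictionary-based `State`, the toy `CORPUS`, and the keyword-matching `retrieve` operator are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a pipeline state is a plain dict, and an
# operator is any function that maps one state to a new state.
State = Dict[str, object]
Operator = Callable[[State], State]


def linear(*ops: Operator) -> Operator:
    """Run operators one after another (retrieve-then-generate style)."""
    def run(state: State) -> State:
        for op in ops:
            state = op(state)
        return state
    return run


def conditional(route: Callable[[State], str], branches: Dict[str, Operator]) -> Operator:
    """Pick exactly one branch based on a routing decision."""
    def run(state: State) -> State:
        return branches[route(state)](state)
    return run


def branching(ops: List[Operator], fuse: Callable[[List[State]], State]) -> Operator:
    """Run several operators on copies of the state, then fuse the results."""
    def run(state: State) -> State:
        return fuse([op(dict(state)) for op in ops])
    return run


def looping(body: Operator, done: Callable[[State], bool], max_iters: int = 3) -> Operator:
    """Repeat an operator until a stop condition holds or a budget runs out."""
    def run(state: State) -> State:
        for _ in range(max_iters):
            state = body(state)
            if done(state):
                break
        return state
    return run


# Toy operators standing in for a real retriever and a real LLM call.
CORPUS = [
    "RAG combines retrieval with generation.",
    "Bergman kernels arise in complex analysis.",
]


def retrieve(state: State) -> State:
    words = str(state["query"]).lower().split()
    docs = [d for d in CORPUS if any(w in d.lower() for w in words)]
    return {**state, "docs": docs}


def generate(state: State) -> State:
    docs = state.get("docs", [])
    return {**state, "answer": f"{len(docs)} doc(s) used"}


# A conditional pattern wrapped in a linear one: retrieve only when asked to.
pipeline = linear(
    conditional(
        lambda s: "rag" if s.get("needs_retrieval") else "direct",
        {"rag": retrieve, "direct": lambda s: {**s, "docs": []}},
    ),
    generate,
)
```

Calling `pipeline({"query": "rag", "needs_retrieval": True})` routes through the retriever before generating; with `needs_retrieval` set to `False` the retrieval step is skipped. Treating every pattern as a function that returns another operator is one possible design choice that lets the patterns nest freely.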

arXiv:2407.19826 [pdf, other]

Design and Control of a Novel Six-Degree-of-Freedom Hybrid Robotic Arm

Authors: Yang Chen, Zhonghua Miao, Yuanyue Ge, Sen lin, Liping Chen, Ya Xiong

Abstract: Robotic arms are key components in fruit-harvesting robots. In agricultural settings, conventional serial or parallel robotic arms often fall short in meeting the demands for a large workspace, rapid movement, enhanced capability of obstacle avoidance and affordability. This study proposes a novel hybrid six-degree-of-freedom (DoF) robotic arm that combines the advantages of parallel and serial mechanisms. Inspired by yoga, we designed two sliders capable of moving independently along a single rail, acting as two feet. These sliders are interconnected with linkages and a meshed-gear set, allowing the parallel mechanism to lower itself and perform a split to pass under obstacles. This unique feature allows the arm to avoid obstacles such as pipes, tables and beams typically found in greenhouses. Integrated with serially mounted joints, the patented hybrid arm is able to maintain the end's pose even when it moves with a mobile platform, facilitating fruit picking with the optimal pose in dynamic conditions. Moreover, the hybrid arm's workspace is substantially larger, being almost three times the volume of UR3 serial arms and fourteen times that of the ABB IRB parallel arms. Experiments show that the repeatability errors are 0.017 mm, 0.03 mm and 0.109 mm for the two sliders and the arm's end, respectively, providing sufficient precision for agricultural robots.

Submitted 29 July, 2024; originally announced July 2024.

Comments: Accepted by IROS 2024

arXiv:2407.19254 [pdf, ps, other]

Convexity of the Bergman Kernels on Convex Domains

Authors: Yuanpu Xiong

Abstract: Let $\Omega$ be a convex domain in $\mathbb{C}^n$ and $\varphi$ a convex function on $\Omega$. We prove that $\log{K_{\Omega,\varphi}(z)}$ is a convex function (might be identically $-\infty$) on $\Omega$, where $K_{\Omega,\varphi}$ is the weighted Bergman kernel. In particular, $\log{K_\Omega(z)}$ is a convex function if $\Omega$ is convex. We further show that $\log{K_\Omega(z)}$ is strictly convex if and only if $\Omega$ does not contain a real line.

Submitted 3 August, 2024; v1 submitted 27 July, 2024; originally announced July 2024.

Comments: 9 pages. Comments welcome!

arXiv:2407.18877 [pdf, other]

Code Structure-Aware through Line-level Semantic Learning for Code Vulnerability Detection

Authors: Ziliang Wang, Ge Li, Jia Li, Yihong Dong, Yingfei Xiong, Zhi Jin

Abstract: Different from the flow semantics of natural languages, programming languages are inherently rigid in structure and grammar. Existing fine-tuning methodologies for code vulnerability detection generally treat code as long text sequences, stripping away structural elements such as newlines ('\n') and whitespace. However, this approach inadvertently results in the loss of crucial structural information, diminishing the distinct characteristics of code and impairing the accuracy of vulnerability detection. To address these challenges, we propose a novel network architecture method based on pre-trained code models, which incorporates structural information awareness. We propose an enhanced code text processing workflow that retains structural elements prior to modeling. This refinement allows the model to retain and exploit line-level structural information and semantic information during the modeling process. Furthermore, we introduce a new network architecture, the Code Structure-Aware Network through Line-level Semantic Learning (CSLS), which integrates three key components: global vulnerability awareness, line-structural awareness, and sensitive-line awareness. Extensive experiments were conducted on vulnerability detection datasets derived from real-world projects. The results demonstrate that our new code pre-processing flow significantly improves existing baselines (e.g., a 3% accuracy improvement on the Devign dataset when applied to popular models such as CoderBert and UniXcoder). The proposed network architecture also demonstrates superior accuracy in detecting vulnerabilities, surpassing newly established benchmarks. These findings underscore the importance of structural information in enhancing the efficacy of code vulnerability detection models.

Submitted 26 July, 2024; originally announced July 2024.
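
The contrast drawn in the abstract above, between flattening code into one long text sequence and retaining its line structure, can be made concrete with a small sketch. This is only an illustration of the general idea, not the CSLS pre-processing workflow itself; the function names `flatten` and `split_lines` and the `SAMPLE` snippet are hypothetical.

```python
import re
from typing import List


def flatten(code: str) -> str:
    """Baseline-style pre-processing: collapse every run of whitespace
    (including newlines) into a single space, losing line structure."""
    return re.sub(r"\s+", " ", code).strip()


def split_lines(code: str) -> List[str]:
    """Structure-preserving alternative: keep one entry per non-blank
    source line, with its leading indentation intact."""
    return [line.rstrip() for line in code.splitlines() if line.strip()]


# Hypothetical C snippet containing an unchecked copy into a fixed buffer.
SAMPLE = (
    "int f(char *s) {\n"
    "    char buf[8];\n"
    "    strcpy(buf, s);\n"
    "    return 0;\n"
    "}\n"
)

flat = flatten(SAMPLE)       # one long string, no line boundaries
lines = split_lines(SAMPLE)  # list of lines, usable for per-line features
```

With the line-level form, a downstream model could attach a representation or a label to each individual line (for example the `strcpy` line), which is not possible once line boundaries have been discarded.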

arXiv:2407.18523 [pdf, other]

DTFormer: A Transformer-Based Method for Discrete-Time Dynamic Graph Representation Learning

Authors: Xi Chen, Yun Xiong, Siwei Zhang, Jiawei Zhang, Yao Zhang, Shiyang Zhou, Xixi Wu, Mingyang Zhang, Tengfei Liu, Weiqiang Wang

Abstract: Discrete-Time Dynamic Graphs (DTDGs), which are prevalent in real-world implementations and notable for their ease of data acquisition, have garnered considerable attention from both academic researchers and industry practitioners. The representation learning of DTDGs has been extensively applied to model the dynamics of temporally changing entities and their evolving connections. Currently, DTDG representation learning predominantly relies on GNN+RNN architectures, which manifest the inherent limitations of both Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs). GNNs suffer from the over-smoothing issue as the model architecture grows deeper, while RNNs struggle to capture long-term dependencies effectively. GNN+RNN architectures also grapple with scaling to large graph sizes and long sequences. Additionally, these methods often compute node representations separately and focus solely on individual node characteristics, thereby overlooking the behavior intersections between the two nodes whose link is being predicted, such as instances where the two nodes appear together in the same context or share common neighbors. This paper introduces a novel representation learning method DTFormer for DTDGs, pivoting from the traditional GNN+RNN framework to a Transformer-based architecture. Our approach exploits the attention mechanism to concurrently process topological information within the graph at each timestamp and temporal dynamics of graphs along the timestamps, circumventing the aforementioned fundamental weaknesses of both GNNs and RNNs. Moreover, we enhance the model's expressive capability by incorporating the intersection relationships among nodes and integrating a multi-patching module. Extensive experiments conducted on six public dynamic graph benchmark datasets confirm our model's efficacy, achieving SOTA performance.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 11 pages, 3 figures

arXiv:2407.17337 [pdf, ps, other]

Raman Spectroscopic Study on Bi2Rh3Se2: Two-dimensional-Ising Charge Density Wave and Quantum Fluctuations

Authors: Fei Jiao, Yonghui Zhou, Shuyang Wang, Chao An, Xuliang Chen, Ying Zhou, Min Zhang, Liang Cao, Xigang Luo, Yimin Xiong, Zhaorong Yang

Abstract: The ternary chalcogenide Bi2Rh3Se2 was found to be a charge density wave (CDW) superconductor with a 2×2 periodicity. The key questions regarding the underlying mechanism of the CDW state and its interplay with lattice and electronic properties remain to be explored. Here, based on systematic Raman scattering investigations on single-crystalline Bi2Rh3Se2, we observed the fingerprint feature of the CDW state, a collective amplitude mode at 39 cm^-1. The temperature evolution of the Raman shift and line width for this amplitude mode can be well described by the critical behavior of the two-dimensional (2D) Ising model, suggesting that the interlayer interactions of Bi2Rh3Se2 are negligible when the CDW state is formed; as a consequence, quantum fluctuations play an important role at low temperature. Moreover, the temperature dependence of the Raman shift for the Ag9 mode deviates significantly from the expected anharmonic behavior when approaching the CDW transition temperature of 240 K, demonstrating that strong electron-phonon coupling plays a key role in the formation of the CDW. Our results reveal that Bi2Rh3Se2 is an intriguing quasi-2D system to explore electronic quantum phase transitions and modulate the correlations between CDW and superconductivity.

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.16308 [pdf, other]

SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging

Authors: Lingtong Kong, Bo Li, Yike Xiong, Hao Zhang, Hong Gu, Jinwei Chen

Abstract: Multi-exposure High Dynamic Range (HDR) imaging is a challenging task when facing truncated texture and complex motion. Existing deep learning-based methods have achieved great success by either following the alignment and fusion pipeline or utilizing attention mechanisms. However, the large computation cost and inference delay hinder their deployment on resource-limited devices. In this paper, to achieve better efficiency, a novel Selective Alignment Fusion Network (SAFNet) for HDR imaging is proposed. After extracting pyramid features, it jointly refines valuable area masks and cross-exposure motion in selected regions with shared decoders, and then fuses a high-quality HDR image in an explicit way. This approach can focus the model on finding valuable regions while estimating their easily detectable and meaningful motion. For further detail enhancement, a lightweight refine module is introduced which enjoys privileges from previous optical flow, selection masks and initial prediction. Moreover, to facilitate learning on samples with large motion, a new window partition cropping method is presented during training. Experiments on public and newly developed challenging datasets show that the proposed SAFNet not only exceeds previous SOTA competitors quantitatively and qualitatively, but also runs an order of magnitude faster. Code and dataset are available at https://github.com/ltkong218/SAFNet.

Submitted 23 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.15530 [pdf, ps, other]

Pulse Shaping for Random ISAC Signals: The Ambiguity Function Between Symbols Matters

Authors: Zihan Liao, Fan Liu, Shuangyang Li, Yifeng Xiong, Weijie Yuan, Christos Masouros, Marco Lops

Abstract: Integrated sensing and communications (ISAC) has emerged as a pivotal enabling technology for next-generation wireless networks. Despite the distinct signal design requirements of sensing and communication (S&C) systems, shifting the symbol-wise pulse shaping (SWiPS) framework from communication-only systems to ISAC poses significant challenges in signal design and processing. This paper addresses these challenges by examining the ambiguity function (AF) of the SWiPS ISAC signal and introducing a novel pulse shaping design for single-carrier ISAC transmission. We formulate optimization problems to minimize the average integrated sidelobe level (ISL) of the AF, as well as the weighted ISL (WISL), while satisfying inter-symbol interference (ISI), out-of-band emission (OOBE), and power constraints. Our contributions include establishing the relationship between the AFs of both the random data symbols and signaling pulses, analyzing the statistical characteristics of the AF, and developing algorithmic frameworks for pulse shaping optimization using successive convex approximation (SCA) and alternating direction method of multipliers (ADMM) approaches. Numerical results are provided to validate our theoretical analysis, which demonstrate significant performance improvements in the proposed SWiPS design compared to the root-raised cosine (RRC) pulse shaping for conventional communication systems.

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.14053 [pdf, other]

DirectL: Efficient Radiance Fields Rendering for 3D Light Field Displays

Authors: Zongyuan Yang, Baolin Liu, Yingde Song, Yongping Xiong, Lan Yi, Zhaohe Zhang, Xunbo Yu

Abstract: Autostereoscopic display, despite decades of development, has not achieved extensive application, primarily due to the daunting challenge of 3D content creation for non-specialists. The emergence of Radiance Field as an innovative 3D representation has markedly revolutionized the domains of 3D reconstruction and generation. This technology greatly simplifies 3D content creation for common users, broadening the applicability of Light Field Displays (LFDs). However, the combination of these two fields remains largely unexplored. The standard paradigm to create optimal content for parallax-based light field displays demands rendering at least 45 slightly shifted views, preferably at high resolution, per frame, a substantial hurdle for real-time rendering. We introduce DirectL, a novel rendering paradigm for Radiance Fields on 3D displays. We thoroughly analyze the interweaved mapping of spatial rays to screen subpixels, precisely determine the light rays entering the human eye, and propose subpixel repurposing to significantly reduce the pixel count required for rendering. Tailored for the two predominant radiance fields, Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS), we propose corresponding optimized rendering pipelines that directly render the light field images instead of multi-view images. Extensive experiments across various displays and a user study demonstrate that DirectL accelerates rendering by up to 40 times compared to the standard paradigm without sacrificing visual quality. Its rendering process-only modification allows seamless integration into subsequent radiance field tasks. Finally, we integrate DirectL into diverse applications, showcasing the stunning visual experiences and the synergy between LFDs and Radiance Fields, which unveils tremendous potential for commercial applications. Project homepage: direct-l.github.io

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.13976 [pdf, other]

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

Authors: Shuo Huang, Shikun Sun, Zixuan Wang, Xiaoyu Qin, Yanmin Xiong, Yuan Zhang, Pengfei Wan, Di Zhang, Jia Jia

Abstract: Recently, text-to-3D generation has attracted significant attention, resulting in notable performance enhancements. Previous methods utilize end-to-end 3D generation models to initialize 3D Gaussians, multi-view diffusion models to enforce multi-view consistency, and text-to-image diffusion models to refine details with score distillation algorithms. However, these methods exhibit two limitations. Firstly, they encounter conflicts in generation directions since different models aim to produce diverse 3D assets. Secondly, the issue of over-saturation in score distillation has not been thoroughly investigated and solved. To address these limitations, we propose PlacidDreamer, a text-to-3D framework that harmonizes initialization, multi-view generation, and text-conditioned generation with a single multi-view diffusion model, while simultaneously employing a novel score distillation algorithm to achieve balanced saturation. To unify the generation direction, we introduce the Latent-Plane module, a training-friendly plug-in extension that enables multi-view diffusion models to provide fast geometry reconstruction for initialization and enhanced multi-view images to personalize the text-to-image diffusion model. To address the over-saturation problem, we propose to view score distillation as a multi-objective optimization problem and introduce the Balanced Score Distillation algorithm, which offers a Pareto Optimal solution that achieves both rich details and balanced saturation. Extensive experiments validate the outstanding capabilities of our PlacidDreamer. The code is available at https://github.com/HansenHuang0823/PlacidDreamer.

Submitted 18 July, 2024; originally announced July 2024.

Comments: Accepted by ACM Multimedia 2024

ACM Class: I.4.0

arXiv:2407.13193 [pdf, other]

Retrieval-Augmented Generation for Natural Language Processing: A Survey

Authors: Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

Abstract: Large language models (LLMs) have demonstrated great success in various fields, benefiting from their huge number of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and a lack of domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge database to augment LLMs, makes up for those drawbacks of LLMs. This paper reviews all significant techniques of RAG, especially in the retriever and the retrieval fusions. In addition, tutorial code is provided for implementing the representative techniques in RAG. This paper further discusses RAG training, including RAG with/without datastore update. Then, we introduce the application of RAG in representative natural language processing tasks and industrial scenarios. Finally, this paper discusses the future directions and challenges of RAG for promoting its development.

Submitted 18 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.13168 [pdf, other]

SciCode: A Research Coding Benchmark Curated by Scientists

Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, et al. (5 additional authors not shown)

Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 25 pages, 9 figures, 7 tables

arXiv:2407.12532 [pdf, other]

Towards Collaborative Intelligence: Propagating Intentions and Reasoning for Multi-Agent Coordination with Large Language Models

Authors: Xihe Qiu, Haoyu Wang, Xiaoyu Tan, Chao Qu, Yujie Xiong, Yuan Cheng, Yinghui Xu, Wei Chu, Yuan Qi

Abstract: Effective collaboration in multi-agent systems requires communicating goals and intentions between agents. Current agent frameworks often suffer from dependencies on single-agent execution and lack robust inter-module communication, frequently leading to suboptimal multi-agent reinforcement learning (MARL) policies and inadequate task coordination. To address these challenges, we present a framework for training large language models (LLMs) as collaborative agents to enable coordinated behaviors in cooperative MARL. Each agent maintains a private intention consisting of its current goal and associated sub-tasks. Agents broadcast their intentions periodically, allowing other agents to infer coordination tasks. A propagation network transforms broadcast intentions into teammate-specific communication messages, sharing relevant goals with designated teammates. The architecture of our framework is structured into planning, grounding, and execution modules. During execution, multiple agents interact in a downstream environment and communicate intentions to enable coordinated behaviors. The grounding module dynamically adapts comprehension strategies based on emerging coordination patterns, while feedback from execution agents influences the planning module, enabling the dynamic re-planning of sub-tasks. Results in collaborative environment simulation demonstrate that intention propagation reduces miscoordination errors by aligning sub-task dependencies between agents. Agents learn when to communicate intentions and which teammates require task details, resulting in emergent coordinated behaviors. This demonstrates the efficacy of intention sharing for cooperative multi-agent RL based on LLMs.

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.12096 [pdf, other]

Skew-scattering Pockels effect and metallic electro-optics in gapped bilayer graphene

Authors: Da Ma, Ying Xiong, Justin C. W. Song

Abstract: We argue that a range of strong metallic electro-optic (EO) effects can be naturally realized from non-Drude dynamics of free carriers in metals. In particular, we identify skew-scattering and a "Snap" (third-order derivative of velocity) as dominating the Pockels and Kerr EO behavior of metals in the clean limit. Strikingly, we find that both Pockels and Kerr EO in metals play critical roles in metallic EO phenomena: for instance, metallic Pockels and Kerr EO effectively compete to produce a field-activated birefringence that is non-reciprocal in applied DC fields. Similarly, both contribute to sizeable field-induced modulations to transmission and reflection across a range of frequencies. We find metallic EO effects can be naturally realized in layered 2D materials such as gapped bilayer graphene, producing pronounced values of EO coefficients in the terahertz, an interesting new metallic platform for terahertz electro-optic modulation.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.10195 [pdf, other]

V2I-Calib: A Novel Calibration Approach for Collaborative Vehicle and Infrastructure LiDAR Systems

Authors: Qianxin Qu, Yijin Xiong, Xin Wu, Hanyu Li, Shichun Guo

Abstract: Cooperative vehicle and infrastructure LiDAR systems hold great potential, yet their implementation faces numerous challenges. Calibration of LiDAR systems across heterogeneous vehicle and infrastructure endpoints is a critical step to ensure the accuracy and consistency of perception system data, necessitating calibration methods that are real-time and stable. To this end, this paper introduces a novel calibration method for cooperative vehicle and road infrastructure LiDAR systems, which exploits spatial association information between detection boxes. The method centers around a novel Overall IoU metric that reflects the correlation of targets between vehicle and infrastructure, enabling real-time monitoring of calibration results. We search for common matching boxes between vehicle and infrastructure nodes by constructing an affinity matrix. Subsequently, these matching boxes undergo extrinsic parameter computation and optimization. Comparative and ablation experiments on the DAIR-V2X dataset confirm the superiority of our method. To better reflect the differences in calibration results, we have categorized the calibration tasks on the DAIR-V2X dataset based on their level of difficulty, enriching the dataset's utility for future research. Our project is available at https://github.com/MassimoQu/v2i-calib.

Submitted 14 July, 2024; originally announced July 2024.

Comments: to be published in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
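
The V2I-Calib abstract above mentions building an affinity matrix between vehicle-side and infrastructure-side detection boxes and searching it for matching pairs. The sketch below illustrates that general idea with ordinary intersection-over-union on axis-aligned 2D boxes and a greedy matcher. It is an assumption-laden simplification: the paper works with LiDAR detection boxes and its own Overall IoU metric, whose definition is not given in the abstract, and the names `iou`, `affinity_matrix`, and `greedy_matches` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in a shared frame.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def affinity_matrix(vehicle: List[Box], infra: List[Box]) -> List[List[float]]:
    """Entry [i][j] scores how well vehicle box i overlaps infrastructure box j."""
    return [[iou(v, f) for f in infra] for v in vehicle]


def greedy_matches(aff: List[List[float]], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Pick the highest-scoring pairs first, using each row and column once."""
    candidates = [
        (score, i, j)
        for i, row in enumerate(aff)
        for j, score in enumerate(row)
        if score >= threshold
    ]
    candidates.sort(key=lambda c: -c[0])
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            matches.append((i, j))
    return sorted(matches)


vehicle_boxes = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 12.0, 12.0)]
infra_boxes = [(10.0, 10.0, 12.0, 12.0), (0.0, 0.0, 2.0, 2.0)]
pairs = greedy_matches(affinity_matrix(vehicle_boxes, infra_boxes))
```

In a real calibration setting the two sets of boxes would first be brought into a common frame using the current extrinsic estimate, and the matched pairs would then feed the extrinsic parameter optimization described in the abstract.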

arXiv:2407.09932 [pdf, other]

Quantum Clock Synchronization Network with Silicon-chip Dual-Pumped Entangled Photon Source

Authors: J. A. Li, H. Han, X. P. Huang, B. Y. Tang, K. Guo, J. Q. Huang, S. Y. Xiong, W. R. Yu, Z. J. Zhang, J. B. Yang, B. Liu, H. Chen, Z. K. Lu

Abstract: In this paper, we propose a quantum clock synchronization (QCS) network scheme with a silicon-chip dual-pumped entangled photon source. This scheme couples two pump beams into the silicon-based waveguide, where degenerate and non-degenerate spontaneous four-wave mixing (SFWM) occurs, generating entanglement between one signal channel and three idler channels. The entangled photons are distributed to remote users through the wavelength division multiplexing strategy to construct an entanglement distribution network, and the round-trip QCS is adopted to realize a QCS network that can serve multiple users. A proof-of-principle QCS network experiment is implemented among the server and multiple users (Alice, Bob, and Charlie) for 11.1 hours, where Alice and Charlie are 10 km away from the server and Bob is 25 km away from the server. The lowest time deviations (TDEV) between the server and each user (Alice, Bob, and Charlie) are 1.57 ps, 0.82 ps and 2.57 ps at averaging times of 8000 s, 8000 s and 800 s, respectively. The results show that the proposed QCS network scheme with a dual-pumped SFWM photon source achieves high accuracy, and the channel resources used by n users are reduced by about 30% compared with other round-trip QCS schemes.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.09816 [pdf, other]

MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

Authors: Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu

Abstract: Scaling the size of a model enhances its capabilities but significantly increases computation complexity. Mixture-of-Experts models (MoE) address the issue by allowing model size to scale up without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, for dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, while fixed routing methods can mitigate that issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.

Submitted 28 July, 2024; v1 submitted 13 July, 2024; originally announced July 2024.

Comments: Work in progress

arXiv:2407.09761 [pdf, other]

Exploring Differences between Two Decades of Mental Health Related Emergency Department Visits by Youth via Recurrent Events Analyses

Authors: Yi Xiong, Joan Hu, Rhonda Rosychuk

Abstract: We aim to develop a tool for understanding how the mental health of youth aged less than 18 years evolves over time through administrative records of mental health related emergency department (MHED) visits in two decades. Administrative health data usually contain rich information for investigating public health issues; however, many restrictions and regulations apply to their use. Moreover, the data are usually not in a conventional format since administrative databases are created and maintained to serve non-research purposes and only information for people who seek health services is accessible. Analysis of administrative health data is thus challenging in general. In the MHED data analyses, we are particularly concerned with (i) evaluating dynamic patterns and impacts with doubly-censored recurrent event data, and (ii) re-calibrating estimators developed based on truncated data by leveraging summary statistics from the population. The findings are verified empirically via simulation. We have established the asymptotic properties of the inference procedures. The contributions of this paper are twofold. We present innovative strategies for processing doubly-censored recurrent event data, and overcoming the truncation induced by the data collection. In addition, through exploring the pediatric MHED visit records, we provide new insights into changes in children's and youths' mental health over time.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.06691 [pdf, other]

OFDM Achieves the Lowest Ranging Sidelobe Under Random ISAC Signaling

Authors: Fan Liu, Ying Zhang, Yifeng Xiong, Shuangyang Li, Weijie Yuan, Feifei Gao, Shi Jin, Giuseppe Caire

Abstract: This paper aims to answer a fundamental question in the area of Integrated Sensing and Communications (ISAC): What is the optimal communication-centric ISAC waveform for ranging? Towards that end, we first established a generic framework to analyze the sensing performance of communication-centric ISAC waveforms built upon orthonormal signaling bases and random data symbols. Then, we evaluated their ranging performance by adopting both the periodic and aperiodic auto-correlation functions (P-ACF and A-ACF), and defined the expectation of the integrated sidelobe level (EISL) as a sensing performance metric. On top of that, we proved that among all communication waveforms with cyclic prefix (CP), the orthogonal frequency division multiplexing (OFDM) modulation is the only globally optimal waveform that achieves the lowest ranging sidelobe for quadrature amplitude modulation (QAM) and phase shift keying (PSK) constellations, in terms of both the EISL and the sidelobe level at each individual lag of the P-ACF. As a step forward, we proved that among all communication waveforms without CP, OFDM is a locally optimal waveform for QAM/PSK in the sense that it achieves a local minimum of the EISL of the A-ACF. Finally, we demonstrated by numerical results that under QAM/PSK constellations, there is no other orthogonal communication-centric waveform that achieves a lower ranging sidelobe level than that of the OFDM, in terms of both P-ACF and A-ACF cases.

Submitted 9 July, 2024; originally announced July 2024.

Comments: 14 pages, 12 figures, submitted to IEEE for possible publication
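
The abstract above measures ranging quality through the periodic auto-correlation function (P-ACF) and the integrated sidelobe level. As a minimal sketch of those two quantities for a single symbol sequence, the Python below uses the textbook definitions: the P-ACF as a circular correlation, and the sidelobe level as the sum of squared magnitudes at all nonzero lags. The function names are hypothetical, and the paper's exact normalization and its expectation over random data symbols (the EISL) are not reproduced here.

```python
def periodic_acf(x):
    """Periodic auto-correlation r[k] = sum_n x[n] * conj(x[(n + k) mod N])."""
    n = len(x)
    return [
        sum(x[i] * x[(i + k) % n].conjugate() for i in range(n))
        for k in range(n)
    ]


def integrated_sidelobe_level(x):
    """Sum of |r[k]|^2 over all nonzero lags k (the mainlobe r[0] is excluded)."""
    r = periodic_acf(x)
    return sum(abs(v) ** 2 for v in r[1:])


if __name__ == "__main__":
    # A single impulse has no sidelobes; a constant sequence has maximal ones.
    print(integrated_sidelobe_level([1, 0, 0, 0]))
    print(integrated_sidelobe_level([1, 1, 1, 1]))
```

An estimate of the expected sidelobe level would average `integrated_sidelobe_level` over many sequences built from randomly drawn QAM or PSK symbols.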

arXiv:2407.06358 [pdf, other]

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Authors: Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

Abstract: Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.04954 [pdf, other]

Extremely Large-Scale Dynamic Metasurface Antennas (XL-DMAs): Near-Field Modeling and Channel Estimation

Authors: Songjie Yang, Wanting Lyu, Boyu Ning, Yue Xiu, Youzhi Xiong, Hua Chen, Chadi Assi, Chau Yuen

Abstract: Dynamic metasurface antennas (DMAs) represent a novel transceiver array architecture for extremely large-scale (XL) communications, offering the advantages of reduced power consumption and lower hardware costs compared to conventional arrays. This paper focuses on near-field channel estimation for XL-DMAs. We begin by analyzing the near-field characteristics of uniform planar arrays (UPAs) and introducing the Oblong Approx. model. This model decouples elevation-azimuth (EL-AZ) parameters for XL-DMAs, providing an effective means to characterize the near-field effect. It offers simpler mathematical expressions than the second-order Taylor expansion model, all while maintaining negligible model errors for oblong-shaped arrays. Building on the Oblong Approx. model, we propose an EL-AZ-decoupled estimation framework that involves near- and far-field parameter estimation for AZ/EL and EL/AZ directions, respectively. The former is formulated as a distributed compressive sensing problem, addressed using the proposed off-grid distributed orthogonal least squares algorithm, while the latter involves a straightforward parallelizable search. Crucially, we illustrate the viability of decoupled EL-AZ estimation for near-field UPAs, exhibiting commendable performance and linear complexity correlated with the number of metasurface elements. Moreover, we design a measurement matrix optimization method with the Lorentzian constraint on DMAs and highlight the estimation performance degradation resulting from this constraint.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2407.01296 [pdf, other]

Non-Hermitian skin effect in arbitrary dimensions: non-Bloch band theory and classification

Authors: Yuncheng Xiong, Ze-Yu Xing, Haiping Hu

Abstract: Non-Hermitian skin effect (NHSE) is a distinctive phenomenon in non-Hermitian systems, characterized by a significant accumulation of eigenstates at system boundaries. While well-understood in one dimension via non-Bloch band theory, unraveling the NHSE in higher dimensions faces formidable challenges due to the diversity of open boundary conditions or lattice geometries and inevitable numerical errors. Key issues, including higher-dimensional non-Bloch band theory, geometric dependency, spectral convergence and stability, and a complete classification of NHSE, remain elusive. In this work, we address these challenges by presenting a geometry-adaptive non-Bloch band theory in arbitrary dimensions, through the lens of spectral potential. Our formulation accurately determines the energy spectra, density of states, and generalized Brillouin zone for a given geometry in the thermodynamic limit (TDL), revealing their geometric dependencies. Furthermore, we systematically classify the NHSE into critical and non-reciprocal types using net winding numbers. In the critical case, we identify novel scale-free skin modes residing on the boundary. In the nonreciprocal case, the skin modes manifest in various forms, including normal or anomalous corner modes, boundary modes or scale-free modes. We reveal the non-convergence and instability of the non-Bloch spectra in the presence of scale-free modes and attribute it to the non-exchangeability of the zero-perturbation limit and the TDL. The instability drives the energy spectra towards the Amoeba spectra in the critical case. Our findings provide a unified non-Bloch band theory governing the energy spectra, density of states, and generalized Brillouin zone in the TDL, offering a comprehensive understanding of NHSE in arbitrary dimensions.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 24 pages, 17 figures

arXiv:2406.19444 [pdf, other]

Kramers Nonlinearity in PT Symmetric Magnets

Authors: Oles Matsyshyn, Ying Xiong, Justin C. W. Song

Abstract: Kramers degeneracies play an essential role in the spectrum of electronic materials. Here we argue that beyond spectral properties, Kramers degeneracy plays a critical role in the nonlinear response of PT symmetric magnets. In particular, we uncover a class of second-order Kramers nonlinearities that only arise in the presence of Kramers degeneracy, vanishing in non-degenerate PT symmetric materials. Kramers nonlinearities depend on a circular dichroism between PT related Kramers states and make it possible to trace out the quantum geometry of the degenerate band structure. We find pronounced Kramers nonlinearities in the nonlinear polarization responses of even layer antiferromagnetic MnBi$_2$Te$_4$ that make it possible to identify its antiferromagnetic order. This provides novel means for diagnosing Kramers pairs and addressing the internal Kramers degree of freedom.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 9 pages, 2 figures

arXiv:2406.18485 [pdf, other]

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Authors: Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

Abstract: Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.12187 [pdf, other]

Diverse Responses in Lattice Thermal Conductivity of $n$-type/$p$-type Semiconductors Driven by Asymmetric Electron-Phonon Interactions

Authors: Jianshi Sun, Shouhang Li, Zhen Tong, Cheng Shao, Han Xie, Meng An, Chuang Zhang, Xiongfei Zhu, Chen Huang, Yucheng Xiong, Xiangjun Liu

Abstract: Accurately assessing the impact of electron-phonon interaction (EPI) on the lattice thermal conductivity of semiconductors is crucial for the thermal management of electronic devices and a unified physical understanding of this issue is highly desired. In this work, we predict the lattice thermal conductivities of typical direct and indirect bandgap semiconductors accounting for EPI based on mode-level first-principles calculations. It is found that EPI has a larger effect on the lattice thermal conductivity of $p$-type doping compared to $n$-type doping in the same semiconductor at high charge carrier concentrations. The stronger EPI in $p$-type doping is attributed to the relatively higher electron density of states caused by the relatively larger $p$-orbital component. Furthermore, EPI has a stronger influence on the lattice thermal conductivity of $n$-type indirect bandgap semiconductors than $n$-type direct bandgap semiconductors. This is attributed to the relatively lower electron density of states in direct bandgap semiconductors stemming from the $s$-orbital component. This work reveals that there exist diverse responses in lattice thermal conductivity of $n$-type/$p$-type semiconductors, which can be attributed to asymmetric EPIs.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 8 pages, 5 figures

arXiv:2406.11891 [pdf, other]

Towards Adaptive Neighborhood for Advancing Temporal Interaction Graph Modeling

Authors: Siwei Zhang, Xi Chen, Yun Xiong, Xixi Wu, Yao Zhang, Yongrui Fu, Yinglong Zhao, Jiawei Zhang

Abstract: Temporal Graph Networks (TGNs) have demonstrated their remarkable performance in modeling temporal interaction graphs. These works can generate temporal node representations by encoding the surrounding neighborhoods for the target node. However, an inherent limitation of existing TGNs is their reliance on fixed, hand-crafted rules for neighborhood encoding, overlooking the necessity for an adaptive and learnable neighborhood that can accommodate both personalization and temporal evolution across different timestamps. In this paper, we aim to enhance existing TGNs by introducing an adaptive neighborhood encoding mechanism. We present SEAN, a flexible plug-and-play model that can be seamlessly integrated with existing TGNs, effectively boosting their performance. To achieve this, we decompose the adaptive neighborhood encoding process into two phases: (i) representative neighbor selection, and (ii) temporal-aware neighborhood information aggregation. Specifically, we propose the Representative Neighbor Selector component, which automatically pinpoints the most important neighbors for the target node. It offers a tailored understanding of each node's unique surrounding context, facilitating personalization. Subsequently, we propose a Temporal-aware Aggregator, which synthesizes neighborhood aggregation by selectively determining the utilization of aggregation routes and decaying the outdated information, allowing our model to adaptively leverage both the contextually significant and current information during aggregation. We conduct extensive experiments by integrating SEAN into three representative TGNs, evaluating their performance on four public datasets and one financial benchmark dataset introduced in this paper. The results demonstrate that SEAN consistently leads to performance improvements across all models, achieving SOTA performance and exceptional robustness.

Submitted 14 June, 2024; originally announced June 2024.

Comments: KDD'2024 Research Track Paper

arXiv:2406.11836 [pdf, other]

RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians

Authors: Bingling Li, Shengyi Chen, Luchao Wang, Kaimin Liao, Sijie Yan, Yuanjun Xiong

Abstract: In this work, we explore the possibility of training high-parameter 3D Gaussian splatting (3DGS) models on large-scale, high-resolution datasets. We design a general model parallel training method for 3DGS, named RetinaGS, which uses a proper rendering equation and can be applied to any scene and arbitrary distribution of Gaussian primitives. It enables us to explore the scaling behavior of 3DGS in terms of primitive numbers and training resolutions that were difficult to explore before and surpass previous state-of-the-art reconstruction quality. We observe a clear positive trend of increasing visual quality when increasing primitive numbers with our method. We also demonstrate the first attempt at training a 3DGS model with more than one billion primitives on the full MatrixCity dataset that attains a promising visual quality.

Submitted 22 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11833 [pdf, other]

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to find the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.

Submitted 17 June, 2024; originally announced June 2024.

Comments: This project is available at https://github.com/Liuziyu77/MMDU

arXiv:2406.08717 [pdf, other]

Comparison of superconducting pairing in doped cuprates and nickelates within an extended Hubbard model

Authors: Yicheng Xiong, Hang Ma, Hongxing Liu, Runyu Ma, Tianxing Ma

Abstract: Within an extended Hubbard model, we investigate the superconducting pairing behavior of infinite-layer nickelate $\mathrm{NdNiO_2}$ and cuprate superconductors by using the determinant quantum Monte Carlo method. Our focus is on comparing their dominant pairing symmetries. The results indicate that the $d_{x^2-y^2}$ pairing interaction is significantly enhanced at low temperatures in both doped nickelates and cuprates, while other typical pairing symmetries are effectively suppressed, highlighting the dominance of the $d_{x^2-y^2}$ pairing form. Additionally, we find that the effective pairing interaction for $d_{x^2-y^2}$ pairing in doped nickelates is slightly lower than that in doped cuprates, which may be attributed to the different degrees of Fermi surface warping caused by the third nearest hopping $t''$. Further studies show that the hole doping and interaction strength have significant effects on the $d_{x^2-y^2}$ pairing interaction within the selected parameter range. The $d_{x^2-y^2}$ pairing interaction is notably weakened when the hole doping increases, while it is significantly enhanced with increasing Coulomb interaction strength $U$. This comparative analysis reveals the similarities and differences in the pairing behaviors of doped nickelates and cuprates, which may provide further insights into understanding the superconducting properties of these two classes of materials.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 6 pages and 8 figures

arXiv:2406.07548 [pdf, other]

Image and Video Tokenization with Binary Spherical Quantization

Authors: Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

Abstract: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Tech report
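
The abstract above describes BSQ as a projection onto a lower-dimensional hypersphere followed by binary quantization. A minimal Python sketch of that second stage follows, under stated assumptions: the input `v` is taken to be the already-projected low-dimensional vector, each coordinate is quantized by its sign, and the result is rescaled by 1/sqrt(L) so that it stays on the unit hypersphere. The function name, the sign rule, the scaling, and the bit-packing order are illustrative choices, not details confirmed by the abstract.

```python
import math


def bsq_quantize(v):
    """Normalize v onto the unit hypersphere, then binarize each coordinate.

    Returns (q, index): q is the quantized unit-norm vector with entries
    +/- 1/sqrt(L), and index packs the sign bits into one integer token id
    (bit i is set when coordinate i is positive; an assumed convention).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a hypersphere")
    u = [x / norm for x in v]              # point on the unit hypersphere
    scale = 1.0 / math.sqrt(len(u))        # keeps the quantized vector unit-norm
    bits = [1 if x > 0 else 0 for x in u]  # binary quantization by sign
    q = [scale if b else -scale for b in bits]
    index = sum(b << i for i, b in enumerate(bits))
    return q, index


if __name__ == "__main__":
    q, index = bsq_quantize([3.0, -4.0, 0.5])
    print(q, index)
```

Because every code is determined by its L sign bits, the 2^L possible tokens need no stored codebook, which is consistent with the abstract's "parameter-efficient without an explicit codebook" claim.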

arXiv:2406.06609 [pdf, other]

Mitigating Bias in Dataset Distillation

Authors: Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh

Abstract: Dataset Distillation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias inside the original dataset on the performance of dataset distillation. With a comprehensive empirical evaluation on canonical datasets with color, corruption and background biases, we found that color and background biases in the original dataset will be amplified through the distillation process, resulting in a notable decline in the performance of models trained on the distilled dataset, while corruption bias is suppressed through the distillation process. To reduce bias amplification in dataset distillation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% from vanilla DM, boosting the performance by 67.7 percentage points, whereas applying a state-of-the-art debiasing method on the same dataset only achieves 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset distillation and provide a promising avenue to address bias amplification in the process.

Submitted 10 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: ICML

arXiv:2406.06069 [pdf, other]

PointABM:Integrating Bidirectional State Space Model with Multi-Head Self-Attention for Point Cloud Analysis

Authors: Jia-wei Chen, Yu-jie Xiong, Yong-bin Gao

Abstract: Mamba, based on state space model (SSM) with its linear complexity and great success in classification provide its superiority in 3D point cloud analysis. Prior to that, Transformer has emerged as one of the most prominent and successful architectures for point cloud analysis. We present PointABM, a hybrid model that integrates the Mamba and Transformer architectures for enhancing local feature to improve performance of 3D point cloud analysis. In order to enhance the extraction of global features, we introduce a bidirectional SSM (bi-SSM) framework, which comprises both a traditional token forward SSM and an innovative backward SSM. To enhance the bi-SSM's capability of capturing more comprehensive features without disrupting the sequence relationships required by the bidirectional Mamba, we introduce Transformer, utilizing its self-attention mechanism to process point clouds. Extensive experimental results demonstrate that integrating Mamba with Transformer significantly enhance the model's capability to analysis 3D point cloud.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.05940 [pdf, other]

M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection

Authors: Ziliang Wang, Ge Li, Jia Li, Yingfei Xiong, Jia Li, Meng Yan, Zhi Jin

Abstract: Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization; conversely, code models such CodeBERT are easy to fine-tune, but it is often difficult to learn vulnerability semantics from complex code languages. To address these challenges, this paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) that leverages the strong capability of analyzing vulnerability semantics from LLMs to improve the detection accuracy of code models. M2CVD employs a novel collaborative process: first enhancing the quality of vulnerability semantic description produced by LLMs through the understanding of project code by code models, and then using these improved vulnerability semantic description to boost the detection accuracy of code models. We demonstrated M2CVD's effectiveness on two real-world datasets, where M2CVD significantly outperformed the baseline. In addition, we demonstrate that the M2CVD collaborative method can extend to other different LLMs and code models to improve their accuracy in vulnerability detection tasks.

Submitted 19 July, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.04292 [pdf, other]

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Authors: Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong

Abstract: Multi-modal retrieval becomes increasingly popular in practice. However, the existing retrievers are mostly text-oriented, which lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing the text-only and image-only data. In this work, we present a new embedding model VISTA for universal multi-modal retrieval. Our work brings forth threefold technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with the image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which bring high-quality composed image-text to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performances across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 main conference

arXiv:2406.04264 [pdf, other]

MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu

Abstract: The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.

Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.02874 [pdf, other]

Giant enhancement of hole mobility for 4H-silicon carbide through suppressing interband electron-phonon scattering

Authors: Jianshi Sun, Shouhang Li, Zhen Tong, Cheng Shao, Meng An, Xiongfei Zhu, Chuang Zhang, Xiangchuan Chen, Yucheng Xiong, Thomas Frauenheim, Xiangjun Liu

Abstract: 4H-Silicon Carbide (4H-SiC) possesses a high Baliga figure of merit, making it a promising material for power electronics. However, its applications are limited by its low hole mobility. Herein, we found that the hole mobility of 4H-SiC is mainly limited by the strong interband electron-phonon scattering using mode-level first-principles calculations. Our research indicates that applying compressive strain can reverse the sign of crystal-field splitting and change the ordering of electron bands close to the valence band maximum. Therefore, the interband electron-phonon scattering is severely suppressed, and the out-of-plane hole mobility of 4H-SiC can be enhanced by 200% with 2% uniaxial compressive strain applied. This work provides new insights into the electron transport mechanisms in semiconductors and suggests a strategy to improve hole mobility that could be applied to other semiconductors with hexagonal crystalline geometries.

Submitted 20 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: 22 pages, 4 figures

arXiv:2406.00093 [pdf, other]

Bootstrap3D: Improving 3D Content Creation with Synthetic Data

Authors: Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Abstract: Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and maintained view consistency.

Submitted 31 May, 2024; originally announced June 2024.

Comments: Project Page: https://sunzey.github.io/Bootstrap3D/

arXiv:2405.19731 [pdf, other]

Some New Approaches to MPI Implementations

Authors: Yuqing Xiong

Abstract: This paper provides some new approaches to MPI implementations to improve MPI performance. These approaches include dynamically composable libraries, reducing average layer numbers of MPI libraries, and a single entity of MPI-network, MPI-protocol, and MPI.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19487 [pdf, other]

A Full-duplex Speech Dialogue Scheme Based On Large Language Models

Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Yuanjun Xiong, Wei Xia

Abstract: We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate simultaneously, allowing the system to simultaneously speak and listen to the user. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than 3 folds compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running a LLM with only 8 billion parameters, our system exhibits a 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.19119 [pdf, other]

Can Graph Learning Improve Task Planning?

Authors: Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, Dongsheng Li

Abstract: Task planning is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, task planning is a decision-making problem that involves selecting a connected path or subgraph within the corresponding graph and invoking it. In this paper, we explore graph learning-based methods for task planning, a direction that is orthogonal to the prevalent focus on prompt design. Our interest in graph learning stems from a theoretical discovery: the biases of attention and auto-regressive loss impede LLMs' ability to effectively navigate decision-making on graphs, which is adeptly addressed by graph neural networks (GNNs). This theoretical insight led us to integrate GNNs with LLMs to enhance overall performance. Extensive experiments demonstrate that GNN-based methods surpass existing solutions even without training, and minimal training can further enhance their performance. Additionally, our approach complements prompt engineering and fine-tuning techniques, with performance further enhanced by improved prompts or a fine-tuned model.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.17463 [pdf, other]

No Algorithmic Collusion in Two-Player Blindfolded Game with Thompson Sampling

Authors: Ningyuan Chen, Xuefeng Gao, Yi Xiong

Abstract: When two players are engaged in a repeated game with unknown payoff matrices, they may be completely unaware of the existence of each other and use multi-armed bandit algorithms to choose the actions, which is referred to as the ``blindfolded game'' in this paper. We show that when the players use Thompson sampling, the game dynamics converges to the Nash equilibrium under a mild assumption on the payoff matrices. Therefore, algorithmic collusion doesn't arise in this case despite the fact that the players do not intentionally deploy competitive strategies. To prove the convergence result, we find that the framework developed in stochastic approximation doesn't apply, because of the sporadic and infrequent updates of the inferior actions and the lack of Lipschitz continuity. We develop a novel sample-path-wise approach to show the convergence.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.17247 [pdf, other]

An Introduction to Vision-Language Modeling

Authors: Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie , et al. (16 additional authors not shown)

Abstract: Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16127 [pdf, other]

Finetuning Large Language Model for Personalized Ranking

Authors: Zhuoxi Bai, Ning Wu, Fengyu Cai, Xinyi Zhu, Yun Xiong

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this study, we introduce Direct Multi-Preference Optimization (DMPO), a streamlined framework designed to bridge the gap and enhance the alignment of LLMs for recommendation tasks. DMPO enhances the performance of LLM-based recommenders by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples. We conducted experimental evaluations to compare DMPO against traditional recommendation methods and other LLM-based recommendation approaches. The results demonstrate that DMPO significantly improves the recommendation capabilities of LLMs across three real-world public datasets in few-shot scenarios. Additionally, the experiments indicate that DMPO exhibits superior generalization ability in cross-domain recommendations. A case study elucidates the reasons behind these consistent improvements and also underscores DMPO's potential as an explainable recommendation system.

Submitted 20 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.15198 [pdf, other]

RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference

Authors: Lianming Huang, Shangyu Wu, Yufei Cui, Ying Xiong, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

Abstract: Deploying large language model inference remains challenging due to their high computational overhead. Early exiting accelerates model inference by adaptively reducing the number of inference layers. Existing methods require training internal classifiers to determine whether to exit at each intermediate layer. However, such classifier-based early exiting frameworks require significant effort to design and train the classifiers. To address these limitations, this paper proposes RAEE, a training-free Retrieval-Augmented Early Exiting framework for efficient inference. First, this paper demonstrates that the early exiting problem can be modeled as a distribution prediction problem, where the distribution is approximated using similar data's existing information. Next, the paper details the process of collecting existing information to build the retrieval database. Finally, based on the pre-built retrieval database, RAEE leverages the retrieved similar data's exiting information to guide the backbone model to exit at the layer, which is predicted by the approximated distribution. Experimental results demonstrate that the proposed RAEE can significantly accelerate inference. RAEE also achieves state-of-the-art zero-shot performance on 8 classification tasks.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.11914 [pdf, other]

PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images

Authors: Yiheng Xiong, Angela Dai

Abstract: Generating 3D shapes from single RGB images is essential in various applications such as robotics. Current approaches typically target images containing clear and complete visual descriptions of the object, without considering common realistic cases where observations of objects that are largely occluded or truncated. We thus propose a transformer-based autoregressive model to generate the probabilistic distribution of 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. To handle realistic scenarios such as occlusion or field-of-view truncation, we create simulated image-to-shape training pairs that enable improved fine-tuning for real-world scenarios. We then adopt cross-attention to effectively identify the most relevant region of interest from the input image for shape generation. This enables inference of sampled shapes with reasonable diversity and strong alignment with the input image. We train and test our model on our synthetic data then fine-tune and test it on real-world data. Experiments demonstrate that our model outperforms state of the art in both scenarios.

Submitted 6 August, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

Comments: 10 pages, 6 figures. Accepted to BMVC 2024

arXiv:2405.11535 [pdf, ps, other]

Proving Functional Program Equivalence via Directed Lemma Synthesis

Authors: Yican Sun, Ruyi Ji, Jian Fang, Xuanlin Jiang, Mingshuai Chen, Yingfei Xiong

Abstract: Proving equivalence between functional programs is a fundamental problem in program verification, which often amounts to reasoning about algebraic data types (ADTs) and compositions of structural recursions. Modern theorem provers address this problem by applying structural induction, which is insufficient for proving many equivalence theorems. In such cases, one has to invent a set of lemmas, prove these lemmas by additional induction, and use these lemmas to prove the original theorem. There is, however, a lack of systematic understanding of what lemmas are needed for inductive proofs and how these lemmas can be synthesized automatically. This paper presents directed lemma synthesis, an effective approach to automating equivalence proofs by discovering critical lemmas using program synthesis techniques. We first identify two induction-friendly forms of propositions that give formal guarantees to the progress of the proof. We then propose two tactics that synthesize and apply lemmas, thereby transforming the proof goal into induction-friendly forms. Both tactics reduce lemma synthesis to a specialized class of program synthesis problems with efficient algorithms. Experimental results demonstrate the effectiveness of our approach: Compared to state-of-the-art equivalence checkers employing heuristic-based lemma enumeration, directed lemma synthesis saves 95.47% runtime on average and solves 38 more tasks over an extended version of the standard benchmark set.

Submitted 19 May, 2024; originally announced May 2024.

Comments: 21 pages

arXiv:2405.10300 [pdf, other]

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Authors: Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang

Abstract: This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API

Submitted 31 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: homepage: https://deepdataspace.com/home

arXiv:2405.10132 [pdf, other]

Cooperative Visual-LiDAR Extrinsic Calibration Technology for Intersection Vehicle-Infrastructure: A review

Authors: Xinyu Zhang, Yijin Xiong, Qianxin Qu, Renjie Wang, Xin Gao, Jing Liu, Shichun Guo, Jun Li

Abstract: In the typical urban intersection scenario, both vehicles and infrastructures are equipped with visual and LiDAR sensors. By successfully integrating the data from vehicle-side and road monitoring devices, a more comprehensive and accurate environmental perception and information acquisition can be achieved. The Calibration of sensors, as an essential component of autonomous driving technology, has consistently drawn significant attention. Particularly in scenarios involving multiple sensors collaboratively perceiving and addressing localization challenges, the requirement for inter-sensor calibration becomes crucial. Recent years have witnessed the emergence of the concept of multi-end cooperation, where infrastructure captures and transmits surrounding environment information to vehicles, bolstering their perception capabilities while mitigating costs. However, this also poses technical complexities, underscoring the pressing need for diverse end calibration. Camera and LiDAR, the bedrock sensors in autonomous driving, exhibit expansive applicability. This paper comprehensively examines and analyzes the calibration of multi-end camera-LiDAR setups from vehicle, roadside, and vehicle-road cooperation perspectives, outlining their relevant applications and profound significance. Concluding with a summary, we present our future-oriented ideas and hypotheses.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.07144 [pdf, other]

Optical transition parameters of the silicon T centre

Authors: Chloe Clear, Sara Hosseini, Amirhossein AlizadehKhaledi, Nicholas Brunelle, Austin Woolverton, Joshua Kanaganayagam, Moein Kazemi, Camille Chartrand, Mehdi Keshavarz, Yihuang Xiong, Oney O. Soykal, Geoffroy Hautier, Valentin Karassiouk, Mike Thewalt, Daniel Higginbottom, Stephanie Simmons

Abstract: The silicon T centre's narrow, telecommunications-band optical emission, long spin coherence, and direct photonic integration have spurred interest in this emitter as a spin-photon interface for distributed quantum computing and networking. However, key parameters of the T centre's spin-selective optical transitions remain undetermined or ambiguous in literature. In this paper we present a Hamiltonian of the T centre TX state and determine key parameters of the optical transition from T$_0$ to TX$_0$ from a combined analysis of published results, density functional theory, and new spectroscopy. We resolve ambiguous values of the internal defect potential in the literature, and we present the first measurements of electrically tuned T centre emission. As a result, we provide a model of the T centre's optical and spin properties under strain, electric, and magnetic fields that can be utilized for realizing quantum technologies.

Submitted 11 May, 2024; originally announced May 2024.

Comments: 9 pages and 6 figures in the main manuscript. 10 pages and 6 figures in the supplementary information

arXiv:2405.05165 [pdf, other]

Discovery of T center-like quantum defects in silicon

Authors: Yihuang Xiong, Jiongzhi Zheng, Shay McBride, Xueyue Zhang, Sinéad M. Griffin, Geoffroy Hautier

Abstract: Quantum technologies would benefit from the development of high performance quantum defects acting as single-photon emitters or spin-photon interface. Finding such a quantum defect in silicon is especially appealing in view of its favorable spin bath and high processability. While some color centers in silicon have been emerging in quantum applications, there is still a need to search and develop new high performance quantum emitters. Searching a high-throughput computational database of more than 22,000 charged complex defects in silicon, we identify a series of defects formed by a group III element combined with carbon ((A-C)$\rm _{Si}$ with A=B,Al,Ga,In,Tl) and substituting on a silicon site. These defects are analogous structurally, electronically and chemically to the well-known T center in silicon ((C-C-H)$\rm_{Si}$) and their optical properties are mainly driven by an unpaired electron in a carbon $p$ orbital. They all emit in the telecom and some of these color centers show improved properties compared to the T center in terms of computed radiative lifetime or emission efficiency. We also show that the synthesis of hydrogenated T center-like defects followed by a dehydrogenation annealing step could be an efficient way of synthesis. All the T center-like defects show a higher symmetry than the T center making them easier to align with magnetic fields. Our work motivates further studies on the synthesis and control of this new family of quantum defects, and also demonstrates the use of high-throughput computational screening to detect new complex quantum defects.

Submitted 8 May, 2024; originally announced May 2024.

Showing 1–50 of 616 results for author: Xiong, Y