Search | arXiv e-print repository

An Evaluation of Standard Statistical Models and LLMs on Time Series Forecasting

Abstract: This research examines the use of Large Language Models (LLMs) in predicting time series, with a specific focus on the LLMTIME model. Despite the established effectiveness of LLMs in tasks such as text generation, language translation, and sentiment analysis, this study highlights the key challenges that large language models encounter in the context of time series prediction. We assess the performance of LLMTIME across multiple datasets and introduce classical almost periodic functions as time series to gauge its effectiveness. The empirical results indicate that while large language models can perform well in zero-shot forecasting for certain datasets, their predictive accuracy diminishes notably when confronted with diverse time series data and traditional signals. The primary finding of this study is that the predictive capacity of LLMTIME, similar to other LLMs, significantly deteriorates when dealing with time series data that contain both periodic and trend components, as well as when the signal comprises complex frequency components.

Submitted 9 August, 2024; originally announced August 2024.
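
As an illustration of the kind of test signal the abstract above mentions, here is a minimal sketch of an almost periodic series with an optional linear trend. The abstract does not say which functions the authors used; the choice of sin(t) + sin(sqrt(2) t), the function names, and the parameters `step` and `trend_slope` are assumptions made for this example only.

```python
import math

def almost_periodic(t, trend_slope=0.0):
    """Textbook almost periodic function: two sinusoids whose frequency
    ratio (sqrt(2)) is irrational, so the sum never repeats exactly.
    trend_slope adds a hypothetical linear trend component."""
    return math.sin(t) + math.sin(math.sqrt(2.0) * t) + trend_slope * t

def make_series(n_points=200, step=0.1, trend_slope=0.0):
    """Sample the signal on a uniform grid t = 0, step, 2*step, ..."""
    return [almost_periodic(i * step, trend_slope) for i in range(n_points)]

# A series with both a periodic-like part and a trend, the combination
# the abstract reports as difficult for LLM-based forecasters.
series = make_series(n_points=200, step=0.1, trend_slope=0.05)
```

Such a series could be fed to any forecaster as a list of floats; setting `trend_slope=0.0` gives the purely almost periodic case.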

arXiv:2407.20469 [pdf]

Efficient, gigapixel-scale, aberration-free whole slide scanner using angular ptychographic imaging with closed-form solution

Authors: Shi Zhao, Haowen Zhou, Siyu Lin, Ruizhi Cao, Changhuei Yang

Abstract: Whole slide imaging provides a wide field-of-view (FOV) across cross-sections of biopsy or surgery samples, significantly facilitating pathological analysis and clinical diagnosis. Such high-quality images that enable detailed visualization of cellular and tissue structures are essential for effective patient care and treatment planning. To obtain such high-quality images for pathology applications, there is a need for scanners with high spatial bandwidth products, free from aberrations, and without the requirement for z-scanning. Here we report a whole slide imaging system based on angular ptychographic imaging with a closed-form solution (WSI-APIC), which offers efficient, tens-of-gigapixels, large-FOV, aberration-free imaging. WSI-APIC utilizes oblique incoherent illumination for initial high-level segmentation, thereby bypassing unnecessary scanning of the background regions and enhancing image acquisition efficiency. A GPU-accelerated APIC algorithm analytically reconstructs phase images with effective digital aberration corrections and improved optical resolutions. Moreover, an auto-stitching technique based on scale-invariant feature transform ensures the seamless concatenation of whole slide phase images. In our experiment, WSI-APIC achieved an optical resolution of 772 nm using a 10x/0.25 NA objective lens and captures 80-gigapixel aberration-free phase images for a standard 76.2 mm x 25.4 mm microscopic slide.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.10956 [pdf, other]

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Authors: Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

Abstract: Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documents of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflow. Our code and data are available at https://spider2-v.github.io.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 34 pages, 14 figures, 10 tables

arXiv:2407.08333 [pdf, other]

SR-Mamba: Effective Surgical Phase Recognition with State Space Model

Authors: Rui Cao, Jiangliu Wang, Yun-Hui Liu

Abstract: Surgical phase recognition is crucial for enhancing the efficiency and safety of computer-assisted interventions. One of the fundamental challenges involves modeling the long-distance temporal relationships present in surgical videos. Inspired by the recent success of Mamba, a state space model with linear scalability in sequence length, this paper presents SR-Mamba, a novel attention-free model specifically tailored to meet the challenges of surgical phase recognition. In SR-Mamba, we leverage a bidirectional Mamba decoder to effectively model the temporal context in overlong sequences. Moreover, the efficient optimization of the proposed Mamba decoder facilitates single-step neural network training, eliminating the need for separate training steps as in previous works. This single-step training approach not only simplifies the training process but also ensures higher accuracy, even with a lighter spatial feature extractor. Our SR-Mamba establishes a new benchmark in surgical video analysis by demonstrating state-of-the-art performance on the Cholec80 and CATARACTS Challenge datasets. The code is accessible at https://github.com/rcao-hk/SR-Mamba.

Submitted 11 July, 2024; originally announced July 2024.

Comments: Technical Report

arXiv:2407.00932 [pdf, other]

Orbital phases of $p$-band ultracold fermions in the frustrated triangular lattice

Authors: Jiaqi Wu, Hui Tan, Rui Cao, Jianmin Yuan, Yongqiang Li

Abstract: Orbital degrees of freedom play an important role for understanding the emergence of unconventional quantum phases. Ultracold atomic gases in optical lattices provide a wonderful platform to simulate orbital physics. In this work, we consider spinless fermionic atoms loaded into $p$-orbital bands of a two-dimensional frustrated triangular lattice. The system can be described by an extended Fermi-Hubbard model, which is numerically solved by using the orbital version of real-space dynamical mean-field theory. Low-temperature phase diagrams are obtained, which contain stripe-, ferro- and para-orbital ordered quantum phases, due to the interplay of anisotropic hoppings and geometrical frustration. In order to understand the underlying mechanics of competing orbital orders, we derive an effective orbital-exchange model, which yields consistent explanation with our main numerical results.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 9 pages, 7 figures

arXiv:2407.00383 [pdf, other]

FANFOLD: Graph Normalizing Flows-driven Asymmetric Network for Unsupervised Graph-Level Anomaly Detection

Authors: Rui Cao, Shijie Xue, Jindong Li, Qi Wang, Yi Chang

Abstract: Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application. In recent studies, knowledge distillation-based methods have been widely used in unsupervised anomaly detection to improve model efficiency and generalization. However, the inherent symmetry between the source (teacher) and target (student) networks typically results in consistent outputs across both architectures, making it difficult to distinguish abnormal graphs from normal graphs. Also, existing methods mainly rely on graph features to distinguish anomalies, which may be unstable with complex and diverse data and fail to capture the essence that differentiates normal graphs from abnormal ones. In this work, we propose a Graph Normalizing Flows-driven Asymmetric Network For Unsupervised Graph-Level Anomaly Detection (FANFOLD in short). We introduce normalizing flows to unsupervised graph-level anomaly detection due to their successful application and superior quality in learning the underlying distribution of samples. Specifically, we adopt the knowledge distillation technique and apply normalizing flows on the source network, achieving the asymmetric network. In the training stage, FANFOLD transforms the original distribution of normal graphs to a standard normal distribution. During inference, FANFOLD computes the anomaly score using the source-target loss to discriminate between normal and anomalous graphs. We conduct extensive experiments on 15 datasets of different fields with 9 baseline methods to validate the superiority of FANFOLD.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.05546 [pdf, other]

Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training

Authors: Ray Cao, Sherry Luo, Steve Gan, Sujeeth Jinesh

Abstract: In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure using various parameter server configurations. Our failure recovery strategies include traditional checkpointing, chain replication (which ensures a backup server takes over in case of failure), and a novel stateless parameter server approach. In the stateless approach, workers continue generating gradient updates even if the parameter server is down, applying these updates once the server is back online. We compare these techniques to a standard checkpointing approach, where the training job is resumed from the latest checkpoint. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training for each experiment. Our experiment results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy as much as 10% in the face of a failure despite using stale weights and gradients. The chain replication and checkpointing techniques demonstrate convergence but suffer from setbacks in accuracy due to restarting from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying these updates later can effectively improve hardware utilization. Furthermore, despite higher resource usage, the stateless parameter server method incurs similar monetary costs in terms of hardware usage compared to standard checkpointing methods due to the pricing structure of common cloud providers.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.04536 [pdf, other]

Emergence of topological states in relaxation dynamics of interacting bosons

Authors: Wang Huang, Xuchen Yang, Rui Cao, Yinghai Wu, Jianmin Yuan, Yongqiang Li

Abstract: Topological concepts have been employed to understand the ground states of many strongly correlated systems, but it is still quite unclear if and how topology manifests itself in the relaxation dynamics. Here we uncover emergent topological phenomena in the time evolution of far-from-equilibrium one-dimensional interacting bosons. Beginning with simple product states, the system evolves into long-time stationary states with high energy that are nonthermal for a wide range of parameters, and they exhibit nonlocal string correlation that is characteristic of the symmetry-protected topological ground state of the Hamiltonian. In contrast, no topological feature is found in the stationary state as long as the system thermalizes. This difference is further corroborated by the distinct behaviour of quantum entanglement and edge states of the system. Our theoretical prediction can be examined by current experimental techniques and paves the way for a more comprehensive understanding of topological phases in nonequilibrium settings.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 7 pages, 4 figures, with supplementary information

arXiv:2406.01422 [pdf, other]

How to Understand Whole Software Repository?

Authors: Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li

Abstract: Recently, Large Language Model (LLM) based agents have significantly advanced the development of Automatic Software Engineering (ASE). Although their effectiveness has been verified, the designs of the existing methods mainly focus on the local information of codes, e.g., issues, classes, and functions, leading to limitations in capturing the global context and interdependencies within the software system. From the practical experiences of human SE developers, we argue that an excellent understanding of the whole repository will be the critical path to ASE. However, understanding the whole repository raises various challenges, e.g., the extremely long code input, the noisy code information, the complex dependency relationships, etc. To this end, we develop a novel ASE method named RepoUnderstander by guiding agents to comprehensively understand whole repositories. Specifically, we first condense the critical information of the whole repository into the repository knowledge graph in a top-down mode to decrease the complexity of the repository. Subsequently, we equip the agents with the ability to understand the whole repository by proposing a Monte Carlo tree search based repository exploration strategy. In addition, to better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan. Then, they can manipulate the tools to dynamically acquire information and generate the patches to solve real-world GitHub issues. Extensive experiments demonstrate the superiority and effectiveness of the proposed RepoUnderstander. It achieved 18.5% relative improvement on the SWE-bench Lite benchmark compared to SWE-agent.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.03126 [pdf]

Infrared Polarization Imaging-based Non-destructive Thermography Inspection

Authors: Xianyu Wu, Bin Zhou, Peng Lin, Rongjin Cao, Feng Huang

Abstract: Infrared pulse thermography non-destructive testing (NDT) method is developed based on the difference in the infrared radiation intensity emitted by defective and non-defective areas of an object. However, when the radiation intensity of the defective target is similar to that of the non-defective area of the object, the detection results are poor. To address this issue, this study investigated the polarization characteristics of the infrared radiation of different materials. Simulation results showed that the degree of infrared polarization of the object surface changed regularly with changes in thermal environment radiation. An infrared polarization imaging-based NDT method was proposed and demonstrated using specimens with four different simulated defective areas, which were designed and fabricated using four different materials. The experimental results were consistent with the simulation results, thereby proving the effectiveness of the proposed method. Compared with the infrared-radiation-intensity-based NDT method, the proposed method improved the image detail presentation and detection accuracy.

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.02712 [pdf, other]

CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions

Authors: Hanchong Zhang, Ruisheng Cao, Hongshen Xu, Lu Chen, Kai Yu

Abstract: Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks. We investigate the issue of prompt design in the multi-turn text-to-SQL task and attempt to enhance the LLMs' reasoning capacity when generating SQL queries. In the conversational context, the current SQL query can be modified from the preceding SQL query with only a few operations due to the context dependency. We introduce our method called CoE-SQL which can prompt LLMs to generate the SQL query based on the previously generated SQL query with an edition chain. We also conduct extensive ablation studies to determine the optimal configuration of our approach. Our approach outperforms different in-context learning baselines stably and achieves state-of-the-art performances on two benchmarks SParC and CoSQL using LLMs, which is also competitive to the SOTA fine-tuned models.

Submitted 4 May, 2024; originally announced May 2024.

arXiv:2405.02660 [pdf, other]

AFDM Channel Estimation in Multi-Scale Multi-Lag Channels

Authors: Rongyou Cao, Yuheng Zhong, Jiangbin Lyu, Deqing Wang, Liqun Fu

Abstract: Affine Frequency Division Multiplexing (AFDM) is a brand new chirp-based multi-carrier (MC) waveform for high mobility communications, with promising advantages over Orthogonal Frequency Division Multiplexing (OFDM) and other MC waveforms. Existing AFDM research focuses on wireless communication at high carrier frequency (CF), which typically considers only Doppler frequency shift (DFS) as a result of mobility, while ignoring the accompanying Doppler time scaling (DTS) on the waveform. However, for underwater acoustic (UWA) communication at much lower CF and propagating at the speed of sound, the DTS effect cannot be ignored and poses significant challenges for channel estimation. This paper analyzes the channel frequency response (CFR) of AFDM under multi-scale multi-lag (MSML) channels, where each propagating path could have different delay and DFS/DTS. Based on the newly derived input-output formula and its characteristics, two new channel estimation methods are proposed, i.e., AFDM with iterative multi-index (AFDM-IMI) estimation under low to moderate DTS, and AFDM with orthogonal matching pursuit (AFDM-OMP) estimation under high DTS. Numerical results confirm the effectiveness of the proposed methods against the original AFDM channel estimation method. Moreover, the resulting AFDM system outperforms OFDM as well as Orthogonal Chirp Division Multiplexing (OCDM) in terms of channel estimation accuracy and bit error rate (BER), which is consistent with our theoretical analysis based on CFR overlap probability (COP), mutual incoherent property (MIP) and channel diversity gain under MSML channels.

Submitted 4 May, 2024; originally announced May 2024.

Comments: 6 pages, 6 figures. Investigate AFDM under underwater multi-scale multi-lag channels. Derive the new input-output formula with the impact of Doppler time scaling. Propose two new channel estimation methods to tackle different levels of Doppler factors. Perform diversity analysis based on CFR overlap probability (COP) and mutual incoherent property (MIP)

arXiv:2405.01958 [pdf, other]

Improved distance correlation estimation

Authors: Blanca E. Monroy-Castillo, M. A. Jácome, Ricardo Cao

Abstract: Distance correlation is a novel class of multivariate dependence measure, taking positive values between 0 and 1, and applicable to random vectors of arbitrary dimensions, not necessarily equal. It offers several advantages over the well-known Pearson correlation coefficient, the most important being that distance correlation equals zero if and only if the random vectors are independent. There are two different estimators of the distance correlation available in the literature. The first one, proposed by Székely et al. (2007), is based on an asymptotically unbiased estimator of the distance covariance which turns out to be a V-statistic. The second one builds on an unbiased estimator of the distance covariance proposed in Székely et al. (2014), proved to be a U-statistic by Székely and Huo (2016). This study evaluates their efficiency (mean squared error) and compares computational times for both methods under different dependence structures. Under conditions of independence or near-independence, the V-estimates are biased, while the U-estimator frequently cannot be computed due to negative values. To address this challenge, a convex linear combination of the former estimators is proposed and studied, yielding good results regardless of the level of dependence.

Submitted 3 May, 2024; originally announced May 2024.
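
To make the two estimators named in the abstract above concrete, here is a minimal sketch of the squared sample distance covariance in its V-statistic form (double-centred distance matrices) and its U-statistic form (U-centred distance matrices), restricted to univariate samples for brevity. The function names are hypothetical, and the convex combination proposed in the paper is not implemented because the abstract does not give its weights.

```python
import math

def _dist_matrix(x):
    # Pairwise absolute distances for a univariate sample.
    return [[abs(a - b) for b in x] for a in x]

def _double_center(d):
    # A_kl = a_kl - rowmean_k - colmean_l + grandmean (V-statistic form).
    n = len(d)
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def _u_center(d):
    # U-centring: zero diagonal, row/column sums scaled by 1/(n-2),
    # total sum scaled by 1/((n-1)(n-2)).
    n = len(d)
    row = [sum(r) for r in d]
    total = sum(row)
    return [[0.0 if i == j else
             d[i][j] - row[i] / (n - 2) - row[j] / (n - 2)
             + total / ((n - 1) * (n - 2))
             for j in range(n)] for i in range(n)]

def _inner(a, b):
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))

def dcov2_v(x, y):
    """V-statistic estimate of squared distance covariance (never negative)."""
    n = len(x)
    return _inner(_double_center(_dist_matrix(x)),
                  _double_center(_dist_matrix(y))) / n ** 2

def dcov2_u(x, y):
    """U-statistic (unbiased) estimate; it can be negative, in which case
    no square root, and hence no distance correlation, is available."""
    n = len(x)
    if n < 4:
        raise ValueError("the U-statistic form needs at least 4 observations")
    return _inner(_u_center(_dist_matrix(x)),
                  _u_center(_dist_matrix(y))) / (n * (n - 3))

def dcor_v(x, y):
    """Distance correlation built from the V-statistic estimates."""
    denom = math.sqrt(dcov2_v(x, x) * dcov2_v(y, y))
    return math.sqrt(max(dcov2_v(x, y), 0.0) / denom) if denom > 0 else 0.0
```

For example, `dcor_v(x, [2 * a + 1 for a in x])` is 1 up to rounding for any non-constant `x`, since an affine map only rescales the distance matrix.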

arXiv:2404.08979 [pdf, other]

BG-YOLO: A Bidirectional-Guided Method for Underwater Object Detection

Authors: Jian Zhang, Ruiteng Zhang, Xinyue Yan, Xiting Zhuang, Ruicheng Cao

Abstract: Degraded underwater images decrease the accuracy of underwater object detection. However, existing methods for underwater image enhancement mainly focus on improving the indicators in visual aspects, which may not benefit the tasks of underwater image detection, and may lead to serious degradation in performance. To alleviate this problem, we proposed a bidirectional-guided method for underwater object detection, referred to as BG-YOLO. In the proposed method, the network is organized by constructing an enhancement branch and a detection branch in parallel. The enhancement branch consists of a cascade of an image enhancement subnet and an object detection subnet, while the detection branch consists only of a detection subnet. A feature guided module connects the shallow convolution layers of the two branches. When training the enhancement branch, the object detection subnet in the enhancement branch guides the image enhancement subnet to be optimized towards the direction that is most conducive to the detection task. The shallow feature map of the trained enhancement branch is output to the feature guided module, constraining the optimization of the detection branch through a consistency loss and prompting the detection branch to learn more detailed information about the objects, which refines the detection performance. During detection, only the detection branch is retained, so that no additional computational cost is introduced. Extensive experiments demonstrate that the proposed method shows significant improvement in performance of the detector in severely degraded underwater scenes while maintaining a remarkable detection speed.

Submitted 13 April, 2024; originally announced April 2024.

Comments: 15 pages, 8 figures, 4 tables

MSC Class: 68T07; 68T45 ACM Class: I.4.3; I.4.8; I.4.9; I.4.10; I.2.10

arXiv:2404.07972 [pdf, other]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

Abstract: Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

Submitted 30 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

Comments: 51 pages, 21 figures

arXiv:2404.05937 [pdf]

De-aberration for transcranial photoacoustic computed tomography through an adult human skull

Authors: Yousuf Aborahama, Karteekeya Sastry, Manxiu Cui, Yang Zhang, Yilin Luo, Rui Cao, Lihong V. Wang

Abstract: Noninvasive transcranial photoacoustic computed tomography (PACT) of the human brain, despite its clinical potential, remains impeded by the acoustic distortion induced by the human skull. The distortion, which is attributed to the markedly different material properties of the skull relative to soft tissue, results in heavily aberrated PACT images -- a problem that has remained unsolved in the past two decades. Herein, we report the first successful experimental demonstration of the de-aberration of PACT images through an ex-vivo adult human skull using a homogeneous elastic model for the skull. Using only the geometry, position, and orientation of the skull, we accurately de-aberrate the PACT images of light-absorbing phantoms acquired through an ex-vivo human skull, in terms of the recovered phantom features, for different levels of phantom complexity and positions. Our work addresses the longstanding challenge of skull-induced aberrations in transcranial PACT and advances the field towards unlocking the full potential of transcranial human brain PACT.

Submitted 8 April, 2024; originally announced April 2024.

Comments: 11 pages, 3 figures

arXiv:2404.01298 [pdf, other]

Noise2Image: Noise-Enabled Static Scene Recovery for Event Cameras

Authors: Ruiming Cao, Dekel Galor, Amit Kohli, Jacob L Yates, Laura Waller

Abstract: Event cameras capture changes of intensity over time as a stream of 'events' and generally cannot measure intensity itself; hence, they are only used for imaging dynamic scenes. However, fluctuations due to random photon arrival inevitably trigger noise events, even for static scenes. While previous efforts have been focused on filtering out these undesirable noise events to improve signal quality, we find that, in the photon-noise regime, these noise events are correlated with the static scene intensity. We analyze the noise event generation and model its relationship to illuminance. Based on this understanding, we propose a method, called Noise2Image, to leverage the illuminance-dependent noise characteristics to recover the static parts of a scene, which are otherwise invisible to event cameras. We experimentally collect a dataset of noise events on static scenes to train and validate Noise2Image. Our results show that Noise2Image can robustly recover intensity images solely from noise events, providing a novel approach for capturing static scenes in event cameras, without additional hardware.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.16786 [pdf, other]

doi 10.1109/LRA.2024.3387145

DBPF: A Framework for Efficient and Robust Dynamic Bin-Picking

Authors: Yichuan Li, Junkai Zhao, Yixiao Li, Zheng Wu, Rui Cao, Masayoshi Tomizuka, Yunhui Liu

Abstract: Efficiency and reliability are critical in robotic bin-picking as they directly impact the productivity of automated industrial processes. However, traditional approaches, demanding static objects and fixed collisions, lead to deployment limitations, operational inefficiencies, and process unreliability. This paper introduces a Dynamic Bin-Picking Framework (DBPF) that challenges traditional static assumptions. The DBPF endows the robot with the reactivity to pick multiple moving arbitrary objects while avoiding dynamic obstacles, such as the moving bin. Combined with scene-level pose generation, the proposed pose selection metric leverages the Tendency-Aware Manipulability Network optimizing suction pose determination. Heuristic task-specific designs like velocity-matching, dynamic obstacle avoidance, and the resight policy, enhance the picking success rate and reliability. Empirical experiments demonstrate the importance of these components. Our method achieves an average 84% success rate, surpassing the 60% of the most comparable baseline, crucially, with zero collisions. Further evaluations under diverse dynamic scenarios showcase DBPF's robust performance in dynamic bin-picking. Results suggest that our framework offers a promising solution for efficient and reliable robotic bin-picking under dynamics.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 8 pages, 5 figures. This paper was accepted by IEEE RA-L on 2024-03-24. See the supplementary video on YouTube: https://youtu.be/n5af2VsKhkg

arXiv:2403.02948 [pdf, other]

Front-end electronics development of large-area SiPM arrays for high-precision single-photon time measurement

Authors: Wei Zhi, Ruike Cao, Jiannan Tang, Mingxin Wang, Yongqi Tan, Weihao Wu, Donglian Xu

Abstract: TRopIcal DEep-sea Neutrino Telescope (TRIDENT) plans to incorporate silicon photomultipliers (SiPMs) with superior time resolution in addition to photomultiplier tubes (PMTs) into its detection units, namely hybrid Digital Optical Modules (hDOMs), to improve its angular resolution. However, the time resolution significantly degrades for large-area SiPMs due to the large detector capacitance, posing significant challenges for the readout electronics of SiPMs in hDOM. We analyzed the influences of series and parallel connections when constructing a large-area SiPM array and designed a series-parallel connection SiPM array with differential output. We also designed a high-speed pre-amplifier based on transformers (MABA-007159) and radio frequency amplifiers (BGA2803), and an analog multi-channel summing circuit based on operational amplifiers (LMH6629). We measured the single photon time resolution (SPTR) of a $4\times4$ SiPM (Hamamatsu S13360-3050PE) array ($12\times12~\mathrm{mm}^2$) of approximately 300 ps FWHM. This front-end readout design enables the large-area SiPM array to achieve high-precision single photon time measurement in one readout channel.

Submitted 7 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: Revised version. 12 pages, 10 figures

arXiv:2403.00129 [pdf, ps, other]

Average-Case Local Computation Algorithms

Authors: Amartya Shankha Biswas, Ruidi Cao, Edward Pyne, Ronitt Rubinfeld

Abstract: We initiate the study of Local Computation Algorithms on average case inputs. In the Local Computation Algorithm (LCA) model, we are given probe access to a huge graph, and asked to answer membership queries about some combinatorial structure on the graph, answering each query with sublinear work. For instance, an LCA for the $k$-spanner problem gives access to a sparse subgraph $H\subseteq G$ that preserves distances up to a factor of $k$. We build simple LCAs for this problem assuming the input graph is drawn from the well-studied Erdős-Rényi and Preferential Attachment graph models. In both cases, our spanners achieve size and stretch tradeoffs that are impossible to achieve for general graphs, while having dramatically lower query complexity than worst-case LCAs. Our second result investigates the intersection of LCAs with Local Access Generators (LAGs). Local Access Generators provide efficient query access to a random object, for instance an Erdős-Rényi random graph. We explore the natural problem of generating a random graph together with a combinatorial structure on it. We show that this combination can be easier to solve than focusing on each problem by itself, by building a fast, simple algorithm that provides access to an Erdős-Rényi random graph together with a maximal independent set.

Submitted 29 February, 2024; originally announced March 2024.

Comments: 27 pages

arXiv:2402.18262 [pdf, other]

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

Authors: Hongshen Xu, Lu Chen, Zihan Zhao, Da Ma, Ruisheng Cao, Zichen Zhu, Kai Yu

Abstract: The growing prevalence of visually rich documents, such as webpages and scanned/digital-born documents (images, PDFs, etc.), has led to increased interest in automatic document understanding and information extraction across academia and industry. Although various document modalities, including image, text, layout, and structure, facilitate human information retrieval, the interconnected nature of these modalities presents challenges for neural networks. In this paper, we introduce WebLM, a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages. Instead of processing document images as unified natural images, WebLM integrates the hierarchical structure of document images to enhance the understanding of markup-language-based documents. Additionally, we propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively. Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks. The pre-trained models and code are available at https://github.com/X-LANCE/weblm.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.18258 [pdf, other]

A BiRGAT Model for Multi-intent Spoken Language Understanding with Hierarchical Semantic Frames

Authors: Hongshen Xu, Ruisheng Cao, Su Zhu, Sheng Jiang, Hanchong Zhang, Lu Chen, Kai Yu

Abstract: Previous work on spoken language understanding (SLU) mainly focuses on single-intent settings, where each input utterance merely contains one user intent. This configuration significantly limits the surface form of user utterances and the capacity of output semantics. In this work, we first propose a Multi-Intent dataset which is collected from a realistic in-Vehicle dialogue System, called MIVS. The target semantic frame is organized in a 3-layer hierarchical structure to tackle the alignment and assignment problems in multi-intent cases. Accordingly, we devise a BiRGAT model to encode the hierarchy of ontology items, the backbone of which is a dual relational graph attention network. Coupled with the 3-way pointer-generator decoder, our method outperforms traditional sequence labeling and classification-based schemes by a large margin.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.11845 [pdf, other]

Modularized Networks for Few-shot Hateful Meme Detection

Authors: Rui Cao, Roy Ka-Wei Lee, Jing Jiang

Abstract: In this paper, we address the challenge of detecting hateful memes in the low-resource setting where only a few labeled examples are available. Our approach leverages the compositionality of Low-rank adaptation (LoRA), a widely used parameter-efficient tuning technique. We commence by fine-tuning large language models (LLMs) with LoRA on selected tasks pertinent to hateful meme detection, thereby generating a suite of LoRA modules. These modules are capable of essential reasoning skills for hateful meme detection. We then use the few available annotated samples to train a module composer, which assigns weights to the LoRA modules based on their relevance. The model's learnable parameters are directly proportional to the number of LoRA modules. This modularized network, underpinned by LLMs and augmented with LoRA modules, exhibits enhanced generalization in the context of hateful meme detection. Our evaluation spans three datasets designed for hateful meme detection in a few-shot learning context. The proposed method demonstrates superior performance to traditional in-context learning, which is also more computationally intensive during inference.

Submitted 19 February, 2024; originally announced February 2024.

Comments: camera-ready for WWW, 2024, Web4Good

arXiv:2402.05589 [pdf, other]

RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

Authors: Ying Zang, Chenglong Fu, Runlong Cao, Didi Zhu, Min Zhang, Wenjun Hu, Lanyun Zhu, Tianrun Chen

Abstract: Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual contexts and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. Extensive validation on multiple RES datasets demonstrates that RESMatch significantly outperforms baseline approaches, establishing a new state-of-the-art. Although existing SSL techniques are effective in image segmentation, we find that they fall short in RES. Facing the challenges including the comprehension of free-form linguistic descriptions and the variability in object attributes, RESMatch introduces a trifecta of adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong-weak supervision. This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.

Submitted 11 February, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.02541 [pdf, other]

Knowledge Generation for Zero-shot Knowledge-based VQA

Authors: Rui Cao, Jing Jiang

Abstract: Previous solutions to knowledge-based visual question answering (K-VQA) retrieve knowledge from external knowledge bases and use supervised learning to train the K-VQA model. Recently, pre-trained LLMs have been used as both a knowledge source and a zero-shot QA model for K-VQA and demonstrated promising results. However, these recent methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability. Inspired by recent work on knowledge generation from LLMs for text-based QA, in this work we propose and test a similar knowledge-generation-based K-VQA method, which first generates knowledge from an LLM and then incorporates the generated knowledge for K-VQA in a zero-shot manner. We evaluate our method on two K-VQA benchmarks and find that our method performs better than previous zero-shot K-VQA methods and that our generated knowledge is generally relevant and helpful.

Submitted 4 February, 2024; originally announced February 2024.

Comments: accepted as Findings in EACL 2023

arXiv:2401.17987 [pdf, other]

doi 10.1093/biomet/asaa092

Bagging cross-validated bandwidths with application to Big Data

Authors: Daniel Barreiro-Ures, Ricardo Cao, Mario Francisco Fernández, Jeffrey D. Hart

Abstract: Hall and Robinson (2009) proposed and analyzed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise inherent in ordinary cross-validation, and hence leads to a more efficient bandwidth selector. The asymptotic theory of Hall and Robinson (2009) assumes that $N$, the number of bagged subsamples, is $\infty$. We expand upon their theoretical results by allowing $N$ to be finite, as it is in practice. Our results indicate an important difference in the rate of convergence of the bagged cross-validation bandwidth for the cases $N=\infty$ and $N<\infty$. Simulations quantify the improvement in statistical efficiency and computational speed that can result from using bagged cross-validation as opposed to a binned implementation of ordinary cross-validation. The performance of the bagged bandwidth is also illustrated on a real, very large, data set. Finally, a byproduct of our study is the correction of errors appearing in the Hall and Robinson (2009) expression for the asymptotic mean squared error of the bagging selector.

Submitted 31 January, 2024; originally announced January 2024.

Comments: 37 pages, 9 figures

MSC Class: 62G07 (Primary); 62G20 (Secondary)

Journal ref: Bagging cross-validated bandwidths with application to Big Data. Biometrika (2021), 108(4), 981-988

arXiv:2401.17347 [pdf, ps, other]

doi 10.1007/s10489-021-02311-8

Cure models to estimate time until hospitalization due to COVID-19

Authors: Maria Pedrosa-Laza, Ana López-Cheda, Ricardo Cao

Abstract: A short introduction to survival analysis and censored data is included in this paper. A thorough literature review in the field of cure models has been done. An overview of the most important and recent approaches to parametric, semiparametric and nonparametric mixture cure models is also included. The main nonparametric and semiparametric approaches were applied to a real-time dataset of COVID-19 patients from the first weeks of the epidemic in Galicia (NW Spain). The aim is to model the elapsed time from diagnosis to hospital admission. The main conclusions, as well as the limitations of both the cure models and the dataset, are presented, illustrating the usefulness of cure models in this kind of study, where the influence of age and sex on the time to hospital admission is shown.

Submitted 30 January, 2024; originally announced January 2024.

Comments: 14 pages, 8 figures

Journal ref: Appl Intell, 2022, 52, 794-807

arXiv:2401.17152 [pdf, other]

doi 10.1016/j.csda.2016.08.002

Nonparametric incidence estimation and bootstrap bandwidth selection in mixture cure models

Authors: Ana López-Cheda, Ricardo Cao, M. Amalia Jácome, Ingrid Van Keilegom

Abstract: A completely nonparametric method for the estimation of mixture cure models is proposed. A nonparametric estimator of the incidence is extensively studied and a nonparametric estimator of the latency is presented. These estimators, which are based on the Beran estimator of the conditional survival function, are proved to be the local maximum likelihood estimators. An i.i.d. representation is obtained for the nonparametric incidence estimator. As a consequence, an asymptotically optimal bandwidth is found. Moreover, a bootstrap bandwidth selection method for the nonparametric incidence estimator is proposed. The introduced nonparametric estimators are compared with existing semiparametric approaches in a simulation study, in which the performance of the bootstrap bandwidth selector is also assessed. Finally, the method is applied to a database of colorectal cancer from the University Hospital of A Coruña (CHUAC).

Submitted 30 January, 2024; originally announced January 2024.

Comments: 22 pages; 8 figures

Journal ref: Computational Statistics and Data Analysis, 2017, 105, 144-165

arXiv:2401.17110 [pdf, other]

doi 10.1002/sim.8530

Nonparametric covariate hypothesis tests for the cure rate in mixture cure models

Authors: Ana López-Cheda, M. Amalia Jácome, Ingrid Van Keilegom, Ricardo Cao

Abstract: In lifetime data, like cancer studies, there may be long term survivors, which lead to heavy censoring at the end of the follow-up period. Since a standard survival model is not appropriate to handle these data, a cure model is needed. In the literature, covariate hypothesis tests for cure models are limited to parametric and semiparametric methods. We fill this important gap by proposing a nonparametric covariate hypothesis test for the probability of cure in mixture cure models. A bootstrap method is proposed to approximate the null distribution of the test statistic. The procedure can be applied to any type of covariate, and could be extended to the multivariate setting. Its efficiency is evaluated in a Monte Carlo simulation study. Finally, the method is applied to a colorectal cancer dataset.

Submitted 30 January, 2024; originally announced January 2024.

Comments: 17 pages, 4 figures

Journal ref: Statistics in Medicine, 2020, 39, 2291-2307

arXiv:2401.16954 [pdf, ps, other]

doi 10.1007/s11749-016-0515-1

Nonparametric latency estimation for mixture cure models

Authors: Ana López-Cheda, M. Amalia Jácome, Ricardo Cao

Abstract: A nonparametric latency estimator for mixture cure models is studied in this paper. An i.i.d. representation is obtained, the asymptotic mean squared error of the latency estimator is found, and its asymptotic normality is proven. A bootstrap bandwidth selection method is introduced and its efficiency is evaluated in a simulation study. The proposed methods are applied to a dataset of colorectal cancer patients in the University Hospital of A Coruña (CHUAC).

Submitted 30 January, 2024; originally announced January 2024.

Comments: 24 pages, 3 figures

Journal ref: TEST, 2017, 26(2), 353-376

arXiv:2401.16727 [pdf, other]

Recent Advances in Hate Speech Moderation: Multimodality and the Role of Large Models

Authors: Ming Shan Hee, Shivam Sharma, Rui Cao, Palash Nandi, Tanmoy Chakraborty, Roy Ka-Wei Lee

Abstract: In the evolving landscape of online communication, moderating hate speech (HS) presents an intricate challenge, compounded by the multimodal nature of digital content. This comprehensive survey delves into the recent strides in HS moderation, spotlighting the burgeoning role of large language models (LLMs) and large multimodal models (LMMs). Our exploration begins with a thorough analysis of current literature, revealing the nuanced interplay between textual, visual, and auditory elements in propagating HS. We uncover a notable trend towards integrating these modalities, primarily due to the complexity and subtlety with which HS is disseminated. A significant emphasis is placed on the advances facilitated by LLMs and LMMs, which have begun to redefine the boundaries of detection and moderation capabilities. We identify existing gaps in research, particularly in the context of underrepresented languages and cultures, and the need for solutions to handle low-resource settings. The survey concludes with a forward-looking perspective, outlining potential avenues for future research, including the exploration of novel AI methodologies, the ethical governance of AI in moderation, and the development of more nuanced, context-aware systems. This comprehensive overview aims to catalyze further research and foster a collaborative effort towards more sophisticated, responsible, and human-centric approaches to HS moderation in the digital era. WARNING: This paper contains offensive examples.

Submitted 1 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Preprint; Under-Review

arXiv:2401.15259 [pdf]

doi 10.1017/S0950268821000959

Estimating lengths-of-stay of hospitalised COVID-19 patients using a non-parametric model: a case study in Galicia (Spain)

Authors: Ana López-Cheda, M. Amalia Jácome, Ricardo Cao, Pablo M. De Salazar

Abstract: Estimating the lengths-of-stay (LoS) of hospitalised COVID-19 patients is key for predicting the hospital beds' demand and planning mitigation strategies, as overwhelming the healthcare systems has critical consequences for disease mortality. However, accurately mapping the time-to-event of hospital outcomes, such as the LoS in the intensive care unit (ICU), requires understanding patient trajectories while adjusting for covariates and observation bias, such as incomplete data. Standard methods, such as the Kaplan-Meier estimator, require prior assumptions that are untenable given current knowledge. Using real-time surveillance data from the first weeks of the COVID-19 epidemic in Galicia (Spain), we aimed to model the time-to-event and event probabilities of hospitalised patients, without parametric priors and adjusting for individual covariates. We applied a non-parametric mixture cure model and compared its performance in estimating hospital ward (HW)/ICU LoS to the performances of commonly used methods to estimate survival. We showed that the proposed model outperformed standard approaches, providing more accurate ICU and HW LoS estimates. Finally, we applied our model estimates to simulate COVID-19 hospital demand using a Monte Carlo algorithm. We provided evidence that adjusting for sex, generally overlooked in prediction models, together with age is key for accurately forecasting HW and ICU occupancy, as well as discharge or death outcomes.

Submitted 26 January, 2024; originally announced January 2024.

Comments: 14 pages, 4 figures

Journal ref: Epidemiology and Infection; 149:e102, 2021

arXiv:2401.04866 [pdf]

Airline recovery problem under disruptions: A review

Authors: Shuai Wu, Enze Liu, Rui Cao, Qiang Bai

Abstract: In practice, both passenger and cargo flights are vulnerable to unexpected factors, such as adverse weather, airport flow control, crew absence, unexpected aircraft maintenance, and pandemic, which can cause disruptions in flight schedules. Thus, managers need to reallocate relevant resources to ensure that the airport can return to normal operations on the basis of minimum cost, which is the airline recovery problem. Airline recovery is an active research area, with a lot of publications in recent years. To better summarize the progress of airline recovery, first of all, keywords are chosen to search the relevant studies, then software is used to analyze the existing studies in terms of the number of papers, keywords, and sources. Secondly, the airline recovery problem is divided into two categories, namely Passenger-Oriented Airline Recovery Problem (POARP) and Cargo-Oriented Airline Recovery Problem (COARP). In POARP, the existing studies are classified according to recovery strategies, including common recovery strategies, cruise speed control strategy, flexible aircraft maintenance strategy, multi-modal transportation strategy, passenger-centric recovery strategy, and clubbing of flights strategy. Moreover, the POARP is discussed from the perspectives of disruption types, recovery strategies, problem types, objective functions, and solution methods. Thirdly, POARP and COARP are compared from the perspectives of timeliness, subjectivity, flexibility, transferability, and combinability. Finally, the conclusions are drawn and future study directions are provided. For future studies, it is recommended to conduct more in-depth research on dynamic and real-time recovery, incorporating human factors into the modeling, multi-modal transportation coupling, optimization of other airport processes, combination of robust scheduling and airline recovery, and optimization algorithm improvement.

Submitted 16 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

arXiv:2312.13683 [pdf, other]

Joint Channel Estimation and Cooperative Localization for Near-Field Ultra-Massive MIMO

Authors: Ruoxiao Cao, Hengtao He, Xianghao Yu, Shenghui Song, Kaibin Huang, Jun Zhang, Yi Gong, Khaled B. Letaief

Abstract: The next-generation (6G) wireless networks are expected to provide not only seamless and high data-rate communications, but also ubiquitous sensing services. By providing vast spatial degrees of freedom (DoFs), ultra-massive multiple-input multiple-output (UM-MIMO) technology is a key enabler for both sensing and communications in 6G. However, the adoption of UM-MIMO leads to a shift from the far field to the near field in terms of the electromagnetic propagation, which poses novel challenges in system design. Specifically, near-field effects introduce highly non-linear spherical wave models that render existing designs based on plane wave assumptions ineffective. In this paper, we focus on two crucial tasks in sensing and communications, respectively, i.e., localization and channel estimation, and investigate their joint design by exploring the near-field propagation characteristics, achieving mutual benefits between two tasks. In addition, multiple base stations (BSs) are leveraged to collaboratively facilitate a cooperative localization framework. To address the joint channel estimation and cooperative localization problem for near-field UM-MIMO systems, we propose a variational Newtonized near-field channel estimation (VNNCE) algorithm and a Gaussian fusion cooperative localization (GFCL) algorithm. The VNNCE algorithm exploits the spatial DoFs provided by the near-field channel to obtain position-related soft information, while the GFCL algorithm fuses this soft information to achieve more accurate localization. Additionally, we introduce a joint architecture that seamlessly integrates channel estimation and cooperative localization.

Submitted 21 December, 2023; originally announced December 2023.

Comments: Submit to JSAC

arXiv:2312.11201 [pdf, other]

A Refining Underlying Information Framework for Monaural Speech Enhancement

Authors: Rui Cao, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

Abstract: Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. By bridging the speech enhancement and the Information Bottleneck principle in this letter, we rethink a universal plug-and-play strategy and propose a Refining Underlying Information framework called RUI to rise to the challenges both in theory and practice. Specifically, we first transform the objective of speech enhancement into an incremental convergence problem of mutual information between comprehensive speech characteristics and individual speech characteristics, e.g., spectral and acoustic characteristics. By doing so, compared with the existing direct-fitting solutions, the underlying information stems from the conditional entropy of acoustic characteristic given spectral characteristics. Therefore, we design a dual-path multiple refinement iterator based on the chain rule of entropy to refine this underlying information for further approximating target speech. Experimental results on DNS-Challenge dataset show that our solution consistently improves 0.3+ PESQ score over baselines, with only additional 1.18 M parameters. The source code is available at https://github.com/caoruitju/RUI_SE.

Submitted 24 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: 5 pages

arXiv:2312.06094 [pdf, other]

doi 10.1145/3581783.3613463

MATK: The Meme Analytical Tool Kit

Authors: Ming Shan Hee, Aditi Kumaresan, Nguyen Khoi Hoang, Nirmalendu Prakash, Rui Cao, Roy Ka-Wei Lee

Abstract: The rise of social media platforms has brought about a new digital culture called memes. Memes, which combine visuals and text, can strongly influence public opinions on social and cultural issues. As a result, people have become interested in categorizing memes, leading to the development of various datasets and multimodal models that show promising results in this field. However, there is currently a lack of a single library that allows for the reproduction, evaluation, and comparison of these models using fair benchmarks and settings. To fill this gap, we introduce the Meme Analytical Tool Kit (MATK), an open-source toolkit specifically designed to support existing memes datasets and cutting-edge multimodal models. MATK aims to assist researchers and engineers in training and reproducing these multimodal models for meme classification tasks, while also providing analysis techniques to gain insights into their strengths and weaknesses. To access MATK, please visit https://github.com/Social-AI-Studio/MATK.

Submitted 10 December, 2023; originally announced December 2023.

Comments: Accepted at ACM Multimedia'23 Open-Source Software Competition Track

ACM Class: I.1.4

arXiv:2311.18446 [pdf, other]

Length-of-stay times in hospital for COVID-19 patients using the smoothed Beran's estimator with bootstrap bandwidth selection

Authors: Rebeca Peláez, Ricardo Cao, Juan Vilar

Abstract: The survival function of length-of-stay in hospital ward and ICU for COVID-19 patients is studied in this paper. Flexible statistical methods are used to estimate this survival function given relevant covariates such as age, sex, obesity and chronic obstructive pulmonary disease (COPD). A doubly-smoothed Beran's estimator has been considered to this aim. The bootstrap method has been used to produce new smoothing parameter selectors and to construct confidence regions for the conditional survival function. Some simulation studies show the good performance of the proposed methods.

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.14288 [pdf, other]

Fair Influence Maximization in Social Networks: A Community-Based Evolutionary Algorithm

Authors: Kaicong Ma, Xinxiang Xu, Haipeng Yang, Renzhi Cao, Lei Zhang

Abstract: Influence Maximization (IM), which attempts to find a subset of users that maximizes the influence spread, has been extensively studied in network science. A new variant of IM, Fair Influence Maximization (FIM), which primarily enhances the fair propagation of information, is attracting increasing attention in academia. However, existing algorithms for FIM suffer from a trade-off between fairness and running time, since it is a tough task to ensure that users are fairly influenced in terms of sensitive attributes, such as race or gender, while maintaining a high influence spread. To tackle this problem, in this paper, we propose an effective and efficient Community-based Evolutionary Algorithm for FIM (named CEA-FIM). In CEA-FIM, a community-based node selection strategy is proposed to identify potential nodes, which considers not only the size of the community but also the attributes of the nodes in the community. Subsequently, we design an evolutionary algorithm based on the proposed node selection strategy to hasten the search for the optimal solution, including the novel initialization, crossover and mutation strategies. We validate the proposed algorithm CEA-FIM by performing experiments on real-world and synthetic networks. The experimental results show that the proposed CEA-FIM achieves a better balance between effectiveness and efficiency, compared to the state-of-the-art baseline algorithms.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2310.18662 [pdf, other]

ASTormer: An AST Structure-aware Transformer Decoder for Text-to-SQL

Authors: Ruisheng Cao, Hanchong Zhang, Hongshen Xu, Jieyu Li, Da Ma, Lu Chen, Kai Yu

Abstract: Text-to-SQL aims to generate an executable SQL program given the user utterance and the corresponding database schema. To ensure the well-formedness of output SQLs, one prominent approach adopts a grammar-based recurrent decoder to produce the equivalent SQL abstract syntax tree (AST). However, previous methods mainly utilize an RNN-series decoder, which 1) is time-consuming and inefficient and 2) introduces very few structure priors. In this work, we propose an AST structure-aware Transformer decoder (ASTormer) to replace traditional RNN cells. The structural knowledge, such as node types and positions in the tree, is seamlessly incorporated into the decoder via both absolute and relative position embeddings. Besides, the proposed framework is compatible with different traversing orders even considering adaptive node selection. Extensive experiments on five text-to-SQL benchmarks demonstrate the effectiveness and efficiency of our structured decoder compared to competitive baselines.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.17342 [pdf, other]

ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought

Authors: Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, Kai Yu

Abstract: Recently Large Language Models (LLMs) have been proven to have strong abilities in various domains and tasks. We study the problem of prompt designing in the text-to-SQL task and attempt to improve the LLMs' reasoning ability when generating SQL queries. Besides the trivial few-shot in-context learning setting, we design our chain-of-thought (CoT) prompt with a similar method to schema linking. We provide a method named ACT-SQL to automatically generate auto-CoT exemplars and thus the whole process doesn't need manual labeling. Our approach is cost-saving since we only use the LLMs' API call once when generating one SQL query. Furthermore, we extend our in-context learning method to the multi-turn text-to-SQL task. The experiment results show that the LLMs' performance can benefit from our ACT-SQL approach. Our approach achieves SOTA performance on the Spider dev set among existing in-context learning approaches.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2309.13105 [pdf, other]

Nonabelian Kinetic Mixing in a Confining Phase

Authors: Gonzalo Alonso-Álvarez, Ruike Cao, James M. Cline, Karishma Moorthy, Tianzhuo Xiao

Abstract: Dark matter from a hidden sector with SU($N$) gauge symmetry can have a nonabelian kinetic mixing portal with the standard model. The dark photon becomes massive in the confining phase without the need for spontaneous symmetry breaking. Depending on the particle content of the dark sector, there can be two or more composite vectors that get kinetic mixing through a heavy mediator particle $X$. This provides a model of composite dark photons giving a portal for direct detection of dark baryons. Avoiding exotic charged relics requires additional couplings allowing $X$ to decay to dark quarks and standard model fields, leading to further portals between the dark matter and the standard model. We comprehensively study the constraints on such models from colliders, rare decays, direct detection, and big bang nucleosynthesis.

Submitted 22 September, 2023; originally announced September 2023.

Comments: 14 pages, 12 figures, comments welcome

arXiv:2309.02391 [pdf, other]

Empirical Review of Smart Contract and DeFi Security: Vulnerability Detection and Automated Repair

Authors: Peng Qian, Rui Cao, Zhenguang Liu, Wenqing Li, Ming Li, Lun Zhang, Yufeng Xu, Jianhai Chen, Qinming He

Abstract: Decentralized Finance (DeFi) is emerging as a peer-to-peer financial ecosystem, enabling participants to trade products on a permissionless blockchain. Built on blockchain and smart contracts, the DeFi ecosystem has experienced explosive growth in recent years. Unfortunately, smart contracts hold a massive amount of value, making them an attractive target for attacks. So far, attacks against smart contracts and DeFi protocols have resulted in billions of dollars in financial losses, severely threatening the security of the entire DeFi ecosystem. Researchers have proposed various security tools for smart contracts and DeFi protocols as countermeasures. However, a comprehensive investigation of these efforts is still lacking, leaving a crucial gap in our understanding of how to enhance the security posture of the smart contract and DeFi landscape. To fill the gap, this paper reviews the progress made in the field of smart contract and DeFi security from the perspective of both vulnerability detection and automated repair. First, we analyze the DeFi smart contract security issues and challenges. Specifically, we closely study various DeFi attack incidents and summarize the attacks into six categories. Then, we present an empirical study of 42 state-of-the-art techniques that can detect smart contract and DeFi vulnerabilities. In particular, we evaluate the effectiveness of traditional smart contract bug detection tools in analyzing complex DeFi protocols. Additionally, we investigate 8 existing automated repair tools for smart contracts and DeFi protocols, providing insight into their advantages and disadvantages. To make this work useful for as wide an audience as possible, we also identify several open issues and challenges in the DeFi ecosystem that should be addressed in the future.

Submitted 6 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

Comments: This paper is submitted to the journal of Expert Systems with Applications (ESWA) for review

arXiv:2309.00755 [pdf]

High-resolution, large field-of-view label-free imaging via aberration-corrected, closed-form complex field reconstruction

Authors: Ruizhi Cao, Cheng Shen, Changhuei Yang

Abstract: Computational imaging methods empower modern microscopy with the ability to produce high-resolution, large field-of-view, aberration-free images. One of the dominant computational label-free imaging methods, Fourier ptychographic microscopy (FPM), effectively increases the spatial-bandwidth product of conventional microscopy by using multiple tilted illuminations to achieve high-throughput imaging. However, its iterative reconstruction method is sensitive to parameter selection, can be computationally expensive and tends to fail under excessive aberrations. Recently, spatial Kramers-Kronig methods have shown that it is possible to analytically reconstruct the complex field, but they lack the ability to correct aberrations or provide extended resolution enhancement. Here, we present a closed-form method, termed APIC, which weds the strengths of both methods. A new analytical phase retrieval framework is established in APIC, which demonstrates, for the first time, the feasibility of analytically reconstructing the complex field associated with darkfield measurements. In addition, APIC can analytically retrieve complex aberrations of an imaging system with no additional hardware. By avoiding iterative algorithms, APIC requires no human designed convergence metric and always obtains a closed-form complex field solution. The faithfulness and correctness of APIC's reconstruction are guaranteed due to its analytical nature. We experimentally demonstrate that APIC gives correct reconstruction results while FPM fails to do so when constrained to the same number of measurements. Meanwhile, APIC achieves 2.8 times faster computation using an image tile size of 256 (length-wise). We also demonstrate that APIC is unprecedentedly robust against aberrations compared to FPM: APIC is capable of addressing aberrations whose maximal phase difference exceeds 3.8$\pi$ when using a NA 0.25 objective in experiment.

Submitted 1 September, 2023; originally announced September 2023.

Comments: 13 pages, 5 figures

arXiv:2308.16636 [pdf, other]

doi 10.1103/PhysRevC.108.064906

Effects of the $\alpha$-cluster structure and the intrinsic momentum component of nuclei on the longitudinal asymmetry in relativistic heavy-ion collisions

Authors: Ru-XIn Cao, Song Zhang, Yu-Gang Ma

Abstract: The longitudinal asymmetry in relativistic heavy ion collisions arises from the fluctuation in the number of nucleons involved. This asymmetry causes a rapidity shift in the center of mass of the participating zone. Both the rapidity shift and the longitudinal asymmetry have been found to be significant at the top CERN Large Hadron Collider (LHC) energy for collisions of identical nuclei, and the longitudinal asymmetry is important for reconstructing the colliding vertex and correcting the rapidity shift. However, much discussion of the longitudinal asymmetry has treated the initial condition as a nonzero momentum contributed only by the number of participants, i.e., the asymmetry depends only on the number of participating nucleons. So we naturally raise a physical problem, can other initial conditions, such as two typical initial conditions for nuclei, geometric configuration, and momentum distribution, provide effects on the longitudinal asymmetry? Therefore, in this work we consider other effects on the longitudinal asymmetry other than the fluctuation in the number of participants, e.g., the $\alpha$-clustering structure as well as the intrinsic momentum distribution in the target and projectile nuclei for the collisions in the framework of a multiphase transport (AMPT) model. By introducing systems with different $\alpha$-clustering structure and intrinsic momentum distribution, we calculated the ratio of the rapidity distributions of different systems and extracted expansion coefficients to analyze the difference contributed by these factors. ...

Submitted 4 January, 2024; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: 13 pages, 7 figures

Journal ref: Physical Review C 108, 064906 (2023)

arXiv:2308.14127 [pdf, other]

Information geometric regularization of the barotropic Euler equation

Authors: Ruijia Cao, Florian Schäfer

Abstract: A key numerical difficulty in compressible fluid dynamics is the formation of shock waves. Shock waves feature jump discontinuities in the velocity and density of the fluid and thus preclude the existence of classical solutions to the compressible Euler equations. Weak entropy solutions are commonly defined by viscous regularization, but even small amounts of viscosity can substantially change the long-term behavior of the solution. In this work, we propose the first inviscid regularization of the multidimensional Euler equation based on ideas from semidefinite programming, information geometry, geometric hydrodynamics, and nonlinear elasticity. From a Lagrangian perspective, shock formation in entropy solutions amounts to inelastic collisions of fluid particles. Their trajectories are akin to that of projected gradient descent on a feasible set of non-intersecting paths. We regularize these trajectories by replacing them with solution paths of interior point methods based on log determinantal barrier functions. These paths are geodesic curves with respect to the information geometry induced by the barrier function. Thus, our regularization replaces the Euclidean geometry of trajectories with a suitable information geometry. We extend this idea to infinite families of paths by viewing Euler's equations as a dynamical system on a diffeomorphism manifold. Our regularization embeds this manifold into an information geometric ambient space, equipping it with a geodesically complete geometry. Expressing the resulting Lagrangian equations in Eulerian form, we derive a regularized Euler equation in conservation form. Numerical experiments on one and two-dimensional problems show its promise as a numerical tool. While we focus on the barotropic Euler equations for concreteness and simplicity of exposition, our regularization easily extends to more general Euler and Navier-Stokes-type equations.

Submitted 18 March, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

MSC Class: 35L65; 76L05; 65M25; 76J20; 58B20

arXiv:2308.08088 [pdf, other]

Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection

Authors: Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, Jing Jiang

Abstract: Hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. Recent studies have tried to fine-tune pre-trained vision-language models (PVLMs) for this task. However, with increasing model sizes, it becomes important to leverage powerful PVLMs more efficiently, rather than simply fine-tuning them. Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering the two factors mentioned above, we propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.

Submitted 15 August, 2023; originally announced August 2023.

Comments: Camera-ready for ACM MM '23

arXiv:2306.14471 [pdf]

Single-shot 3D photoacoustic computed tomography with a densely packed array for transcranial functional imaging

Authors: Rui Cao, Yilin Luo, Jinhua Xu, Xiaofei Luo, Ku Geng, Yousuf Aborahama, Manxiu Cui, Samuel Davis, Shuai Na, Xin Tong, Cindy Liu, Karteek Sastry, Konstantin Maslov, Peng Hu, Yide Zhang, Li Lin, Yang Zhang, Lihong V. Wang

Abstract: Photoacoustic computed tomography (PACT) is emerging as a new technique for functional brain imaging, primarily due to its capabilities in label-free hemodynamic imaging. Despite its potential, the transcranial application of PACT has encountered hurdles, such as acoustic attenuations and distortions by the skull and limited light penetration through the skull. To overcome these challenges, we have engineered a PACT system that features a densely packed hemispherical ultrasonic transducer array with 3072 channels, operating at a central frequency of 1 MHz. This system allows for single-shot 3D imaging at a rate equal to the laser repetition rate, such as 20 Hz. We have achieved a single-shot light penetration depth of approximately 9 cm in chicken breast tissue utilizing a 750 nm laser (withstanding 3295-fold light attenuation and still retaining an SNR of 74) and successfully performed transcranial imaging through an ex vivo human skull using a 1064 nm laser. Moreover, we have proven the capacity of our system to perform single-shot 3D PACT imaging in both tissue phantoms and human subjects. These results suggest that our PACT system is poised to unlock potential for real-time, in vivo transcranial functional imaging in humans.

Submitted 26 June, 2023; originally announced June 2023.

arXiv:2306.11477 [pdf, other]

CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale and High Quality

Authors: Liang Li, Ruiying Geng, Chengyang Fang, Bing Li, Can Ma, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li

Abstract: Popular data-to-text datasets have three problems. First, the large-scale datasets either contain noise or lack real application scenarios. Second, the datasets close to real applications are relatively small in size. Last, current datasets are biased toward English, leaving other languages underexplored. To alleviate these limitations, in this paper, we present CATS, a pragmatic Chinese answer-to-sequence dataset with large scale and high quality. The dataset aims to generate textual descriptions for the answer in the practical TableQA system. Further, to bridge the structural gap between the input SQL and table and establish better semantic alignments, we propose a Unified Graph Transformation approach to establish a joint encoding space for the two hybrid knowledge resources and convert this task to a graph-to-text problem. The experiment results demonstrate the effectiveness of our proposed method. Further analysis on CATS attests to both the high quality and challenges of the dataset.

Submitted 20 June, 2023; originally announced June 2023.

Comments: ACL 2023

arXiv:2306.02625 [pdf, other]

Rethinking the visual cues in audio-visual speaker extraction

Authors: Junjie Li, Meng Ge, Zexu pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang

Abstract: The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction performance. This raises the question of how to better utilize visual cues. To address this issue, we propose two training strategies that decouple the learning of the two visual cues. Our experimental results demonstrate that both visual cues are useful, with the synchronization cue having a higher impact. We introduce a more explainable model, the Decoupled Audio-Visual Speaker Extraction (DAVSE) model, which leverages both visual cues.

Submitted 5 June, 2023; originally announced June 2023.

Comments: Accepted in Interspeech 2023

arXiv:2305.17369 [pdf, other]

Modularized Zero-shot VQA with Pre-trained Models

Authors: Rui Cao, Jing Jiang

Abstract: Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, which is still a capability that most PTMs lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes them less interpretable compared with a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub reasoning steps and is highly interpretable. We convert sub reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and better interpretability compared with several baselines.

Submitted 24 January, 2024; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: accepted as Findings in ACL 2023; Code available: https://github.com/abril4416/Mod-Zero-VQA

Showing 1–50 of 157 results for author: Cao, R