-
Spectral bounds of multi-way Cheeger constants via cyclomatic number
Authors:
Chuanyuan Ge
Abstract:
As a non-trivial extension of the celebrated Cheeger inequality, the higher-order Cheeger inequalities for graphs due to Lee, Oveis Gharan and Trevisan provide for each $k$ an upper bound for the $k$-way Cheeger constant of the form $C(k)\sqrt{λ_k(G)}$, where $λ_k(G)$ is the $k$-th eigenvalue of the graph Laplacian and $C(k)$ is a constant depending only on $k$. In this article, we prove some new bounds for multi-way Cheeger constants. By shifting the index of the eigenvalue via the cyclomatic number, we establish upper bound estimates with an absolute constant instead of $C(k)$. This, in particular, gives a more direct proof of Miclo's higher-order Cheeger inequalities on trees. We also show a new lower bound for the multi-way Cheeger constants in terms of the spectral radius of the graph. The proofs involve the concept of discrete nodal domains and a probabilistic argument showing generic properties of eigenfunctions.
Submitted 11 September, 2024;
originally announced September 2024.
-
Two Channels of Metal-Rich Compact Stellar System Formation: Starbursts Under High Ram Pressure vs. Tidal Stripping
Authors:
Yuan Bian,
Min Du,
Victor P. Debattista,
Dylan Nelson,
Mark A. Norris,
Luis C. Ho,
Shuai Lu,
Renyue Cen,
Shuo Ma,
Chong Ge,
Taotao Fang,
Hui Li
Abstract:
Most galaxies follow well-defined scaling relations of metallicity and stellar mass; however, some outliers at the low mass end of the observed galaxy population exhibit unusually high metallicity for their mass. Understanding how these objects get to be so metal-rich is vital for understanding the role of feedback in galaxy formation. Using the TNG50 simulation, we explore the origins of this phenomenon. We identify 227 metal-rich, Compact Stellar Systems (CSSs) that deviate significantly from this scaling relation. These CSSs are satellites located in the vicinity of massive host galaxies, with stellar masses ranging from $10^{8} M_{\odot}$ to $10^{10} M_{\odot}$ (including six systems that are close analogs of the M31-M32 system). Contrary to the previously assumed scenario that such objects are predominantly products of tidal stripping, our results suggest a more prevalent role for ram pressure in their formation. Indeed, 76\% (173) of these CSSs are formed through a burst of star formation occurring around the time of the first pericentric passage, typically at redshifts $z\lesssim1$, aided by strong ram pressure and tidal forces. The high ram pressure, resulting from the CSSs' rapid motion near the halo center, facilitates metal enrichment, producing high-metallicity CSSs by confining the metal-rich gas from bursty star formation, which leads to distinct stellar populations characterized by enhanced metallicity as well as high $α$-abundance. Only the remaining 24\% (54) of metal-rich CSSs are generated through the tidal stripping of massive progenitors. Our results further indicate that M32 is more likely to have formed through intense star formation events rather than through gradual, tidal stripping, thereby providing crucial insights into the nature of low mass, compact galaxy formation.
Submitted 8 September, 2024;
originally announced September 2024.
-
Automatic Organ and Pan-cancer Segmentation in Abdomen CT: the FLARE 2023 Challenge
Authors:
Jun Ma,
Yao Zhang,
Song Gu,
Cheng Ge,
Ershuai Wang,
Qin Zhou,
Ziyan Huang,
Pengju Lyu,
Jian He,
Bo Wang
Abstract:
Organ and cancer segmentation in abdominal Computed Tomography (CT) scans is a prerequisite for precise cancer diagnosis and treatment. Most existing benchmarks and algorithms are tailored to specific cancer types, limiting their ability to provide comprehensive cancer analysis. This work presents the first international competition on abdominal organ and pan-cancer segmentation by providing a large-scale and diverse dataset, including 4650 CT scans with various cancer types from over 40 medical centers. The winning team established a new state-of-the-art with a deep learning-based cascaded framework, achieving average Dice Similarity Coefficient scores of 92.3% for organs and 64.9% for lesions on the hidden multi-national testing set. The dataset and code of top teams are publicly available, offering a benchmark platform to drive further innovations: https://codalab.lisn.upsaclay.fr/competitions/12239.
Submitted 22 August, 2024;
originally announced August 2024.
-
Voltran: Unlocking Trust and Confidentiality in Decentralized Federated Learning Aggregation
Authors:
Hao Wang,
Yichen Cai,
Jun Wang,
Chuan Ma,
Chunpeng Ge,
Xiangmou Qu,
Lu Zhou
Abstract:
The decentralized Federated Learning (FL) paradigm built upon blockchain architectures leverages distributed node clusters to replace the single server for executing FL model aggregation. This paradigm tackles the vulnerability of the centralized malicious server in vanilla FL and inherits the trustfulness and robustness offered by blockchain. However, existing blockchain-enabled schemes face challenges related to inadequate confidentiality of models and the limited computational resources of blockchains for performing large-scale FL computations. In this paper, we present Voltran, an innovative hybrid platform designed to achieve trust, confidentiality, and robustness for FL based on the combination of the Trusted Execution Environment (TEE) and blockchain technology. We offload the FL aggregation computation into the TEE to provide an isolated, trusted and customizable off-chain execution, and then guarantee the authenticity and verifiability of aggregation results on the blockchain. Moreover, we provide strong scalability on multiple FL scenarios by introducing a multi-SGX parallel execution strategy to amortize the large-scale FL workload. We implement a prototype of Voltran and conduct a comprehensive performance evaluation. Extensive experimental results demonstrate that Voltran incurs minimal additional overhead while guaranteeing trust, confidentiality, and authenticity, and that it brings a significant speed-up compared to state-of-the-art ciphertext aggregation schemes.
Submitted 13 August, 2024;
originally announced August 2024.
-
Post-Measurement Pairing Quantum Key Distribution with Local Optical Frequency Standard
Authors:
Chengfang Ge,
Lai Zhou,
Jinping Lin,
Hua-Lei Yin,
Qiang Zeng,
Zhiliang Yuan
Abstract:
The idea of post-measurement coincidence pairing substantially simplifies long-distance, repeater-like quantum key distribution (QKD) by eliminating the need for tracking the differential phase of the users' lasers. However, optical frequency tracking remains necessary and can become a severe burden in future deployment of multi-node quantum networks. Here, we resolve this problem by referencing each user's laser to an absolute frequency standard and demonstrate a practical post-measurement pairing QKD with excellent long-term stability. We confirm the setup's repeater-like behavior and achieve a finite-size secure key rate (SKR) of 15.94 bit/s over 504 km fiber, which overcomes the absolute repeaterless bound by 1.28 times. Over a fiber length of 100 km, the setup delivers an impressive SKR of 285.68 kbit/s. Our work paves the way towards an efficient multi-user quantum network with the local frequency standard.
Submitted 20 July, 2024;
originally announced July 2024.
-
Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
Authors:
Daoyuan Chen,
Haibin Wang,
Yilun Huang,
Ce Ge,
Yaliang Li,
Bolin Ding,
Jingren Zhou
Abstract:
The emergence of large-scale multi-modal generative models has drastically advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through applications on state-of-the-art LLaVA-like and DiT based models, yields significant performance boosts, such as topping the VBench leaderboard. We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior. With the hope of fostering deeper understanding and future progress in multi-modal data and generative modeling, our codes, datasets, and models are maintained and accessible at https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md.
Submitted 16 July, 2024;
originally announced July 2024.
-
WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning
Authors:
Yiheng Li,
Chongjian Ge,
Chenran Li,
Chenfeng Xu,
Masayoshi Tomizuka,
Chen Tang,
Mingyu Ding,
Wei Zhan
Abstract:
We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a language annotation dataset built on WOMD, with a focus on describing and reasoning about interactions and intentions in driving scenarios. Previous language datasets primarily captured interactions caused by close distances. However, interactions induced by traffic rules and human intentions, which can occur over long distances, are not yet sufficiently covered, despite being very common and more challenging for prediction or planning models to understand. Therefore, our WOMD-Reasoning focuses extensively on these interactions, providing a total of 409k Q&As for varying types of interactions. Additionally, WOMD-Reasoning presents by far the largest Q&A dataset on real-world driving scenarios, with around 3 million Q&As covering various topics of autonomous driving, from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. This extensive textual information enables fine-tuning driving-related Large Language Models (LLMs) for a wide range of applications like scene description, prediction, planning, etc. By incorporating interaction and intention language from WOMD-Reasoning, we see significant enhancements in the performance of the state-of-the-art trajectory prediction model, Multipath++, with improvements of 10.14% in $MR_6$ and 6.90% in $minFDE_6$, proving the effectiveness of WOMD-Reasoning. We hope WOMD-Reasoning will empower LLMs in driving to offer better interaction understanding and behavioral reasoning. The dataset is available at https://waymo.com/open/download.
Submitted 5 July, 2024;
originally announced July 2024.
-
Secure Outsourced Decryption for FHE-based Privacy-preserving Cloud Computing
Authors:
Xirong Ma,
Chuan Li,
Yuchang Hu,
Yunting Tao,
Yali Jiang,
Yanbin Li,
Fanyu Kong,
Chunpeng Ge
Abstract:
The demand for processing vast volumes of data has surged dramatically due to the advancement of machine learning technology. Large-scale data processing necessitates substantial computational resources, prompting individuals and enterprises to turn to cloud services. Accompanying this trend is a growing concern regarding data leakage and misuse. Homomorphic encryption (HE) is one solution for safeguarding data privacy, enabling encrypted data to be processed securely in the cloud. However, the encryption and decryption routines of some HE schemes require considerable computational resources, presenting non-trivial work for clients. In this paper, we propose an outsourced decryption protocol for the prevailing RLWE-based fully homomorphic encryption schemes. The protocol splits the original decryption into two routines, with the computationally intensive part executed remotely by the cloud. Its security relies on an invariant of the NTRU-search problem with a newly designed blinding key distribution. Cryptographic analyses are conducted to configure protocol parameters across varying security levels. Our experiments demonstrate that the proposed protocol achieves up to a $67\%$ acceleration in the client's local decryption, accompanied by a $50\%$ reduction in space usage.
Submitted 9 July, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
Quick and Simple Kernel Differential Equation Regression Estimators for Data with Sparse Design
Authors:
Chunlei Ge,
W. John Braun
Abstract:
Local polynomial regression of order at least one often performs poorly in regions of sparse data. Local constant regression is exceptional in this regard, though it is the least accurate method in general, especially at the boundaries of the data. Incorporating information from differential equations which may approximately or exactly hold is one way of extending the sparse design capacity of local constant regression while reducing bias and variance. A nonparametric regression method that exploits first order differential equations is introduced in this paper and applied to noisy mouse tumour growth data. Asymptotic biases and variances of kernel estimators using Taylor polynomials with different degrees are discussed. Model comparison is performed for different estimators through simulation studies under various scenarios which simulate exponential-type growth.
Submitted 14 June, 2024;
originally announced June 2024.
-
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
Authors:
Jiayi Guo,
Junhao Zhao,
Chunjiang Ge,
Chaoqun Du,
Zanlin Ni,
Shiji Song,
Humphrey Shi,
Gao Huang
Abstract:
Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditional diffusion model, which is also trained on the source domain to transform target data into synthetic data as a source domain projection. This allows the source model to make predictions without weight adaptation. In this paper, we argue that the domains of the source model and the synthetic data in diffusion-driven TTA methods are not aligned. To adapt the source model to the synthetic domain of the unconditional diffusion model, we introduce a Synthetic-Domain Alignment (SDA) framework to fine-tune the source model with synthetic data. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This process mitigates the potential domain gap between the conditional and unconditional models. Extensive experiments across various models and benchmarks demonstrate that SDA achieves superior domain alignment and consistently outperforms existing diffusion-driven TTA methods. Our code is available at https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment.
Submitted 6 June, 2024;
originally announced June 2024.
-
Demystify Mamba in Vision: A Linear Attention Perspective
Authors:
Dongchen Han,
Ziyi Wang,
Zhuofan Xia,
Yizeng Han,
Yifan Pu,
Chunjiang Ge,
Jun Song,
Shiji Song,
Bo Zheng,
Gao Huang
Abstract:
Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with the linear attention Transformer, which typically underperforms the conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of the linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba's success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https://github.com/LeapLabTHU/MLLA.
Submitted 26 May, 2024;
originally announced May 2024.
-
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Authors:
Chunjiang Ge,
Sijie Cheng,
Ziming Wang,
Jiale Yuan,
Yuan Gao,
Jun Song,
Shiji Song,
Gao Huang,
Bo Zheng
Abstract:
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava.
Submitted 24 May, 2024;
originally announced May 2024.
-
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining
Authors:
Ce Ge,
Zhijian Ma,
Daoyuan Chen,
Yaliang Li,
Bolin Ding
Abstract:
Large language models exhibit exceptional generalization capabilities, primarily attributed to the utilization of diversely sourced data. However, conventional practices in integrating this diverse data heavily rely on heuristic schemes, lacking theoretical guidance. This research tackles these limitations by investigating strategies based on low-cost proxies for data mixtures, with the aim of streamlining data curation to enhance training efficiency. Specifically, we propose a unified scaling law, termed $\textbf{BiMix}$, which accurately models the bivariate scaling behaviors of both data quantity and mixing proportions. We conduct systematic experiments and provide empirical evidence for the predictive power and fundamental principles of $\textbf{BiMix}$. Notably, our findings reveal that entropy-driven training-free data mixtures can achieve comparable or even better performance than more resource-intensive methods. We hope that our quantitative insights can shed light on further judicious research and development in cost-effective language modeling.
Submitted 11 July, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
RGB Guided ToF Imaging System: A Survey of Deep Learning-based Methods
Authors:
Xin Qiao,
Matteo Poggi,
Pengchao Deng,
Hao Wei,
Chenyang Ge,
Stefano Mattoccia
Abstract:
Integrating an RGB camera into a ToF imaging system has become a significant technique for perceiving the real world. The RGB guided ToF imaging system is crucial to several applications, including face anti-spoofing, saliency detection, and trajectory prediction. Depending on the distance of the working range, the implementation schemes of the RGB guided ToF imaging systems are different. Specifically, ToF sensors with a uniform field of illumination, which can output dense depth but have low resolution, are typically used for close-range measurements. In contrast, LiDARs, which emit laser pulses and can only capture sparse depth, are usually employed for long-range detection. In the two cases, depth quality improvement for RGB guided ToF imaging corresponds to two sub-tasks: guided depth super-resolution and guided depth completion. In light of the recent significant boost to the field provided by deep learning, this paper comprehensively reviews the works related to RGB guided ToF imaging, including network structures, learning strategies, evaluation metrics, benchmark datasets, and objective functions. Besides, we present quantitative comparisons of state-of-the-art methods on widely used benchmark datasets. Finally, we discuss future trends and the challenges in real applications for further research.
Submitted 16 May, 2024;
originally announced May 2024.
-
Towards Extreme Image Compression with Latent Feature Guidance and Diffusion Prior
Authors:
Zhiyuan Li,
Yanhui Zhou,
Hao Wei,
Chenyang Ge,
Jingwen Jiang
Abstract:
Image compression at extremely low bitrates (below 0.1 bits per pixel (bpp)) is a significant challenge due to substantial information loss. In this work, we propose a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates. In the first stage, we treat the latent representation of images in the diffusion space as guidance, employing a VAE-based compression approach to compress images and initially decode the compressed information into content variables. The second stage leverages pre-trained stable diffusion to reconstruct images under the guidance of content variables. Specifically, we introduce a small control module to inject content information while keeping the stable diffusion model fixed to maintain its generative capability. Furthermore, we design a space alignment loss to force the content variables to align with the diffusion space and provide the necessary constraints for optimization. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in terms of visual performance at extremely low bitrates. The source code and trained models are available at https://github.com/huai-chang/DiffEIC.
Submitted 3 September, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
Authors:
Marcos V. Conde,
Zhijun Lei,
Wen Li,
Cosmin Stejerean,
Ioannis Katsavounidis,
Radu Timofte,
Kihwan Yoon,
Ganzorig Gankhuyag,
Jiangtao Lv,
Long Sun,
Jinshan Pan,
Jiangxin Dong,
Jinhui Tang,
Zhiyuan Li,
Hao Wei,
Chenyang Ge,
Dongyang Zhang,
Tianle Liu,
Huaian Chen,
Yi Jin,
Menghan Zhou,
Yiqiang Yan,
Si Gao,
Biao Wu,
Shaoli Liu
, et al. (50 additional authors not shown)
Abstract:
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.
Submitted 25 April, 2024;
originally announced April 2024.
-
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Authors:
Ruyi Xu,
Yuan Yao,
Zonghao Guo,
Junbo Cui,
Zanlin Ni,
Chunjiang Ge,
Tat-Seng Chua,
Zhiyuan Liu,
Maosong Sun,
Gao Huang
Abstract:
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
Submitted 18 March, 2024;
originally announced March 2024.
-
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Authors:
Junsong Chen,
Chongjian Ge,
Enze Xie,
Yue Wu,
Lewei Yao,
Xiaozhe Ren,
Zhongdao Wang,
Ping Luo,
Huchuan Lu,
Zhenguo Li
Abstract:
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
Submitted 17 March, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
Authors:
Lukas Gianinazzi,
Alexandros Nikolaos Ziogas,
Langwen Huang,
Piotr Luczynski,
Saleh Ashkboos,
Florian Scheidl,
Armon Carigiet,
Chio Ge,
Nabil Abubaker,
Maciej Besta,
Tal Ben-Nun,
Torsten Hoefler
Abstract:
We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
Submitted 20 March, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Authors:
Yao Mu,
Junting Chen,
Qinglong Zhang,
Shoufa Chen,
Qiaojun Yu,
Chongjian Ge,
Runjian Chen,
Zhixuan Liang,
Mengkang Hu,
Chaofan Tao,
Peize Sun,
Haibao Yu,
Chao Yang,
Wenqi Shao,
Wenhai Wang,
Jifeng Dai,
Yu Qiao,
Mingyu Ding,
Ping Luo
Abstract:
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.
Submitted 25 February, 2024;
originally announced February 2024.
-
Traditional Transformation Theory Guided Model for Learned Image Compression
Authors:
Zhiyuan Li,
Chenyang Ge,
Shun Li
Abstract:
Recently, many deep image compression methods have been proposed and achieved remarkable performance. However, these methods are dedicated to optimizing the compression performance and speed at medium and high bitrates, while research on ultra-low bitrates is limited. In this work, we propose an invertible encoding network enhanced for ultra-low bitrates and guided by traditional transformation theory. Experiments show that our codec outperforms existing methods in both compression and reconstruction performance. Specifically, we introduce the Block Discrete Cosine Transformation to model the sparsity of features and employ traditional Haar transformation to improve the reconstruction performance of the model without increasing the bitstream cost.
Submitted 24 February, 2024;
originally announced February 2024.
-
Diagnosing the particle transport mechanism in the pulsar halo via X-ray observations
Authors:
Qi-Zuo Wu,
Chao-Ming Li,
Xuan-Han Liang,
Chong Ge,
Ruo-Yu Liu
Abstract:
Pulsar halos (also termed 'TeV halos') are a new class of $γ$-ray sources in the Galaxy, which manifest as extended $γ$-ray emission around middle-aged pulsars, as discovered around the Geminga pulsar, the Monogem pulsar and PSR J0622+3749 by HAWC and LHAASO. A consensus has been reached that the TeV emission comes from the inverse Compton scattering of electrons/positrons escaping from the pulsar wind nebula (PWN) off the soft background radiation field, while the particle transport mechanism in the halo is still in dispute. Currently, there are three main interpretations: the isotropic, suppressed diffusion model; the isotropic, unsuppressed diffusion model that accounts for ballistic propagation of newly injected particles; and the anisotropic diffusion model. While the $γ$-ray surface brightness profiles predicted by all three models can be more or less consistent with the observation, the implications of the three models for cosmic-ray transport mechanisms and the properties of the interstellar magnetic field are quite different. In this study, we calculate the anticipated X-ray emission of pulsar halos under the three models. We show that the synchrotron radiation of these escaping electrons can produce a corresponding X-ray halo around the pulsar, and the expected surface brightness profiles are distinct among the three models. We suggest that sensitive X-ray detectors with a large field of view (such as eROSITA and Einstein Probe) and a reasonably long exposure time are crucial for understanding the formation mechanism of pulsar halos and can serve as a probe of the properties of interstellar turbulence.
Submitted 31 January, 2024;
originally announced January 2024.
-
PeVatron Candidate SNR G106.3+2.7 in a Low-density Cavity: a Multiwavelength Test
Authors:
Yiwei Bao,
Ruo-Yu Liu,
Chong Ge,
Yang Chen
Abstract:
In this paper, we constrain the density of the interstellar medium (ISM) around the hadronic PeVatron candidate, supernova remnant (SNR) G106.3+2.7, based on X-ray and $γ$-ray observations. The purpose of this investigation is to understand the influence of the gaseous environment on this SNR as a proton PeVatron candidate. By modelling the self-regulated propagation of the cosmic rays (CRs) injected from the SNR, we calculate the $γ$-ray emission of CRs via the hadronuclear interactions with the molecular cloud and the ISM, and use the measured $γ$-ray flux to constrain the ISM density around the SNR. Our results support the picture that the SNR is expanding into a low-density ($n<0.05\,\mathrm{cm}^{-3}$) cavity, enabling the SNR to be a potential proton PeVatron even though it is no longer in its very early phase.
Submitted 5 January, 2024;
originally announced January 2024.
-
Approximate Integer Solution Counts over Linear Arithmetic Constraints
Authors:
Cunjing Ge
Abstract:
Counting integer solutions of linear constraints has found interesting applications in various fields. It is equivalent to the problem of counting lattice points inside a polytope. However, state-of-the-art algorithms for this problem become too slow for even a modest number of variables. In this paper, we propose a new framework to approximate the lattice counts inside a polytope with a new random-walk sampling method. The counts computed by our approach are proved to be approximately bounded by an $(ε, δ)$-bound. Experiments on extensive benchmarks show that our algorithm can solve polytopes with dozens of dimensions, significantly outperforming state-of-the-art counters.
Submitted 14 December, 2023;
originally announced December 2023.
-
Improving the Robustness of Transformer-based Large Language Models with Dynamic Attention
Authors:
Lujia Shen,
Yuwen Pu,
Shouling Ji,
Changjiang Li,
Xuhong Zhang,
Chunpeng Ge,
Ting Wang
Abstract:
Transformer-based models, such as BERT and GPT, have been widely adopted in natural language processing (NLP) due to their exceptional performance. However, recent studies show their vulnerability to textual adversarial attacks where the model's output can be misled by intentionally manipulating the text inputs. Although various methods have been proposed to enhance the model's robustness and mitigate this vulnerability, many require heavy resource consumption (e.g., adversarial training) or only provide limited protection (e.g., defensive dropout). In this paper, we propose a novel method called dynamic attention, tailored for the transformer architecture, to enhance the inherent robustness of the model itself against various adversarial attacks. Our method requires no downstream task knowledge and does not incur additional costs. The proposed dynamic attention consists of two modules: (i) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens. Extensive experiments demonstrate that dynamic attention significantly mitigates the impact of adversarial attacks, performing up to 33% better than previous methods against widely-used adversarial attacks. The model-level design of dynamic attention enables it to be easily combined with other defense methods (e.g., adversarial training) to further enhance the model's robustness. Furthermore, we demonstrate that dynamic attention preserves the state-of-the-art robustness space of the original model compared to other dynamic modeling methods.
Submitted 29 November, 2023; v1 submitted 29 November, 2023;
originally announced November 2023.
-
Metal-to-insulator transition in oxide semimetals by anion doping
Authors:
Haitao Hong,
Huimin Zhang,
Shan Lin,
Jeffrey A. Dhas,
Binod Paudel,
Shuai Xu,
Shengru Chen,
Ting Cui,
Yiyan Fan,
Dongke Rong,
Qiao Jin,
Zihua Zhu,
Yingge Du,
Scott A. Chambers,
Chen Ge,
Can Wang,
Qinghua Zhang,
Le Wang,
Kui-juan Jin,
Shuai Dong,
Er-Jia Guo
Abstract:
Oxide semimetals exhibiting both nontrivial topological characteristics and multiple degrees of freedom stand as exemplary parent compounds, offering great promise for the realization of novel electronic states. In this study, we present compelling evidence of profound structural and transport phase shifts in a recently uncovered oxide semimetal, SrNbO3, achieved through effective in-situ anion doping. Notably, a remarkable increase in resistivity of more than three orders of magnitude at room temperature is observed upon nitrogen-doping. The extent of electronic modulation in SrNbO3 is strongly correlated with the misfit strain, underscoring its phase instability to both chemical doping and crystallographic symmetry variations. Using first-principles calculations, we discern that elevating the level of nitrogen doping induces an upward shift in the conductive bands of SrNbO3-δNδ. Consequently, a transition from a metallic state to an insulating state becomes apparent as the nitrogen concentration reaches a threshold of 1/3. This investigation sheds light on the potential of anion engineering in oxide semimetals, offering pathways for manipulating their physical properties. These insights hold promise for future applications that harness these materials for tailored functionalities.
Submitted 27 November, 2023;
originally announced November 2023.
-
Advancing Vision Transformers with Group-Mix Attention
Authors:
Chongjian Ge,
Xiaohan Ding,
Zhan Tong,
Li Yuan,
Jiangliu Wang,
Yibing Song,
Ping Luo
Abstract:
Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention map generated from the Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity. Thereby, we propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes. To this end, GMA splits the Query, Key, and Value into segments uniformly and performs different group aggregations to generate group proxies. The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value. Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which achieves state-of-the-art performance in image classification, object detection, and semantic segmentation with fewer parameters than existing models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
Submitted 25 November, 2023;
originally announced November 2023.
-
Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Authors:
Yuanfeng Ji,
Chongjian Ge,
Weikai Kong,
Enze Xie,
Zhengying Liu,
Zhengguo Li,
Ping Luo
Abstract:
With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which explores LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to serve as judges, implementing the quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs.
Submitted 24 November, 2023;
originally announced November 2023.
-
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Authors:
Kai Yang,
Jian Tao,
Jiafei Lyu,
Chunjiang Ge,
Jiaxin Chen,
Qimai Li,
Weihan Shen,
Xiaolong Zhu,
Xiu Li
Abstract:
Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time- and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving more direct and cost-effective while minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges in settings that lack robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO.
Submitted 23 March, 2024; v1 submitted 22 November, 2023;
originally announced November 2023.
-
Strain mediated phase crossover in Ruddlesden Popper nickelates
Authors:
Ting Cui,
Songhee Choi,
Ting Lin,
Chen Liu,
Gang Wang,
Ningning Wang,
Shengru Chen,
Haitao Hong,
Dongke Rong,
Qianying Wang,
Qiao Jin,
Jia-Ou Wang,
Lin Gu,
Chen Ge,
Can Wang,
Jin Guang Cheng,
Qinghua Zhang,
Liang Si,
Kui-juan Jin,
Er-Jia Guo
Abstract:
Recent progress on the signatures of pressure-induced high temperature superconductivity in Ruddlesden Popper (RP) nickelates (La$_{n+1}$Ni$_n$O$_{3n+1}$) has attracted growing interest in both theoretical calculations and experimental efforts. The fabrication of high-quality single crystalline RP nickelate thin films is critical for possibly reducing the superconducting transition pressure and advancing applications in microelectronics in the future. In this study, we report the observations of an active phase transition in RP nickelate films induced by misfit strain. We found that RP nickelate films favor the perovskite structure (n = infinite) under tensile strains, while compressive strains stabilize the La3Ni2O7 (n = 2) phase. The selection of distinct phases is governed by the strain dependent formation energy and electronic configuration. In compressively strained La3Ni2O7, we experimentally determined that the splitting energy is ~0.2 eV and that electrons prefer to occupy in-plane orbitals. First principles calculations unveil a robust coupling between strain effects and the valence state of Ni ions in RP nickelates, suggesting a dual driving force for the inevitable phase co-existence transition in RP nickelates. Our work underscores the sensitivity of RP nickelate formation to epitaxial strain, presenting a significant challenge in fabricating pure-phase RP nickelate films. Therefore, special attention to stacking defects and grain boundaries between different RP phases is essential when discussing the pressure-induced superconductivity in RP nickelates.
Submitted 22 November, 2023;
originally announced November 2023.
-
New graph invariants based on $p$-Laplacian eigenvalues
Authors:
Chuanyuan Ge,
Shiping Liu,
Dong Zhang
Abstract:
We present monotonicity inequalities for certain functions involving eigenvalues of $p$-Laplacians on signed graphs with respect to $p$. Inspired by such monotonicity, we propose new spectrum-based graph invariants, called (variational) cut-off adjacency eigenvalues, that are relevant to certain eigenvector-dependent nonlinear eigenvalue problems. Using these invariants, we obtain new lower bounds for the $p$-Laplacian variational eigenvalues, essentially giving the state-of-the-art spectral asymptotics for these eigenvalues. Moreover, based on such invariants, we establish two inertia bounds regarding the cardinalities of a maximum independent set and a minimum edge cover, respectively. The first inertia bound enhances the classical Cvetković bound, and the second one implies that the $k$-th $p$-Laplacian variational eigenvalue is of the order $2^p$ as $p$ tends to infinity whenever $k$ is larger than the cardinality of a minimum edge cover of the underlying graph. We further discover an interesting connection between graph $p$-Laplacian eigenvalues and tensor eigenvalues and discuss applications of our invariants to spectral problems of tensors.
Submitted 31 October, 2023; v1 submitted 12 October, 2023;
originally announced October 2023.
-
InstructDET: Diversifying Referring Object Detection with Generalized Instructions
Authors:
Ronghao Dang,
Jiangyan Feng,
Haodong Zhang,
Chongjian Ge,
Lin Song,
Lijun Gong,
Chengju Liu,
Qijun Chen,
Feng Zhu,
Rui Zhao,
Yibing Song
Abstract:
We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce a large number of instructions that refer to every single object and different combinations of multiple objects. Each instruction and its corresponding object bounding boxes (bbxs) constitute one training data pair. In order to encompass common detection expressions, we employ emerging vision-language models (VLMs) and large language models (LLMs) to generate instructions guided by text prompts and object bbxs, as the generalization abilities of foundation models are effective in producing human-like expressions (e.g., describing object property, category, and relationship). We name our constructed dataset InDET. It contains images, bbxs and generalized instructions that are from foundation models. Our InDET is developed from existing REC datasets and object detection datasets, with the potential to expand, since any image with object bbxs can be incorporated using our InstructDET method. By using our InDET dataset, we show that a conventional ROD model surpasses existing methods on standard REC datasets and our InDET test set. Our data-centric method InstructDET, with automatic data expansion by leveraging foundation models, points to a promising direction in which ROD can be greatly diversified to execute common object detection instructions.
Submitted 11 March, 2024; v1 submitted 8 October, 2023;
originally announced October 2023.
-
PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Authors:
Junsong Chen,
Jincheng Yu,
Chongjian Ge,
Lewei Yao,
Enze Xie,
Yue Wu,
Zhongdao Wang,
James Kwok,
Ping Luo,
Huchuan Lu,
Zhenguo Li
Abstract:
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$α$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$α$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$α$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$α$ excels in image quality, artistry, and semantic control. We hope PIXART-$α$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
Submitted 29 December, 2023; v1 submitted 30 September, 2023;
originally announced October 2023.
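The cost comparison quoted in the abstract above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are illustrative, and the figures are exactly those stated in the abstract, not additional data.

```python
# Figures quoted in the PIXART-α abstract (A100 GPU days and USD).
pixart_gpu_days = 675
sd15_gpu_days = 6250
pixart_cost_usd = 26_000
sd15_cost_usd = 320_000

# Fraction of Stable Diffusion v1.5's training time used by PIXART-α.
time_fraction = pixart_gpu_days / sd15_gpu_days

# Absolute saving, which the abstract rounds to "nearly $300,000".
savings_usd = sd15_cost_usd - pixart_cost_usd

print(f"training-time fraction: {time_fraction:.1%}")
print(f"cost savings: ${savings_usd:,}")
```

The ratio comes out at 10.8% and the saving at \$294,000, consistent with the rounded figures in the abstract.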
-
A Variational Spike-and-Slab Approach for Group Variable Selection
Authors:
Buyu Lin,
Changhao Ge,
Jun S. Liu
Abstract:
We introduce a class of generic spike-and-slab priors for high-dimensional linear regression with grouped variables and present a Coordinate-ascent Variational Inference (CAVI) algorithm for obtaining an optimal variational Bayes approximation. Using parameter expansion for a specific, yet comprehensive, family of slab distributions, we obtain a further gain in computational efficiency. The method can be easily extended to fitting additive models. Theoretically, we present general conditions on the generic spike-and-slab priors that enable us to derive the contraction rates for both the true posterior and the VB posterior for linear regression and additive models, of which some previous theoretical results can be viewed as special cases. Our simulation studies and real data application demonstrate that the proposed method is superior to existing methods in both variable selection and parameter estimation. Our algorithm is implemented in the R package GVSSB.
Submitted 28 September, 2023;
originally announced September 2023.
-
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Authors:
Jiangliu Wang,
Jianbo Jiao,
Yibing Song,
Stephen James,
Zhan Tong,
Chongjian Ge,
Pieter Abbeel,
Yun-hui Liu
Abstract:
This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the size of negative pairs, resulting in a significant enhancement in the learned representations, and (2) it changes the strict correlation between audio-visual pairs but introduces a partial relationship between the augmented pairs, which is modeled by our proposed SoftInfoNCE loss to further boost the performance. Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
Submitted 25 September, 2023;
originally announced September 2023.
-
Constraining baryon loading efficiency of AGNs with diffuse neutrino flux from galaxy clusters
Authors:
Xin-Yue Shi,
Ruo-Yu Liu,
Chong Ge,
Xiang-Yu Wang
Abstract:
The active galactic nuclei (AGNs) are widely believed to be one of the promising acceleration sites of ultrahigh-energy cosmic rays (CRs). Essentially, AGNs are powered by the gravitational energy of matter falling to supermassive black holes. However, the conversion efficiency of gravitational to kinetic energy of CRs in AGNs, which is defined as baryon loading factor $η_p$, is not well known yet. After being accelerated, high-energy CRs could escape the host galaxy and enter the intra-cluster medium (ICM). These CRs can be confined within the galaxy cluster and produce $γ$-rays and neutrinos through proton-proton collisions with the ICM. In this paper, we study the diffusion of CRs in galaxy clusters and calculate the diffuse neutrino flux from galaxy cluster population. Using the latest upper limits on the cumulative unresolved TeV-PeV neutrino flux from galaxy clusters posed by the IceCube Neutrino Observatory, we derive the upper limit of the average baryon loading factor as $η_{p,\mathrm{grav}} \lesssim 2 \times 10^{-3} - 0.1$ for the population of galaxy clusters. This constraint is more stringent than the one obtained from $γ$-ray observation on the Coma cluster.
Submitted 17 September, 2023;
originally announced September 2023.
-
Data-Juicer: A One-Stop Data Processing System for Large Language Models
Authors:
Daoyuan Chen,
Yilun Huang,
Zhijian Ma,
Hesen Chen,
Xuchen Pan,
Ce Ge,
Dawei Gao,
Yuexiang Xie,
Zhaoyang Liu,
Jinyang Gao,
Yaliang Li,
Bolin Ding,
Jingren Zhou
Abstract:
The immense evolution in Large Language Models (LLMs) has underscored the importance of massive, heterogeneous, and high-quality data. A data recipe is a mixture of data from different sources for training LLMs, which plays a vital role in LLMs' performance. Existing open-source tools for LLM data processing are mostly tailored for specific data recipes. To continuously uncover the potential of LLMs, incorporate data from new sources, and improve LLMs' performance, we build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes, explore different possibilities in forming data mixtures, and evaluate their effects on model performance. Different from traditional data-analytics pipelines, Data-Juicer faces some unique challenges. Firstly, the possible data sources for forming data recipes are truly heterogeneous and massive with various qualities. Secondly, it is extremely expensive to precisely evaluate data recipes' impact on LLMs' performance. Thirdly, the end users of Data-Juicer, model developers, need sufficient flexibility to configure and evaluate different data recipes.
Data-Juicer features a fine-grained abstraction of pipelines for constructing data recipes, with over 50 built-in operators for easy composition and extension. By incorporating visualization and auto-evaluation capabilities, Data-Juicer enables a timely feedback loop for both LLM pre-training and fine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems for LLM training, evaluation, and distributed computing. The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs, by up to 7.45% increase in averaged score across 16 LLM benchmarks and 17.5% higher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and tutorials are released, calling for broader data-centric research on training and understanding LLMs.
Submitted 20 December, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
The Multi-modality Cell Segmentation Challenge: Towards Universal Solutions
Authors:
Jun Ma,
Ronald Xie,
Shamini Ayyadhury,
Cheng Ge,
Anubha Gupta,
Ritu Gupta,
Song Gu,
Yao Zhang,
Gihun Lee,
Joonkee Kim,
Wei Lou,
Haofeng Li,
Eric Upschulte,
Timo Dickscheid,
José Guilherme de Almeida,
Yixin Wang,
Lin Han,
Xin Yang,
Marco Labagnara,
Vojislav Gligorovski,
Maxime Scheder,
Sahand Jamal Rahi,
Carly Kempster,
Alice Pollitt,
Leon Espinosa
, et al. (15 additional authors not shown)
Abstract:
Cell segmentation is a critical step for quantitative single-cell analysis in microscopy images. Existing cell segmentation methods are often tailored to specific modalities or require manual interventions to specify hyper-parameters in different experimental settings. Here, we present a multi-modality cell segmentation benchmark, comprising over 1500 labeled images derived from more than 50 diverse biological experiments. The top participants developed a Transformer-based deep-learning algorithm that not only exceeds existing methods but can also be applied to diverse microscopy images across imaging platforms and tissue types without manual parameter adjustments. This benchmark and the improved algorithm offer promising avenues for more accurate and versatile cell analysis in microscopy imaging.
Submitted 1 April, 2024; v1 submitted 10 August, 2023;
originally announced August 2023.
-
Unleashing the Strengths of Unlabeled Data in Pan-cancer Abdominal Organ Quantification: the FLARE22 Challenge
Authors:
Jun Ma,
Yao Zhang,
Song Gu,
Cheng Ge,
Shihao Ma,
Adamo Young,
Cheng Zhu,
Kangkang Meng,
Xin Yang,
Ziyan Huang,
Fan Zhang,
Wentao Liu,
YuanKe Pan,
Shoujin Huang,
Jiacheng Wang,
Mingze Sun,
Weixin Xu,
Dengqiang Jia,
Jae Won Choi,
Natália Alves,
Bram de Wilde,
Gregor Koehler,
Yajun Wu,
Manuel Wiesenfarth,
Qiongjie Zhu
, et al. (4 additional authors not shown)
Abstract:
Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automate this process. However, most existing AI algorithms rely on many expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans with different races, diseases, phases, and manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% by using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. The best-performing algorithms successfully generalized to holdout external validation sets, achieving a median DSC of 89.5%, 90.9%, and 88.3% on North American, European, and Asian cohorts, respectively. They also enabled automatic extraction of key organ biology features, which was labor-intensive with traditional manual measurements. This opens the potential to use unlabeled data to boost performance and alleviate annotation shortages for modern AI models.
Submitted 10 August, 2023;
originally announced August 2023.
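The Dice Similarity Coefficient (DSC) reported in the abstract above is the standard overlap measure DSC = 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. A minimal sketch of that definition follows. It assumes flat binary masks given as Python sequences, the function name is hypothetical, and it is not the challenge's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention
        # assumed here, not stated in the abstract).
        return 1.0
    return 2.0 * intersection / total


predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
dsc = dice_coefficient(predicted, reference)
print(f"DSC = {dsc:.3f}")
```

In this toy example the masks share two foreground voxels out of three each, so the DSC is 4/6, roughly 0.667.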
-
A detached double X-ray tail in the merging galaxy cluster Z8338 with a large double tail
Authors:
Chong Ge,
Ming Sun,
Paul E. J. Nulsen,
Craig Sarazin,
Maxim Markevitch,
Gerrit Schellenberger
Abstract:
When subhalos infall into galaxy clusters, their gas content is ram pressure stripped by the intracluster medium (ICM) and may turn into cometary tails. We report the discovery of two spectacular X-ray double tails in a single galaxy cluster, Z8338, revealed by 70 ks Chandra observations. The brighter one, with an X-ray bolometric luminosity of $3.9 \times 10^{42}{\rm\ erg\ s}^{-1}$, is a detached tail stripped from the host halo and extended at least 250 kpc in projection. The head of the detached tail is a cool core with the front tip of the cold front $\sim$ 30 kpc away from the nucleus of its former host galaxy. The cooling time of the detached cool core is $\sim 0.3$ Gyr. For the detached gas, the gravity of the once-associated dark matter halo further enhances the Rayleigh-Taylor (RT) instability. From its survival, we find that a magnetic field of a few $μ$G is required to suppress the hydrodynamic instability. The X-ray temperature in the tail increases from 0.9 keV at the front tip to 1.6 keV in the wake region, which suggests the turbulent mixing with the hotter ICM. The fainter double X-ray tail, with a total X-ray luminosity of $2.7 \times 10^{42}{\rm\ erg\ s}^{-1}$, appears to stem from the cool core of a subcluster in Z8338, and likely was formed during the ongoing merger. This example suggests that X-ray cool cores can be displaced and eventually destroyed by mergers, while the displaced cool cores can survive for some extended period of time.
Submitted 1 August, 2023;
originally announced August 2023.
-
SoK: Privacy-Preserving Data Synthesis
Authors:
Yuzheng Hu,
Fan Wu,
Qinbin Li,
Yunhui Long,
Gonzalo Munilla Garrido,
Chang Ge,
Bolin Ding,
David Forsyth,
Bo Li,
Dawn Song
Abstract:
As the prevalence of data analysis grows, safeguarding data privacy has become a paramount concern. Consequently, there has been an upsurge in the development of mechanisms aimed at privacy-preserving data analyses. However, these approaches are task-specific; designing algorithms for new tasks is a cumbersome process. As an alternative, one can create synthetic data that is (ideally) devoid of private information. This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field. Specifically, we put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods. Under the master recipe, we further dissect the statistical methods into choices of modeling and representation, and investigate the DL-based methods by different generative modeling principles. To consolidate our findings, we provide comprehensive reference tables, distill key takeaways, and identify open problems in the existing literature. In doing so, we aim to answer the following questions: What are the design principles behind different PPDS methods? How can we categorize these methods, and what are the advantages and disadvantages associated with each category? Can we provide guidelines for method selection in different real-world scenarios? We proceed to benchmark several prominent DL-based methods on the task of private image synthesis and conclude that DP-MERF is an all-purpose approach. Finally, upon systematizing the work over the past decade, we identify future directions and call for actions from researchers.
Submitted 5 August, 2023; v1 submitted 5 July, 2023;
originally announced July 2023.
-
Optimized Live 4K Video Multicast
Authors:
Zhaoyuan He,
Changhan Ge,
Wangyang Li,
Lili Qiu,
Peijie Li,
Ghufran Baig
Abstract:
4K videos are becoming increasingly popular. However, despite advances in wireless technology, streaming 4K videos over mmWave to multiple users is facing significant challenges arising from directional communication, unpredictable channel fluctuation and high bandwidth requirements. This paper develops a novel 4K layered video multicast system. We (i) develop a video quality model for layered video coding, (ii) optimize resource allocation, scheduling, and beamforming based on the channel conditions of different users, and (iii) put forward a streaming strategy that uses fountain code to avoid redundancy across multicast groups and a Leaky-Bucket-based congestion control. We realize an end-to-end system on commodity-off-the-shelf (COTS) WiGig devices. We demonstrate the effectiveness of our system with extensive testbed experiments and emulation.
Submitted 22 July, 2023; v1 submitted 3 May, 2023;
originally announced May 2023.
-
PreNAS: Preferred One-Shot Learning Towards Efficient Neural Architecture Search
Authors:
Haibin Wang,
Ce Ge,
Hesen Chen,
Xiuyu Sun
Abstract:
The wide application of pre-trained models is driving the trend of once-for-all training in one-shot neural architecture search (NAS). However, training within a huge sample space damages the performance of individual subnets and requires much computation to search for an optimal model. In this paper, we present PreNAS, a search-free NAS approach that accentuates target models in one-shot training. Specifically, the sample space is dramatically reduced in advance by a zero-cost selector, and weight-sharing one-shot training is performed on the preferred architectures to alleviate update conflicts. Extensive experiments have demonstrated that PreNAS consistently outperforms state-of-the-art one-shot NAS competitors for both Vision Transformer and convolutional architectures, and importantly, enables instant specialization with zero search cost. Our code is available at https://github.com/tinyvision/PreNAS.
Submitted 16 June, 2023; v1 submitted 28 April, 2023;
originally announced April 2023.
-
MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation
Authors:
Chongjian Ge,
Junsong Chen,
Enze Xie,
Zhongdao Wang,
Lanqing Hong,
Huchuan Lu,
Zhenguo Li,
Ping Luo
Abstract:
Perception systems in modern autonomous driving vehicles typically take inputs from complementary multi-modal sensors, e.g., LiDAR and cameras. However, in real-world applications, sensor corruptions and failures lead to inferior performance, thus compromising autonomous safety. In this paper, we propose a robust framework, called MetaBEV, to address extreme real-world environments involving six sensor corruptions in total and two extreme sensor-missing situations. In MetaBEV, signals from multiple sensors are first processed by modal-specific encoders. Subsequently, a set of dense BEV queries, termed meta-BEV, is initialized. These queries are then processed iteratively by a BEV-Evolving decoder, which selectively aggregates deep features from LiDAR, cameras, or both modalities. The updated BEV representations are further leveraged for multiple 3D prediction tasks. Additionally, we introduce a new M2oE structure to alleviate the performance drop on distinct tasks in multi-task joint learning. Finally, MetaBEV is evaluated on the nuScenes dataset with 3D object detection and BEV map segmentation tasks. Experiments show MetaBEV outperforms prior art by a large margin on both full and corrupted modalities. For instance, when the LiDAR signal is missing, MetaBEV improves detection NDS by 35.5% and segmentation mIoU by 17.7% over the vanilla BEVFusion model; and when the camera signal is absent, MetaBEV still achieves 69.2% NDS and 53.7% mIoU, which is even higher than previous works that perform on full-modalities. Moreover, MetaBEV performs fairly against previous methods in both canonical perception and multi-task learning settings, refreshing state-of-the-art nuScenes BEV map segmentation with 70.4% mIoU.
Submitted 19 April, 2023;
originally announced April 2023.
-
DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving
Authors:
Tianqi Wang,
Sukmin Kim,
Wenxuan Ji,
Enze Xie,
Chongjian Ge,
Junsong Chen,
Zhenguo Li,
Ping Luo
Abstract:
Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with 40k annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms. Furthermore, for each scenario, we set four vehicles along with one infrastructure to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model.
Submitted 17 December, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning
Authors:
Chongjian Ge,
Jiangliu Wang,
Zhan Tong,
Shoufa Chen,
Yibing Song,
Ping Luo
Abstract:
Contrastive learning methods train visual encoders by comparing views from one instance to others. Typically, the views created from one instance are set as positive, while views from other instances are negative. This binary instance discrimination is studied extensively to improve feature representations in self-supervised learning. In this paper, we rethink the instance discrimination framework and find the binary instance labeling insufficient to measure correlations between different samples. As an intuitive example, given a random image instance, there may exist other images in a mini-batch whose content meanings are the same (i.e., belonging to the same category) or partially related (i.e., belonging to a similar category). How to treat images that are correlated with the current image instance remains an unexplored problem. We thus propose to support the current image by exploring other correlated instances (i.e., soft neighbors). We first carefully cultivate a candidate neighbor set, which will be further utilized to explore the highly-correlated instances. A cross-attention module is then introduced to predict the correlation score (denoted as positiveness) of other correlated instances with respect to the current one. The positiveness score quantitatively measures the positive support from each correlated instance, and is encoded into the objective for pretext training. As a result, our proposed method discriminates uncorrelated instances while absorbing correlated instances for SSL. We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation. The state-of-the-art recognition performance shows that SNCLR is effective in improving feature representations from both ViT and CNN encoders.
Submitted 30 March, 2023;
originally announced March 2023.
-
Revisiting the Chandra Observation on the Region of PSR J1809-1917: Indication of an X-ray Halo and Implication for the Origin of HESS J1809-193
Authors:
Chao-Ming Li,
Chong Ge,
Ruo-Yu Liu
Abstract:
HESS J1809-193 is an extended TeV $γ$-ray source and the origin of its $γ$-ray emission remains ambiguous. The pulsar wind nebula (PWN) of PSR J1809-1917, lying inside the extended $γ$-ray emission, is a possible candidate. Powered by the central pulsar, ultrarelativistic electrons in the PWN can produce radio to X-ray emission through synchrotron radiation and $γ$-ray emission by inverse Compton (IC) scattering. To check whether this PWN is the counterpart of HESS J1809-193, we analyze the Chandra X-ray radial intensity profile and the spectral index profile of this PWN. We then adopt a one-zone isotropic diffusion model to fit the keV and the TeV data. We find diffuse nonthermal X-ray emission extending beyond the PWN, which is likely an X-ray halo radiated by escaping electron/positron pairs from the PWN. A relatively strong magnetic field of $\sim 20\,μ$G is required to explain the spatial evolution of the X-ray spectrum (i.e., the significant softening of the spectrum with increasing distance from the pulsar), which, however, would suppress the IC radiation of pairs. Our result implies that a hadronic component may be needed to explain HESS J1809-193.
Submitted 29 March, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
-
Depth Super-Resolution from Explicit and Implicit High-Frequency Features
Authors:
Xin Qiao,
Chenyang Ge,
Youmin Zhang,
Yanhui Zhou,
Fabio Tosi,
Matteo Poggi,
Stefano Mattoccia
Abstract:
We propose a novel multi-stage depth super-resolution network, which progressively reconstructs high-resolution depth maps from explicit and implicit high-frequency features. The former are extracted by an efficient transformer processing both local and global contexts, while the latter are obtained by projecting color images into the frequency domain. Both are combined together with depth features by means of a fusion strategy within a multi-stage and multi-scale framework. Experiments on the main benchmarks, such as NYUv2, Middlebury, DIML and RGBDD, show that our approach outperforms existing methods by a large margin (~20% on NYUv2 and DIML against the contemporary work DADA, with 16x upsampling), establishing a new state-of-the-art in the guided depth super-resolution task.
Submitted 30 May, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Efficient and Low Overhead Website Fingerprinting Attacks and Defenses based on TCP/IP Traffic
Authors:
Guodong Huang,
Chuan Ma,
Ming Ding,
Yuwen Qian,
Chunpeng Ge,
Liming Fang,
Zhe Liu
Abstract:
Website fingerprinting is an extensively studied attack technique that analyzes the traffic patterns of a web browser and thus infers confidential information about users. Several website fingerprinting attacks based on machine learning and deep learning tend to use the most typical features to achieve a satisfactory attack success rate. However, these attacks are limited by several practical implementation constraints, such as the need for a skillful pre-processing step or a clean dataset. To defend against such attacks, random packet defense (RPD), which incurs a high cost of excessive network overhead, is usually applied. In this work, we first propose a practical filter-assisted attack against RPD, which can filter out the injected noise using the statistical characteristics of TCP/IP traffic. Then, we propose a list-assisted defensive mechanism to defend against the proposed attack method. To achieve a configurable trade-off between the defense and the network overhead, we further improve the list-based defense with a traffic splitting mechanism, which can combat the mentioned attacks as well as save a considerable amount of network overhead. In the experiments, we collect real-life traffic patterns using three mainstream browsers, i.e., Microsoft Edge, Google Chrome, and Mozilla Firefox, and extensive experiments conducted on the closed and open-world datasets show the effectiveness of the proposed algorithms in terms of defense accuracy and network efficiency.
Submitted 27 February, 2023;
originally announced February 2023.
-
TextDefense: Adversarial Text Detection based on Word Importance Entropy
Authors:
Lujia Shen,
Xuhong Zhang,
Shouling Ji,
Yuwen Pu,
Chunpeng Ge,
Xing Yang,
Yanghe Feng
Abstract:
Currently, natural language processing (NLP) models are widely used in various scenarios. However, NLP models, like all deep models, are vulnerable to adversarially generated text. Numerous works have sought to mitigate this vulnerability to adversarial attacks. Nevertheless, existing works offer no comprehensive defense: each one either targets a specific attack category or suffers from limitations such as computation overhead, lack of resistance to adaptive attacks, etc.
In this paper, we exhaustively investigate the adversarial attack algorithms in NLP, and our empirical studies show that the attack algorithms mainly disrupt the importance distribution of words in a text. A well-trained model can distinguish subtle differences in importance distribution between clean and adversarial texts. Based on this intuition, we propose TextDefense, a new adversarial example detection framework that utilizes the target model's capability to defend against adversarial attacks while requiring no prior knowledge. TextDefense differs from previous approaches in that it utilizes the target model for detection and is thus attack-type agnostic. Our extensive experiments show that TextDefense can be applied to different architectures, datasets, and attack methods, and that it outperforms existing methods. We also find that the leading factor influencing the performance of TextDefense is the target model's generalizability. By analyzing the properties of the target model and of the adversarial examples, we provide insights into adversarial attacks in NLP and the principles of our defense method.
Submitted 12 February, 2023;
originally announced February 2023.