Search | arXiv e-print repository

Showing 1–50 of 1,425 results for author: Huang, H

Searching in archive cs. Search in all archives.
  1. arXiv:2409.07388  [pdf, other]

    cs.CL

    Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective

    Authors: Guimin Hu, Yi Xin, Weimin Lyu, Haojian Huang, Chang Sun, Zhihong Zhu, Lin Gui, Ruichu Cai

    Abstract: Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in text-dominated multimodal affective computing field. This survey presents the recent trends of multimodal affective computing from NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conv… ▽ More

    Submitted 11 September, 2024; originally announced September 2024.

  2. arXiv:2409.07055  [pdf, other]

    cs.CL cs.AI cs.CY

    Legal Fact Prediction: Task Definition and Dataset Construction

    Authors: Junkai Liu, Yujie Tong, Hui Huang, Shuyuan Zheng, Muyun Yang, Peicheng Wu, Makoto Onizuka, Chuan Xiao

    Abstract: Legal facts refer to the facts that can be proven by acknowledged evidence in a trial. They form the basis for the determination of court judgments. This paper introduces a novel NLP task: legal fact prediction, which aims to predict the legal fact based on a list of evidence. The predicted facts can instruct the parties and their lawyers involved in a trial to strengthen their submissions and opt… ▽ More

    Submitted 11 September, 2024; originally announced September 2024.

  3. arXiv:2409.06656  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

    Authors: Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg

    Abstract: We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest err… ▽ More

    Submitted 10 September, 2024; originally announced September 2024.

  4. arXiv:2409.06633  [pdf, other]

    cs.CV

    SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

    Authors: Teng Hu, Jiangning Zhang, Ran Yi, Hongrui Huang, Yabiao Wang, Lizhuang Ma

    Abstract: In recent years, the development of diffusion models has led to significant progress in image and video generation tasks, with pre-trained models like the Stable Diffusion series playing a crucial role. Inspired by model pruning which lightens large pre-trained models by removing unimportant parameters, we propose a novel model fine-tuning method to make full use of these ineffective parameters an… ▽ More

    Submitted 10 September, 2024; originally announced September 2024.

    Comments: Parameter efficient finetuning method

  5. arXiv:2409.06270  [pdf, other]

    cs.LG cs.AI cs.CV

    Towards Robust Uncertainty-Aware Incomplete Multi-View Classification

    Authors: Mulin Chen, Haojian Huang, Qiang Li

    Abstract: Handling incomplete data in multi-view classification is challenging, especially when traditional imputation methods introduce biases that compromise uncertainty estimation. Existing Evidential Deep Learning (EDL) based approaches attempt to address these issues, but they often struggle with conflicting evidence due to the limitations of the Dempster-Shafer combination rule, leading to unreliable… ▽ More

    Submitted 10 September, 2024; originally announced September 2024.

    Comments: Ongoing work: 9 pages, 6 figures, 2 tables

  6. arXiv:2409.05929  [pdf, other]

    cs.LG cs.AI

    Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models

    Authors: Hongyang Lei, Xiaolong Cheng, Dan Wang, Qi Qin, Huazhen Huang, Yetao Wu, Qingqing Gu, Zhonglin Jiang, Yong Chen, Luo Ji

    Abstract: Recent Large Multi-Modal Models (LMMs) have made significant advancements in multi-modal alignment by employing lightweight connection modules to facilitate the representation and fusion of knowledge from existing pre-trained uni-modal models. However, these methods still rely on modality-specific and direction-specific connectors, leading to compartmentalized knowledge representations and reduced… ▽ More

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: work in progress

  7. arXiv:2409.05587  [pdf, other]

    cs.CV

    DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification

    Authors: Junzhou Chen, Zirui Zhang, Jing Yu, Heqiang Huang, Ronghui Zhang, Xuemiao Xu, Bin Sheng, Hong Yan

    Abstract: Driver distraction remains a leading cause of traffic accidents, posing a critical threat to road safety globally. As intelligent transportation systems evolve, accurate and real-time identification of driver distraction has become essential. However, existing methods struggle to capture both global contextual and fine-grained local features while contending with noisy labels in training datasets.… ▽ More

    Submitted 12 September, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

  8. arXiv:2409.05294  [pdf, other]

    cs.CR cs.AI cs.LG

    TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors

    Authors: Yichuan Mo, Hui Huang, Mingjie Li, Ang Li, Yisen Wang

    Abstract: Diffusion models have achieved notable success in image generation, but they remain highly vulnerable to backdoor attacks, which compromise their integrity by producing specific undesirable outputs when presented with a pre-defined trigger. In this paper, we investigate how to protect diffusion models from this dangerous threat. Specifically, we propose TERD, a backdoor defense framework that buil… ▽ More

    Submitted 8 September, 2024; originally announced September 2024.

    Journal ref: International Conference on Machine Learning 2024

  9. arXiv:2409.03684  [pdf, ps, other]

    quant-ph cs.DS cs.LG

    Predicting quantum channels over general product distributions

    Authors: Sitan Chen, Jaume de Dios Pont, Jun-Ting Hsieh, Hsin-Yuan Huang, Jane Lange, Jerry Li

    Abstract: We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $E$ and an observable $O$, we aim to learn the mapping \begin{equation*} \rho \mapsto \mathrm{Tr}(O E[\rho]) \end{equation*} to within a small error for most $\rho$ sampled from a distribution $D$. Previously, Huang, Chen, and Preskill proved a surprising result that even if… ▽ More

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: 20 pages, comments welcome

  10. arXiv:2409.03326  [pdf, other]

    cs.CV

    Enhancing User-Centric Privacy Protection: An Interactive Framework through Diffusion Models and Machine Unlearning

    Authors: Huaxi Huang, Xin Yuan, Qiyu Liao, Dadong Wang, Tongliang Liu

    Abstract: In the realm of multimedia data analysis, the extensive use of image datasets has escalated concerns over privacy protection within such data. Current research predominantly focuses on privacy protection either in data sharing or upon the release of trained machine learning models. Our study pioneers a comprehensive privacy protection framework that safeguards image data privacy concurrently durin… ▽ More

    Submitted 5 September, 2024; originally announced September 2024.

  11. arXiv:2409.03271  [pdf, other]

    cs.AI cs.CL cs.HC

    Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

    Authors: Yu Wang, Shiwan Zhao, Zhihu Wang, Heyuan Huang, Ming Fan, Yubo Zhang, Zhixing Wang, Haijun Wang, Ting Liu

    Abstract: The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge,… ▽ More

    Submitted 5 September, 2024; originally announced September 2024.

  12. arXiv:2409.02050  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model

    Authors: Hukai Huang, Jiayan Lin, Kaidi Wang, Yishuang Li, Wenhao Guan, Lin Li, Qingyang Hong

    Abstract: Due to the inherent difficulty in modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes a Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. Initially, a preceding routing network explicitly learns Language Identification (LID) tasks and selects… ▽ More

    Submitted 5 September, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted by IEEE SLT 2024

  13. arXiv:2409.01726  [pdf, other]

    cs.CV

    Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

    Authors: Qi Zhang, Kaiyi Zhang, Antoni B. Chan, Hui Huang

    Abstract: Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods usually estimate the crowd density maps on the ground plane first, and then obtain the crowd locations. However, the performance of existing methods is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate the weakness of density map s… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: ECCV 2024

  14. arXiv:2409.01706  [pdf, other]

    quant-ph cs.CC math-ph

    Classically estimating observables of noiseless quantum circuits

    Authors: Armando Angrisani, Alexander Schmidhuber, Manuel S. Rudolph, M. Cerezo, Zoë Holmes, Hsin-Yuan Huang

    Abstract: We present a classical algorithm for estimating expectation values of arbitrary observables on most quantum circuits across all circuit architectures and depths, including those with all-to-all connectivity. We prove that for any architecture where each circuit layer is equipped with a measure invariant under single-qubit rotations, our algorithm achieves a small error $\varepsilon$ on all circuit… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Main text: 8 pages, 3 figures. Appendices: 25 pages, 1 figure

    Report number: LA-UR-24-29028

  15. arXiv:2409.01605  [pdf, other]

    cs.IR cs.AI

    Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information

    Authors: Xinyu Zhang, Linmei Hu, Luhao Zhang, Dandan Song, Heyan Huang, Liqiang Nie

    Abstract: Sequential recommender systems are essential for discerning user preferences from historical interactions and facilitating targeted recommendations. Recent innovations employing Large Language Models (LLMs) have advanced the field by encoding item semantics, yet they often necessitate substantial parameter tuning and are resource-demanding. Moreover, these works fails to consider the diverse chara… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: 11 pages, 4 figures

  16. arXiv:2409.01438  [pdf, other]

    eess.AS cs.SD

    Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

    Authors: Weiqing Wang, Kunal Dhawan, Taejin Park, Krishna C. Puvvada, Ivan Medennikov, Somshubra Majumdar, He Huang, Jagadeesh Balam, Boris Ginsburg

    Abstract: Speech foundation models have achieved state-of-the-art (SoTA) performance across various tasks, such as automatic speech recognition (ASR) in hundreds of languages. However, multi-speaker ASR remains a challenging task for these models due to data scarcity and sparsity. In this paper, we present approaches to enable speech foundation models to process and understand multi-speaker speech with limi… ▽ More

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT 2024

  17. arXiv:2409.00755  [pdf, other]

    cs.CV cs.AI

    Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification

    Authors: Haojian Huang, Chuanyu Qin, Zhe Liu, Kaijing Ma, Jin Chen, Han Fang, Chao Ban, Hao Sun, Zhongjiang He

    Abstract: Multi-view classification (MVC) faces inherent challenges due to domain gaps and inconsistencies across different views, often resulting in uncertainties during the fusion process. While Evidential Deep Learning (EDL) has been effective in addressing view uncertainty, existing methods predominantly rely on the Dempster-Shafer combination rule, which is sensitive to conflicting evidence and often n… ▽ More

    Submitted 1 September, 2024; originally announced September 2024.

    Comments: Ongoing work: 13pages, 13figures, 12 tables

  18. arXiv:2409.00597  [pdf, other]

    cs.MM cs.CL

    Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model

    Authors: Fuqiang Niu, Zebang Cheng, Xianghua Fu, Xiaojiang Peng, Genan Dai, Yin Chen, Hu Huang, Bowen Zhang

    Abstract: Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content including text, and images multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pa… ▽ More

    Submitted 31 August, 2024; originally announced September 2024.

    Comments: ACM MM2024

  19. arXiv:2409.00349  [pdf, other]

    cs.CV

    ToddlerAct: A Toddler Action Recognition Dataset for Gross Motor Development Assessment

    Authors: Hsiang-Wei Huang, Jiacheng Sun, Cheng-Yen Yang, Zhongyu Jiang, Li-Yu Huang, Jenq-Neng Hwang, Yu-Ching Yeh

    Abstract: Assessing gross motor development in toddlers is crucial for understanding their physical development and identifying potential developmental delays or disorders. However, existing datasets for action recognition primarily focus on adults, lacking the diversity and specificity required for accurate assessment in toddlers. In this paper, we present ToddlerAct, a toddler gross motor action recogniti… ▽ More

    Submitted 31 August, 2024; originally announced September 2024.

    Comments: Accepted by 2024 ECCV ABAW Workshop

  20. arXiv:2408.16272  [pdf, other]

    cs.CV cs.AI

    Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding

    Authors: Kaijing Ma, Haojian Huang, Jin Chen, Haodong Chen, Pengliang Ji, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, Xuelong Li

    Abstract: Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions for noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network… ▽ More

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: Ongoing work: 28pages, 19 figures, 7 tables. Code is available at: https://kaijing.space/SRAM/

  21. arXiv:2408.15980  [pdf, other]

    cs.RO cs.AI

    In-Context Imitation Learning via Next-Token Prediction

    Authors: Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, Ken Goldberg

    Abstract: We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor traj… ▽ More

    Submitted 28 August, 2024; originally announced August 2024.

  22. arXiv:2408.13123  [pdf, other]

    cs.CV

    Evidential Deep Partial Multi-View Classification With Discount Fusion

    Authors: Haojian Huang, Zhe Liu, Sukumar Letchmunan, Muhammet Deveci, Mingwei Lin, Weizhong Wang

    Abstract: Incomplete multi-view data classification poses significant challenges due to the common issue of missing views in real-world scenarios. Despite advancements, existing methods often fail to provide reliable predictions, largely due to the uncertainty of missing views and the inconsistent quality of imputed data. To tackle these problems, we propose a novel framework called Evidential Deep Partial… ▽ More

    Submitted 30 August, 2024; v1 submitted 23 August, 2024; originally announced August 2024.

    Comments: Ongoing work. 13 pages, 3 figures, 6 tables

  23. arXiv:2408.13106  [pdf, other]

    cs.SD eess.AS

    NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

    Authors: He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg

    Abstract: Self-supervised learning has been proved to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification and diarization, etc. However, most of current approaches are computationally expensive due to lacking in sub-sampling or using clustering based speech quantization. In this paper, we propose a simplified and more efficient self-supervised learni… ▽ More

    Submitted 4 September, 2024; v1 submitted 23 August, 2024; originally announced August 2024.

    Comments: work in progress

  24. arXiv:2408.12798  [pdf, other]

    cs.AI

    BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

    Authors: Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun

    Abstract: Generative Large Language Models (LLMs) have made significant strides across various tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt cause the LLM to generate adversary-desired responses. While most backdoor research has focused on vision or text classification tasks, backdoor attacks in text generation have been largely overlooked. In this work, we int… ▽ More

    Submitted 22 August, 2024; originally announced August 2024.

  25. arXiv:2408.10453  [pdf, other]

    cs.CV cs.GR cs.MM

    Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

    Authors: Liu He, Yizhi Song, Hejun Huang, Daniel Aliaga, Xin Zhou

    Abstract: Text-to-video generation has been dominated by end-to-end diffusion-based or autoregressive models. On one hand, those novel models provide plausible versatility, but they are criticized for physical correctness, shading and illumination, camera motion, and temporal consistency. On the other hand, film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software.… ▽ More

    Submitted 19 August, 2024; originally announced August 2024.

  26. arXiv:2408.09172   

    cs.AI cs.CL

    Unc-TTP: A Method for Classifying LLM Uncertainty to Improve In-Context Example Selection

    Authors: Hsiu-Yuan Huang, Zichen Wu, Yutong Yang, Junzhao Zhang, Yunfang Wu

    Abstract: Nowadays, Large Language Models (LLMs) have demonstrated exceptional performance across various downstream tasks. However, it is challenging for users to discern whether the responses are generated with certainty or are fabricated to meet user expectations. Estimating the uncertainty of LLMs is particularly challenging due to their vast scale and the lack of white-box access. In this work, we prop… ▽ More

    Submitted 24 August, 2024; v1 submitted 17 August, 2024; originally announced August 2024.

    Comments: The model diagram in Figure 1 on page 3 of the paper has significant ambiguities. It may lead readers to mistakenly believe that the experiments were conducted in a multi-turn dialogue format. Therefore, we request the withdrawal of this submission

  27. arXiv:2408.07009  [pdf, other]

    cs.CV

    Imagen 3

    Authors: Imagen-Team-Google: Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Peter Igwe, Christos Kaplanis, Siavash Khodadadeh, et al. (227 additional authors not shown)

    Abstract: We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.

    Submitted 13 August, 2024; originally announced August 2024.

  28. arXiv:2408.06904  [pdf, other]

    cs.CL

    Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives

    Authors: Zhihu Wang, Shiwan Zhao, Yu Wang, Heyuan Huang, Jiaxin Shi, Sitao Xie, Zhixing Wang, Yubo Zhang, Hongyan Li, Junchi Yan

    Abstract: As large language models (LLMs) continue to scale, their enhanced performance often proves insufficient for solving domain-specific tasks. Systematically analyzing their failures and effectively enhancing their performance remain significant challenges. This paper introduces the Re-TASK framework, a novel theoretical model that Revisits LLM Tasks from cApability, Skill, Knowledge perspectives, gui… ▽ More

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Work in Progress

  29. arXiv:2408.06614  [pdf, other]

    cs.CV cs.MM

    ViMo: Generating Motions from Casual Videos

    Authors: Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

    Abstract: Although humans have the innate ability to imagine multiple possible actions from videos, it remains an extraordinary challenge for computers due to the intricate camera movements and montages. Most existing motion generation methods predominantly rely on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or Multi-View cameras, unavoidably resulting i… ▽ More

    Submitted 12 August, 2024; originally announced August 2024.

    MSC Class: 68Txx

  30. arXiv:2408.05694  [pdf, other]

    cs.CR

    ICSFuzz: Collision Detector Bug Discovery in Autonomous Driving Simulators

    Authors: Weiwei Fu, Heqing Huang, Yifan Zhang, Ke Zhang, Jin Huang, Wei-Bin Lee, Jianping Wang

    Abstract: With the increasing adoption of autonomous vehicles, ensuring the reliability of autonomous driving systems (ADSs) deployed on autonomous vehicles has become a significant concern. Driving simulators have emerged as crucial platforms for testing autonomous driving systems, offering realistic, dynamic, and configurable environments. However, existing simulation-based ADS testers have largely overlo… ▽ More

    Submitted 11 August, 2024; originally announced August 2024.

  31. ZePo: Zero-Shot Portrait Stylization with Faster Sampling

    Authors: Jin Liu, Huaibo Huang, Jie Cao, Ran He

    Abstract: Diffusion-based text-to-image generation models have significantly advanced the field of art content synthesis. However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper p… ▽ More

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: Accepted by ACM MM 2024

  32. arXiv:2408.03475  [pdf, other]

    cs.LG cs.AI

    Can LLMs Serve As Time Series Anomaly Detectors?

    Authors: Manqing Dong, Hao Huang, Longbing Cao

    Abstract: An emerging topic in large language models (LLMs) is their application to time series forecasting, characterizing mainstream and patternable characteristics of time series. A relevant but rarely explored and more challenging question is whether LLMs can detect and explain time series anomalies, a critical task across various real-world applications. In this paper, we investigate the capabilities o… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

  33. arXiv:2408.02288  [pdf, other]

    cond-mat.dis-nn cond-mat.stat-mech cs.AI cs.CL

    Spin glass model of in-context learning

    Authors: Yuhao Li, Ruoran Bai, Haiping Huang

    Abstract: Large language models show a surprising in-context learning ability -- being able to use a prompt to form a prediction for a query, yet without additional training, in stark contrast to old-fashioned supervised learning. Providing a mechanistic interpretation and linking the empirical phenomenon to physics are thus challenging and remain unsolved. We study a simple yet expressive transformer with… ▽ More

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: 8 pages, 4 figures

  34. arXiv:2408.01705  [pdf, other]

    cs.CV cs.AI

    Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

    Authors: Weijie Zheng, Xingjun Ma, Hanxun Huang, Zuxuan Wu, Yu-Gang Jiang

    Abstract: With the advancement of vision transformers (ViTs) and self-supervised learning (SSL) techniques, pre-trained large ViTs have become the new foundation models for computer vision applications. However, studies have shown that, like convolutional neural networks (CNNs), ViTs are also susceptible to adversarial attacks, where subtle perturbations in the input can fool the model into making false pre… ▽ More

    Submitted 3 August, 2024; originally announced August 2024.

  35. arXiv:2408.01332  [pdf, other]

    cs.LG

    HMDN: Hierarchical Multi-Distribution Network for Click-Through Rate Prediction

    Authors: Xingyu Lou, Yu Yang, Kuiyao Dong, Heyuan Huang, Wenyi Yu, Ping Wang, Xiu Li, Jun Wang

    Abstract: As the recommendation service needs to address increasingly diverse distributions, such as multi-population, multi-scenario, multitarget, and multi-interest, more and more recent works have focused on multi-distribution modeling and achieved great progress. However, most of them only consider modeling in a single multi-distribution manner, ignoring that mixed multi-distributions often coexist and… ▽ More

    Submitted 2 August, 2024; originally announced August 2024.

  36. arXiv:2408.00083  [pdf, other]

    cs.CV

    Localized Gaussian Splatting Editing with Contextual Awareness

    Authors: Hanyuan Xiao, Yingshu Chen, Huajian Huang, Haolin Xiong, Jing Yang, Pratusha Prasad, Yajie Zhao

    Abstract: Recent text-guided generation of individual 3D object has achieved great success using diffusion priors. However, these methods are not suitable for object insertion and replacement tasks as they do not consider the background, leading to illumination mismatches within the environment. To bridge the gap, we introduce an illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting (3DGS)… ▽ More

    Submitted 31 July, 2024; originally announced August 2024.

  37. arXiv:2407.21735  [pdf, other]

    cs.CV

    Unifying Event-based Flow, Stereo and Depth Estimation via Feature Similarity Matching

    Authors: Pengjie Zhang, Lin Zhu, Lizhi Wang, Hua Huang

    Abstract: As an emerging vision sensor, the event camera has gained popularity in various vision tasks such as optical flow estimation, stereo matching, and depth estimation due to its high-speed, sparse, and asynchronous event streams. Unlike traditional approaches that use specialized architectures for each specific task, we propose a unified framework, EventMatch, that reformulates these tasks as an even… ▽ More

    Submitted 31 July, 2024; originally announced July 2024.

  38. arXiv:2407.21075  [pdf, other]

    cs.AI cs.CL cs.LG

    Apple Intelligence Foundation Language Models

    Authors: Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, et al. (130 additional authors not shown)

    Abstract: We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used… ▽ More

    Submitted 29 July, 2024; originally announced July 2024.

  39. arXiv:2407.20647  [pdf, other]

    cs.CV

    Image Re-Identification: Where Self-supervision Meets Vision-Language Learning

    Authors: Bin Wang, Yuying Liang, Lei Cai, Huakun Huang, Huanqiang Zeng

    Abstract: Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid in the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate the image ReID. We obs… ▽ More

    Submitted 30 July, 2024; originally announced July 2024.

  40. arXiv:2407.19813  [pdf, other]

    cs.CL cs.AI

    Improving Retrieval Augmented Language Model with Self-Reasoning

    Authors: Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, Haifeng Huang

    Abstract: The Retrieval-Augmented Language Model (RALM) has shown remarkable performance on knowledge-intensive tasks by incorporating external knowledge during inference, which mitigates the factual hallucinations inherited in large language models (LLMs). Despite these advancements, challenges persist in the implementation of RALMs, particularly concerning their reliability and traceability. To be specifi… ▽ More

    Submitted 2 August, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

  41. arXiv:2407.19524  [pdf, other]

    cs.CV cs.AI

    VersusDebias: Universal Zero-Shot Debiasing for Text-to-Image Models via SLM-Based Prompt Engineering and Generative Adversary

    Authors: Hanjun Luo, Ziye Deng, Haoyu Huang, Xuecheng Liu, Ruizhe Chen, Zuozhu Liu

    Abstract: With the rapid development of Text-to-Image (T2I) models, biases in human image generation against demographic social groups become a significant concern, impacting fairness and ethical standards in AI. Some researchers propose their methods to tackle with the issue. However, existing methods are designed for specific models with fixed prompts, limiting their adaptability to the fast-evolving mode… ▽ More

    Submitted 16 August, 2024; v1 submitted 28 July, 2024; originally announced July 2024.

  42. arXiv:2407.19208  [pdf, other]

    cs.GR

    WindPoly: Polygonal Mesh Reconstruction via Winding Numbers

    Authors: Xin He, Chenlei Lv, Pengdi Huang, Hui Huang

    Abstract: Polygonal mesh reconstruction of a raw point cloud is a valuable topic in the field of computer graphics and 3D vision. Especially to 3D architectural models, polygonal mesh provides concise expressions for fundamental geometric structures while effectively reducing data volume. However, there are some limitations of traditional reconstruction methods: normal vector dependency, noisy points and de…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: European Conference on Computer Vision (Proceedings of ECCV 2024)

  43. arXiv:2407.18843  [pdf]

    cs.RO physics.bio-ph physics.flu-dyn

    Morphing median fin enhances untethered bionic robotic tuna's linear acceleration and turning maneuverability

    Authors: Hongbin Huang, Zhonglu Lin, Wei Zheng, Jinhu Zhang, Zhibin Liu, Wei Zhou, Yu Zhang

    Abstract: Median fins of fish-like swimmers play a crucial role in linear acceleration and maneuvering processes. However, few research focused on untethered robotic fish experiments. Imitating the behaviour of real tuna, we developed a free-swimming bionic tuna with a foldable dorsal fin. The erection of dorsal fin, at proper conditions, can reduce head heave by 50%, enhance linear acceleration by 15.7%, i…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: 7 pages, 5 figures

  44. arXiv:2407.18581  [pdf, other]

    cs.CL cs.AI

    Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing

    Authors: Hukai Huang, Shenghui Lu, Yahui Shan, He Qu, Wenhao Guan, Qingyang Hong, Lin Li

    Abstract: The Mixture of Experts (MoE) approach is well-suited for multilingual and code-switching (CS) tasks due to its multi-expert architecture. This work introduces the DLG-MoE, a Dynamic Language Group-based MoE optimized for bilingual and CS scenarios. DLG-MoE operates based on a hierarchical routing mechanism. First, the language router explicitly models the language and dispatches the representation…

    Submitted 7 August, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

  45. arXiv:2407.17691  [pdf, other]

    cs.NI eess.SY

    System-Level Simulation Framework for NB-IoT: Key Features and Performance Evaluation

    Authors: Shutao Zhang, Wenkun Wen, Peiran Wu, Hongqing Huang, Liya Zhu, Yijia Guo, Tingting Yang, Minghua Xia

    Abstract: Narrowband Internet of Things (NB-IoT) is a technology specifically designated by the 3rd Generation Partnership Project (3GPP) to meet the explosive demand for massive machine-type communications (mMTC), and it is evolving to RedCap. Industrial companies have increasingly adopted NB-IoT as the solution for mMTC due to its lightweight design and comprehensive technical specifications released by 3…

    Submitted 13 August, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

  46. arXiv:2407.17072  [pdf, other]

    stat.ML cs.LG

    An Efficient Procedure for Computing Bayesian Network Structure Learning

    Authors: Hongming Huang, Joe Suzuki

    Abstract: We propose a globally optimal Bayesian network structure discovery algorithm based on a progressively leveled scoring approach. Bayesian network structure discovery is a fundamental yet NP-hard problem in the field of probabilistic graphical models, and as the number of variables increases, memory usage grows exponentially. The simple and effective method proposed by Silander and Myllymäki has bee…

    Submitted 24 July, 2024; originally announced July 2024.

  47. arXiv:2407.17030  [pdf, other]

    cs.NI

    Applications of Multi-Agent Deep Reinforcement Learning Communication in Network Management: A Survey

    Authors: Yue Pi, Wang Zhang, Yong Zhang, Hairong Huang, Baoquan Rao, Yulong Ding, Shuanghua Yang

    Abstract: With the advancement of artificial intelligence technology, the automation of network management, also known as Autonomous Driving Networks (ADN), is gaining widespread attention. The network management has shifted from traditional homogeneity and centralization to heterogeneity and decentralization. Multi-agent deep reinforcement learning (MADRL) allows agents to make decisions based on local obs…

    Submitted 24 July, 2024; originally announced July 2024.

  48. arXiv:2407.15240  [pdf, other]

    cs.CV

    BIGbench: A Unified Benchmark for Social Bias in Text-to-Image Generative Models Based on Multi-modal LLM

    Authors: Hanjun Luo, Haoyu Huang, Ziye Deng, Xuecheng Liu, Ruizhe Chen, Zuozhu Liu

    Abstract: Text-to-Image (T2I) generative models are becoming increasingly crucial due to their ability to generate high-quality images, which also raises concerns about the social biases in their outputs, especially in the human generation. Sociological research has established systematic classifications of bias. However, existing bias research about T2I models conflates different types of bias, impeding me…

    Submitted 16 August, 2024; v1 submitted 21 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2405.17814

  49. arXiv:2407.13937  [pdf, other]

    cs.CV

    Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

    Authors: Sheng-Yao Kuan, Jen-Hao Cheng, Hsiang-Wei Huang, Wenhao Chai, Cheng-Yen Yang, Hugo Latapie, Gaowen Liu, Bing-Fei Wu, Jenq-Neng Hwang

    Abstract: In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality detectors through sensor fusion remains an active challenge. This work leverages the respective advantages of cameras in perspective view and radars in Bird's E…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 2024 IEEE Intelligent Vehicles Symposium (IV)

  50. arXiv:2407.13930  [pdf, other]

    cs.CV cs.AI eess.SP

    RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark

    Authors: Yuan-Hao Ho, Jen-Hao Cheng, Sheng Yao Kuan, Zhongyu Jiang, Wenhao Chai, Hsiang-Wei Huang, Chih-Lung Lin, Jenq-Neng Hwang

    Abstract: Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy-preserving, rendering the method m…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV 2024