Search | arXiv e-print repository

Open-domain Implicit Format Control for Large Language Model Generation

Authors: Yiqun Yao, Wenjia Ma, Xuezhi Fang, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang

Abstract: Controlling the format of outputs generated by large language models (LLMs) is a critical functionality in various applications. Current methods typically employ constrained decoding with rule-based automata or fine-tuning with manually crafted format instructions, both of which struggle with open-domain format requirements. To address this limitation, we introduce a novel framework for controlled… ▽ More Controlling the format of outputs generated by large language models (LLMs) is a critical functionality in various applications. Current methods typically employ constrained decoding with rule-based automata or fine-tuning with manually crafted format instructions, both of which struggle with open-domain format requirements. To address this limitation, we introduce a novel framework for controlled generation in LLMs, leveraging user-provided, one-shot QA pairs. This study investigates LLMs' capabilities to follow open-domain, one-shot constraints and replicate the format of the example answers. We observe that this is a non-trivial problem for current LLMs. We also develop a dataset collection methodology for supervised fine-tuning that enhances the open-domain format control of LLMs without degrading output quality, as well as a benchmark on which we evaluate both the helpfulness and format correctness of LLM outputs. The resulting datasets, named OIFC-SFT, along with the related code, will be made publicly available at https://github.com/cofe-ai/OIFC. △ Less

Submitted 8 August, 2024; originally announced August 2024.

Comments: 6 pages

arXiv:2408.03025 [pdf, other]

The Crowd in MOOCs: A Study of Learning Patterns at Scale

Authors: Xin Zhou, Aixin Sun, Jie Zhang, Donghui Lin

Abstract: The increasing availability of learning activity data in Massive Open Online Courses (MOOCs) enables us to conduct a large-scale analysis of learners' learning behavior. In this paper, we analyze a dataset of 351 million learning activities from 0.8 million unique learners enrolled in over 1.6 thousand courses within two years. Specifically, we mine and identify the learning patterns of the crowd… ▽ More The increasing availability of learning activity data in Massive Open Online Courses (MOOCs) enables us to conduct a large-scale analysis of learners' learning behavior. In this paper, we analyze a dataset of 351 million learning activities from 0.8 million unique learners enrolled in over 1.6 thousand courses within two years. Specifically, we mine and identify the learning patterns of the crowd from both temporal and course enrollment perspectives leveraging mutual information theory and sequential pattern mining methods. From the temporal perspective, we find that the time intervals between consecutive learning activities of learners exhibit a mix of power-law and periodic cosine function distribution. By qualifying the relationship between course pairs, we observe that the most frequently co-enrolled courses usually fall in the same category or the same university. We demonstrate these findings can facilitate manifold applications including recommendation tasks on courses. A simple recommendation model utilizing the course enrollment patterns is competitive to the baselines with 200$\times$ faster training time. △ Less

Submitted 6 August, 2024; originally announced August 2024.

Comments: 16 pages

arXiv:2408.01027 [pdf, ps, other]

Randomized Strategyproof Mechanisms with Best of Both Worlds Fairness and Efficiency

Authors: Ankang Sun, Bo Chen

Abstract: We study the problem of mechanism design for allocating a set of indivisible items among agents with private preferences on items. We are interested in such a mechanism that is strategyproof (where agents' best strategy is to report their true preferences) and is expected to ensure fairness and efficiency to a certain degree. We first present an impossibility result that a deterministic mechanism… ▽ More We study the problem of mechanism design for allocating a set of indivisible items among agents with private preferences on items. We are interested in such a mechanism that is strategyproof (where agents' best strategy is to report their true preferences) and is expected to ensure fairness and efficiency to a certain degree. We first present an impossibility result that a deterministic mechanism does not exist that is strategyproof, fair and efficient for allocating indivisible chores. We then utilize randomness to overcome the strong impossibility. For allocating indivisible chores, we propose a randomized mechanism that is strategyproof in expectation as well as ex-ante and ex-post (best of both worlds) fair and efficient. For allocating mixed items, where an item can be a good (i.e., with a positive utility) for one agent but a chore (i.e., a with negative utility) for another, we propose a randomized mechanism that is strategyproof in expectation with best of both worlds fairness and efficiency when there are two agents. △ Less

Submitted 2 August, 2024; originally announced August 2024.

Comments: 27 pages

ACM Class: F.2.2

arXiv:2407.16891 [pdf, other]

Cultural Value Differences of LLMs: Prompt, Language, and Model Size

Authors: Qishuai Zhong, Yike Yun, Aixin Sun

Abstract: Our study aims to identify behavior patterns in cultural values exhibited by large language models (LLMs). The studied variants include question ordering, prompting language, and model size. Our experiments reveal that each tested LLM can efficiently behave with different cultural values. More interestingly: (i) LLMs exhibit relatively consistent cultural values when presented with prompts in a si… ▽ More Our study aims to identify behavior patterns in cultural values exhibited by large language models (LLMs). The studied variants include question ordering, prompting language, and model size. Our experiments reveal that each tested LLM can efficiently behave with different cultural values. More interestingly: (i) LLMs exhibit relatively consistent cultural values when presented with prompts in a single language. (ii) The prompting language e.g., Chinese or English, can influence the expression of cultural values. The same question can elicit divergent cultural values when the same LLM is queried in a different language. (iii) Differences in sizes of the same model (e.g., Llama2-7B vs 13B vs 70B) have a more significant impact on their demonstrated cultural values than model differences (e.g., Llama2 vs Mixtral). Our experiments reveal that query language and model size of LLM are the main factors resulting in cultural value differences. △ Less

Submitted 17 June, 2024; originally announced July 2024.

Comments: 20 pages

arXiv:2407.14928 [pdf, other]

Influencer: Empowering Everyday Users in Creating Promotional Posts via AI-infused Exploration and Customization

Authors: Xuye Liu, Annie Sun, Pengcheng An, Tengfei Ma, Jian Zhao

Abstract: Creating promotional posts on social platforms enables everyday users to disseminate their creative outcomes, engage in community exchanges, or generate additional income from micro-businesses. However, creating eye-catching posts combining both original, appealing images and articulate, effective captions can be rather challenging and time-consuming for everyday users who are mostly design novice… ▽ More Creating promotional posts on social platforms enables everyday users to disseminate their creative outcomes, engage in community exchanges, or generate additional income from micro-businesses. However, creating eye-catching posts combining both original, appealing images and articulate, effective captions can be rather challenging and time-consuming for everyday users who are mostly design novices. We propose Influen, an interactive tool to assist novice creators in crafting high-quality promotional post designs, achieving quick design ideation and unencumbered content creation through AI. Within Influencer, we contribute a multi-dimensional recommendation framework that allows users to intuitively generate new ideas through example-based image and caption recommendation. Further, Influencer implements a holistic promotional post design system that supports context-aware image and caption exploration considering brand messages and user-specified design constraints, flexible fusion of various images and captions, and a mind-map-like layout for thinking tracking and post-recording. We evaluated Influencer with 12 design enthusiasts through an in-lab user study by comparing it to a baseline combining Google Search + Figma. Quantitative and qualitative results demonstrate that \sysname{} is effective in assisting design novices to generate ideas as well as creative and diverse promotional posts with user-friendly interaction. △ Less

Submitted 20 July, 2024; originally announced July 2024.

Comments: 18 pages

arXiv:2407.06597 [pdf, other]

TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries

Authors: Renjie Liang, Li Li, Chongzhi Zhang, Jing Wang, Xizhou Zhu, Aixin Sun

Abstract: In this paper, we propose the task of \textit{Ranked Video Moment Retrieval} (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we dev… ▽ More In this paper, we propose the task of \textit{Ranked Video Moment Retrieval} (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the $NDCG@K, IoU\geq μみゅー$ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at \url{https://github.com/Ranking-VMR/TVR-Ranking} △ Less

Submitted 23 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.02783 [pdf, ps, other]

52B to 1T: Lessons Learned via Tele-FLM Series

Authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

Abstract: Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion… ▽ More Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the "less is more" approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: For the Tele-FLM-52B tech report, see also 2404.16645

arXiv:2407.01523 [pdf, other]

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

Abstract: Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark co… ▽ More Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc △ Less

Submitted 10 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00316 [pdf, other]

OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Authors: Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, Ehsan Adeli

Abstract: Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion mod… ▽ More Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.15501 [pdf]

Secure Combination of Untrusted Time information Based on Optimized Dempster-Shafer Theory

Authors: Yang Li, Yujie Luo, Yichen Zhang, Ao Sun, Wei Huang, Shuai Zhang, Tao Zhang, Chuang Zhou, Li Ma, Jie Yang, Mei Wu, Heng Wang, Yan Pan, Yun Shao, Xing Chen, Ziyang Chen, Song Yu, Hong Guo, Bingjie Xu

Abstract: Secure precision time synchronization is important for applications of Cyber-Physical Systems. However, several attacks, especially the Time Delay Attack (TDA), deteriorates the performance of time synchronization system seriously. Multiple paths scheme is thought as an effective security countermeasure to decrease the influence of TDA. However, the effective secure combination algorithm is still… ▽ More Secure precision time synchronization is important for applications of Cyber-Physical Systems. However, several attacks, especially the Time Delay Attack (TDA), deteriorates the performance of time synchronization system seriously. Multiple paths scheme is thought as an effective security countermeasure to decrease the influence of TDA. However, the effective secure combination algorithm is still missed for precision time synchronization. In this paper, a secure combination algorithm based on Dempster-Shafer theory is proposed for multiple paths method. Special optimizations are done for the combination algorithm to solve the potential problems due to untrusted evidence. Theoretical simulation shows that the proposed algorithm works much better than Fault Tolerant Algorithm (FTA) and the attack detection method based on single path. And experimental demonstration proves the feasibility and superiority of the proposed algorithm, where the time stability with 27.97 ps, 1.57 ps, and 1.12 ps at average time 1s, 10s, 100s is achieved under TDA and local clock jump. The proposed algorithm can be used to improve the security and resilience of many importance synchronization protocol, such as NTP, PTP, and TWFTT. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.09869 [pdf, ps, other]

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

Authors: Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

Abstract: Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units hav… ▽ More Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units have shown effectiveness compared to spectral features, they still lag behind continuous SSL representations. In this work, we propose MMM, a multi-layer multi-residual multi-stream discrete units extraction method from SSL. Specifically, we introduce iterative residual vector quantization with K-means for different layers in an SSL model to extract multi-stream speech discrete representation. Through extensive experiments in speech recognition, speech resynthesis, and text-to-speech, we demonstrate the proposed MMM can surpass or on-par with neural codec's performance under various conditions. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech2024

arXiv:2406.04551 [pdf, other]

Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Authors: Reyhane Askari Hemmat, Melissa Hall, Alicia Sun, Candace Ross, Michal Drozdzal, Adriana Romero-Soriano

Abstract: With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects… ▽ More With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a "memory bank" of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world. △ Less

Submitted 2 August, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.03488 [pdf, other]

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Abstract: The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throug… ▽ More The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throughput. To enhance memory efficiency and training throughput, in this work, we introduce an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences named Seq1F1B. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing bubble size and memory footprint. Considering that Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baseline methods such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with less memory footprint. Notably, Seq1F1B efficiently trains a LLM with 30B parameters on sequences up to 64k using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM, and now is avaiable at: https://github.com/MayDomine/Seq1F1B.git. △ Less

Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: 12 pages, 4 figures, 6 tables

arXiv:2404.16645 [pdf, other]

Tele-FLM Technical Report

Authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

Abstract: Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a… ▽ More Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.14962 [pdf, ps, other]

Short Regular Girth-8 QC-LDPC Codes From Exponent Matrices with Vertical Symmetry

Authors: Guohua Zhang, Aijing Sun, Ling Liu, Yi Fang

Abstract: To address the challenge of constructing short girth-8 quasi-cyclic (QC) low-density parity-check (LDPC) codes, a novel construction framework based on vertical symmetry (VS) is proposed. Basic properties of the VS structure are presented. With the aid of these properties, existing explicit constructions for column weights from three to five which can be transformed into the VS structure are sorte… ▽ More To address the challenge of constructing short girth-8 quasi-cyclic (QC) low-density parity-check (LDPC) codes, a novel construction framework based on vertical symmetry (VS) is proposed. Basic properties of the VS structure are presented. With the aid of these properties, existing explicit constructions for column weights from three to five which can be transformed into the VS structure are sorted out. Then two novel explicit constructions with the VS structure which guarantee short codes are presented for column weights of three and six. Moreover, an efficient search-based method is also proposed to find short codes with the VS structure. Compared with the state-of-the-art benchmarks, both the explicit constructions and the search-based method presented in this paper can provide shorter codes for most cases. Simulation results show that the new shorter codes can perform almost the same as or better than the longer existing counterparts. Thus, the new shorter codes can fit better with the low-latency requirement for modern communication systems. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: 17 pages, 5 figures; This paper has been accepted by IEEE ISIT2024

arXiv:2404.13945 [pdf, other]

How do LLMs Support Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation

Authors: Liwen Wang, Yuanyuan Yuan, Ao Sun, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang

Abstract: Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse and controllable mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revoluti… ▽ More Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse and controllable mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potentials through instruction-driven methods. Users can now freely describe desired mutations and let MLLMs generate the mutated images. However, the quality of MLLM-produced test inputs in VDL testing remains largely unexplored. We present the first study, aiming to assess MLLMs' adequacy from 1) the semantic validity of MLLM mutated images, 2) the alignment of MLLM mutated images with their text instructions (prompts), 3) the faithfulness of how different mutations preserve semantics that are ought to remain unchanged, and 4) the effectiveness of detecting VDL faults. With large-scale human studies and quantitative evaluations, we identify MLLM's promising potentials in expanding the covered semantics of image mutations. Notably, while SoTA MLLMs (e.g., GPT-4V) fail to support or perform worse in editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-additive" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; these were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and advocate future VDL testing tasks to combine MLLM-based methods and traditional image mutations for comprehensive and reliable testing. △ Less

Submitted 5 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.13375 [pdf, other]

doi 10.1145/3663752.3663756

Beyond Collaborative Filtering: A Relook at Task Formulation in Recommender Systems

Authors: Aixin Sun

Abstract: Recommender Systems (RecSys) have become indispensable in numerous applications, profoundly influencing our everyday experiences. Despite their practical significance, academic research in RecSys often abstracts the formulation of research tasks from real-world contexts, aiming for a clean problem formulation and more generalizable findings. However, it is observed that there is a lack of collecti… ▽ More Recommender Systems (RecSys) have become indispensable in numerous applications, profoundly influencing our everyday experiences. Despite their practical significance, academic research in RecSys often abstracts the formulation of research tasks from real-world contexts, aiming for a clean problem formulation and more generalizable findings. However, it is observed that there is a lack of collective understanding in RecSys academic research. The root of this issue may lie in the simplification of research task definitions, and an overemphasis on modeling the decision outcomes rather than the decision-making process. That is, we often conceptualize RecSys as the task of predicting missing values in a static user-item interaction matrix, rather than predicting a user's decision on the next interaction within a dynamic, changing, and application-specific context. There exists a mismatch between the inputs accessible to a model and the information available to users during their decision-making process, yet the model is tasked to predict users' decisions. While collaborative filtering is effective in learning general preferences from historical records, it is crucial to also consider the dynamic contextual factors in practical settings. Defining research tasks based on application scenarios using domain-specific datasets may lead to more insightful findings. Accordingly, viable solutions and effective evaluations can emerge for different application scenarios. △ Less

Submitted 23 June, 2024; v1 submitted 20 April, 2024; originally announced April 2024.

Comments: Published in ACM SIGWEB Newsletter, Spring 2024: https://dl.acm.org/doi/10.1145/3663752.3663756

Journal ref: SIGWEB Newsl. 2024, Spring, Article 4 (Spring 2024), 11 pages

arXiv:2404.05089 [pdf, other]

SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

Authors: Alexandre Muzio, Alex Sun, Churan He

Abstract: The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoEs) models, known for their dynamic allocation of computational resources based on input. Despite their promise, MoEs face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requiremen… ▽ More The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoEs) models, known for their dynamic allocation of computational resources based on input. Despite their promise, MoEs face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss and reduce the number of activated experts during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs. △ Less

Submitted 7 April, 2024; originally announced April 2024.

Comments: 8+3 pages

arXiv:2403.17173 [pdf, other]

Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships

Authors: Rangel Daroya, Aaron Sun, Subhransu Maji

Abstract: Modeling and visualizing relationships between tasks or datasets is an important step towards solving various meta-tasks such as dataset discovery, multi-tasking, and transfer learning. However, many relationships, such as containment and transferability, are naturally asymmetric and current approaches for representation and visualization (e.g., t-SNE) do not readily support this. We propose Task2… ▽ More Modeling and visualizing relationships between tasks or datasets is an important step towards solving various meta-tasks such as dataset discovery, multi-tasking, and transfer learning. However, many relationships, such as containment and transferability, are naturally asymmetric and current approaches for representation and visualization (e.g., t-SNE) do not readily support this. We propose Task2Box, an approach to represent tasks using box embeddings -- axis-aligned hyperrectangles in low dimensional spaces -- that can capture asymmetric relationships between them through volumetric overlaps. We show that Task2Box accurately predicts unseen hierarchical relationships between nodes in ImageNet and iNaturalist datasets, as well as transferability between tasks in the Taskonomy benchmark. We also show that box embeddings estimated from task representations (e.g., CLIP, Task2Vec, or attribute based) can be used to predict relationships between unseen tasks more accurately than classifiers trained on the same representations, as well as handcrafted asymmetric distances (e.g., KL divergence). This suggests that low-dimensional box embeddings can effectively capture these task relationships and have the added advantage of being interpretable. We use the approach to visualize relationships among publicly available image classification datasets on popular dataset hosting platform called Hugging Face. △ Less

Submitted 29 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.09347 [pdf, other]

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Abstract: Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across mult… ▽ More Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named ``BurstAttention'' to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing 40% communication overheads and achieving 1.37 X speedup during training 128K sequence length on 32 X A100. △ Less

Submitted 6 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: 13 pages, 7 figures

arXiv:2403.02181 [pdf, other]

Not All Layers of LLMs Are Necessary During Inference

Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang

Abstract: Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict… ▽ More Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT, show that AdaInfer can achieve an average of 17.8% pruning ratio, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, the LLMs incorporated with AdaInfer maintain generalizability across tasks. △ Less

Submitted 9 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.17221 [pdf, ps, other]

Sharpened localization of the trailing point of the Pareto record frontier

Authors: James Allen Fill, Daniel Naiman, Ao Sun

Abstract: For $d\ge2$ and iid $d$-dimensional observations $X^{(1)},X^{(2)},\dots$ with independent Exponential$(1)$ coordinates, we revisit the study by Fill and Naiman (Electron. J. Probab., 2020) of the boundary (relative to the closed positive orthant), or "frontier", $F_n$ of the closed Pareto record-setting (RS) region \[ \mbox{RS}_n:=\{0\le x\in{\mathbb R}^d:x\not\prec X^{(i)}\mbox{\ for all… ▽ More For $d\ge2$ and iid $d$-dimensional observations $X^{(1)},X^{(2)},\dots$ with independent Exponential$(1)$ coordinates, we revisit the study by Fill and Naiman (Electron. J. Probab., 2020) of the boundary (relative to the closed positive orthant), or "frontier", $F_n$ of the closed Pareto record-setting (RS) region \[ \mbox{RS}_n:=\{0\le x\in{\mathbb R}^d:x\not\prec X^{(i)}\mbox{\ for all $1\le i\le n$}\} \] at time $n$, where $0\le x$ means that $0\le x_j$ for $1\le j\le d$ and $x\prec y$ means that $x_j<y_j$ for $1\le j\le d$. With $x_+:=\sum_{j=1}^d x_j$, let \[ F_n^-:=\min\{x_+:x\in F_n\}\quad\mbox{and}\quad F_n^+:=\max\{x_+:x\in F_n\}. \] Almost surely, there are for each $n$ unique vectors $λらむだ_n\in F_n$ and $τたう_n\in F_n$ such that $F_n^+=(λらむだ_n)_+$ and $F_n^-=(τたう_n)_+$; we refer to $λらむだ_n$ and $τたう_n$ as the leading and trailing points, respectively, of the frontier. Fill and Naiman provided rather sharp information about the typical and almost sure behavior of $F^+$, but somewhat crude information about $F^-$, namely, that for any $\varepsilon >0$ and $c_n\to\infty$ we have \[ {\mathbb P}(F_n^- -\ln n\in (-(2+\varepsilon)\ln\ln\ln n,c_n))\to 1 \] (describing typical behavior) and almost surely \[ \limsup \frac{F_n^- - \ln n}{\ln \ln n} \le 0 \quad \mbox{and} \quad \liminf \frac{F_n^- - \ln n}{\ln \ln \ln n} \in [-2, -1]. \] In this paper we use the theory of generators (minima of $F_n$) together with the first- and second-moment methods to improve considerably the trailing-point location results to \[ F_n^- - (\ln n - \ln \ln \ln n) \overset{\mathrm{P}}{\longrightarrow} - \ln(d - 1) \] (describing typical behavior) and, for $d \ge 3$, almost surely \begin{align*} &\limsup [F_n^- - (\ln n - \ln \ln \ln n)] \leq -\ln(d - 2) + \ln 2 \\ \mbox{and }&\liminf [F_n^- - (\ln n - \ln \ln \ln n)] \ge - \ln d - \ln 2. \end{align*} △ Less

Submitted 27 February, 2024; originally announced February 2024.

Comments: 32 pages, 2 figures. arXiv admin note: text overlap with arXiv:1901.05621

MSC Class: 60D05 (Primary) 60F05; 60F15; G0G70; 60G17 (Secondary)

arXiv:2402.17220 [pdf, ps, other]

On the probability of a Pareto record

Authors: James Allen Fill, Ao Sun

Abstract: Given a sequence of independent random vectors taking values in ${\mathbb R}^d$ and having common continuous distribution function $F$, say that the $n^{\rm \scriptsize th}$ observation sets a (Pareto) record if it is not dominated (in every coordinate) by any preceding observation. Let $p_n(F) \equiv p_{n, d}(F)$ denote the probability that the $n^{\rm \scriptsize th}$ observation sets a record.… ▽ More Given a sequence of independent random vectors taking values in ${\mathbb R}^d$ and having common continuous distribution function $F$, say that the $n^{\rm \scriptsize th}$ observation sets a (Pareto) record if it is not dominated (in every coordinate) by any preceding observation. Let $p_n(F) \equiv p_{n, d}(F)$ denote the probability that the $n^{\rm \scriptsize th}$ observation sets a record. There are many interesting questions to address concerning $p_n$ and multivariate records more generally, but this short paper focuses on how $p_n$ varies with $F$, particularly if, under $F$, the coordinates exhibit negative dependence or positive dependence (rather than independence, a more-studied case). We introduce new notions of negative and positive dependence ideally suited for such a study, called negative record-setting probability dependence (NRPD) and positive record-setting probability dependence (PRPD), relate these notions to existing notions of dependence, and for fixed $d \geq 2$ and $n \geq 1$ prove that the image of the mapping $p_n$ on the domain of NRPD (respectively, PRPD) distributions is $[p^*_n, 1]$ (resp., $[n^{-1}, p^*_n]$), where $p^*_n$ is the record-setting probability for any continuous $F$ governing independent coordinates. △ Less

Submitted 5 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 16 pages, 1 figure; this revision responds to three anonymous reviews; paper accepted to Probability in the Engineering and Informational Sciences

MSC Class: 60G70 (Primary) 60D05 (Secondary)

arXiv:2402.14440 [pdf, other]

Recommender for Its Purpose: Repeat and Exploration in Food Delivery Recommendations

Authors: Jiayu Li, Aixin Sun, Weizhi Ma, Peijie Sun, Min Zhang

Abstract: Recommender systems have been widely used for various scenarios, such as e-commerce, news, and music, providing online contents to help and enrich users' daily life. Different scenarios hold distinct and unique characteristics, calling for domain-specific investigations and corresponding designed recommender systems. Therefore, in this paper, we focus on food delivery recommendations to unveil uni… ▽ More Recommender systems have been widely used for various scenarios, such as e-commerce, news, and music, providing online contents to help and enrich users' daily life. Different scenarios hold distinct and unique characteristics, calling for domain-specific investigations and corresponding designed recommender systems. Therefore, in this paper, we focus on food delivery recommendations to unveil unique features in this domain, where users order food online and enjoy their meals shortly after delivery. We first conduct an in-depth analysis on food delivery datasets. The analysis shows that repeat orders are prevalent for both users and stores, and situations' differently influence repeat and exploration consumption in the food delivery recommender systems. Moreover, we revisit the ability of existing situation-aware methods for repeat and exploration recommendations respectively, and find them unable to effectively solve both tasks simultaneously. Based on the analysis and experiments, we have designed two separate recommendation models -- ReRec for repeat orders and ExpRec for exploration orders; both are simple in their design and computation. We conduct experiments on three real-world food delivery datasets, and our proposed models outperform various types of baselines on repeat, exploration, and combined recommendation tasks. This paper emphasizes the importance of dedicated analyses and methods for domain-specific characteristics for the recommender system studies. △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: 11 pages, 5 figures

arXiv:2402.11451 [pdf, other]

SciAgent: Tool-augmented Language Models for Scientific Reasoning

Authors: Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun, Hany Awadalla, Weizhu Chen

Abstract: Scientific reasoning poses an excessive challenge for even the most advanced Large Language Models (LLMs). To make this task more practical and solvable for LLMs, we introduce a new task setting named tool-augmented scientific reasoning. This setting supplements LLMs with scalable toolsets, and shifts the focus from pursuing an omniscient problem solver to a proficient tool-user. To facilitate the… ▽ More Scientific reasoning poses an excessive challenge for even the most advanced Large Language Models (LLMs). To make this task more practical and solvable for LLMs, we introduce a new task setting named tool-augmented scientific reasoning. This setting supplements LLMs with scalable toolsets, and shifts the focus from pursuing an omniscient problem solver to a proficient tool-user. To facilitate the research of such setting, we construct a tool-augmented training corpus named MathFunc which encompasses over 30,000 samples and roughly 6,000 tools. Building on MathFunc, we develop SciAgent to retrieve, understand and, if necessary, use tools for scientific problem solving. Additionally, we craft a benchmark, SciToolBench, spanning five scientific domains to evaluate LLMs' abilities with tool assistance. Extensive experiments on SciToolBench confirm the effectiveness of SciAgent. Notably, SciAgent-Mistral-7B surpasses other LLMs with the same size by more than 13% in absolute accuracy. Furthermore, SciAgent-DeepMath-7B shows much superior performance than ChatGPT. △ Less

Submitted 20 February, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

arXiv:2402.07329 [pdf, other]

The Bias of Harmful Label Associations in Vision-Language Models

Authors: Caner Hazirbas, Alicia Sun, Yonathan Efroni, Mark Ibrahim

Abstract: Despite the remarkable performance of foundation vision-language models, the shared representation space for text and vision can also encode harmful label associations detrimental to fairness. While prior work has uncovered bias in vision-language models' (VLMs) classification performance across geography, work has been limited along the important axis of harmful label associations due to a lack o… ▽ More Despite the remarkable performance of foundation vision-language models, the shared representation space for text and vision can also encode harmful label associations detrimental to fairness. While prior work has uncovered bias in vision-language models' (VLMs) classification performance across geography, work has been limited along the important axis of harmful label associations due to a lack of rich, labeled data. In this work, we investigate harmful label associations in the recently released Casual Conversations datasets containing more than 70,000 videos. We study bias in the frequency of harmful label associations across self-provided labels for age, gender, apparent skin tone, and physical adornments across several leading VLMs. We find that VLMs are $4-7$x more likely to harmfully classify individuals with darker skin tones. We also find scaling transformer encoder model size leads to higher confidence in harmful predictions. Finally, we find improvements on standard vision tasks across VLMs does not address disparities in harmful label associations. △ Less

Submitted 15 April, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

arXiv:2402.05945 [pdf, other]

Eliminating Information Leakage in Hard Concept Bottleneck Models with Supervised, Hierarchical Concept Learning

Authors: Ao Sun, Yuanyuan Yuan, Pingchuan Ma, Shuai Wang

Abstract: Concept Bottleneck Models (CBMs) aim to deliver interpretable and interventionable predictions by bridging features and labels with human-understandable concepts. While recent CBMs show promising potential, they suffer from information leakage, where unintended information beyond the concepts (either when concepts are represented with probabilities or binary states) are leaked to the subsequent la… ▽ More Concept Bottleneck Models (CBMs) aim to deliver interpretable and interventionable predictions by bridging features and labels with human-understandable concepts. While recent CBMs show promising potential, they suffer from information leakage, where unintended information beyond the concepts (either when concepts are represented with probabilities or binary states) are leaked to the subsequent label prediction. Consequently, distinct classes are falsely classified via indistinguishable concepts, undermining the interpretation and intervention of CBMs. This paper alleviates the information leakage issue by introducing label supervision in concept predication and constructing a hierarchical concept set. Accordingly, we propose a new paradigm of CBMs, namely SupCBM, which achieves label predication via predicted concepts and a deliberately-designed intervention matrix. SupCBM focuses on concepts that are mostly relevant to the predicted label and only distinguishes classes when different concepts are presented. Our evaluations show that SupCBM outperforms SOTA CBMs over diverse datasets. It also manifests better generality across different backbone models. With proper quantification of information leakage in different CBMs, we demonstrate that SupCBM significantly reduces the information leakage. △ Less

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.14194 [pdf, other]

Parameter-Efficient Conversational Recommender System as a Language Processing Task

Authors: Mathieu Ravaut, Hao Zhang, Lu Xu, Aixin Sun, Yong Liu

Abstract: Conversational recommender systems (CRS) aim to recommend relevant items to users by eliciting user preference through natural language conversation. Prior work often utilizes external knowledge graphs for items' semantic information, a language model for dialogue generation, and a recommendation module for ranking relevant items. This combination of multiple components suffers from a cumbersome t… ▽ More Conversational recommender systems (CRS) aim to recommend relevant items to users by eliciting user preference through natural language conversation. Prior work often utilizes external knowledge graphs for items' semantic information, a language model for dialogue generation, and a recommendation module for ranking relevant items. This combination of multiple components suffers from a cumbersome training process, and leads to semantic misalignment issues between dialogue generation and item recommendation. In this paper, we represent items in natural language and formulate CRS as a natural language processing task. Accordingly, we leverage the power of pre-trained language models to encode items, understand user intent via conversation, perform item recommendation through semantic matching, and generate dialogues. As a unified model, our PECRS (Parameter-Efficient CRS), can be optimized in a single stage, without relying on non-textual metadata such as a knowledge graph. Experiments on two benchmark CRS datasets, ReDial and INSPIRED, demonstrate the effectiveness of PECRS on recommendation and conversation. Our code is available at: https://github.com/Ravoxsg/efficient_unified_crs. △ Less

Submitted 24 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 9 pages, 4 figures, 8 tables, EACL 2024 conference, fixed typo

arXiv:2401.08232 [pdf, other]

Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization

Authors: Chongzhi Zhang, Mingyuan Zhang, Zhiyang Teng, Jiayi Li, Xizhou Zhu, Lewei Lu, Ziwei Liu, Aixin Sun

Abstract: Natural Language Video Localization (NLVL), grounding phrases from natural language descriptions to corresponding video segments, is a complex yet critical task in video understanding. Despite ongoing advancements, many existing solutions lack the capability to globally capture temporal dynamics of the video data. In this study, we present a novel approach to NLVL that aims to address this issue.… ▽ More Natural Language Video Localization (NLVL), grounding phrases from natural language descriptions to corresponding video segments, is a complex yet critical task in video understanding. Despite ongoing advancements, many existing solutions lack the capability to globally capture temporal dynamics of the video data. In this study, we present a novel approach to NLVL that aims to address this issue. Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process, based on the input video and language query. The main challenges are the inherent sparsity and discontinuity of a 2D temporal map in devising the diffusion decoder. To address these challenges, we introduce a multi-scale technique and develop an innovative diffusion decoder. Our approach effectively encapsulates the interaction between the query and video data across various time scales. Experiments on the Charades and DiDeMo datasets underscore the potency of our design. △ Less

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.07342 [pdf, other]

Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms

Authors: Anchen Sun, Juan J Londono, Batya Elbaum, Luis Estrada, Roberto Jose Lazo, Laura Vitale, Hugo Gonzalez Villasanti, Riccardo Fusaroli, Lynn K Perry, Daniel S Messinger

Abstract: Young children spend substantial portions of their waking hours in noisy preschool classrooms. In these environments, children's vocal interactions with teachers are critical contributors to their language outcomes, but manually transcribing these interactions is prohibitive. Using audio from child- and teacher-worn recorders, we propose an automated framework that uses open source software both t… ▽ More Young children spend substantial portions of their waking hours in noisy preschool classrooms. In these environments, children's vocal interactions with teachers are critical contributors to their language outcomes, but manually transcribing these interactions is prohibitive. Using audio from child- and teacher-worn recorders, we propose an automated framework that uses open source software both to classify speakers (ALICE) and to transcribe their utterances (Whisper). We compare results from our framework to those from a human expert for 110 minutes of classroom recordings, including 85 minutes from child-word microphones (n=4 children) and 25 minutes from teacher-worn microphones (n=2 teachers). The overall proportion of agreement, that is, the proportion of correctly classified teacher and child utterances, was .76, with an error-corrected kappa of .50 and a weighted F1 of .76. The word error rate for both teacher and child transcriptions was .15, meaning that 15% of words would need to be deleted, added, or changed to equate the Whisper and expert transcriptions. Moreover, speech features such as the mean length of utterances in words, the proportion of teacher and child utterances that were questions, and the proportion of utterances that were responded to within 2.5 seconds were similar when calculated separately from expert and automated transcriptions. The results suggest substantial progress in analyzing classroom speech that may support children's language development. Future research using natural language processing is under way to improve speaker classification and to analyze results from the application of the automated framework to a larger dataset containing classroom recordings from 13 children and 3 teachers observed on 17 occasions over one year. △ Less

Submitted 10 April, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

Comments: 8 pages, 4 figures, 3 tables, The paper has been accepted to 2024 IEEE International Conference on Development and Learning (ICDL) as a full oral presentation and will appear in the IEEE ICDL proceedings

arXiv:2401.00431 [pdf, other]

Wild2Avatar: Rendering Humans Behind Occlusions

Authors: Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli

Abstract: Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene. Those methods cannot be used to render humans in real-world scenes where obstacles may block the camera's view and lead to partial occlusions. In this work, we present Wild2Avatar, a neural… ▽ More Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene. Those methods cannot be used to render humans in real-world scenes where obstacles may block the camera's view and lead to partial occlusions. In this work, we present Wild2Avatar, a neural rendering approach catered for occluded in-the-wild monocular videos. We propose occlusion-aware scene parameterization for decoupling the scene into three parts - occlusion, human, and background. Additionally, extensive objective functions are designed to help enforce the decoupling of the human from both the occlusion and the background and to ensure the completeness of the human model. We verify the effectiveness of our approach with experiments on in-the-wild videos. △ Less

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.11286 [pdf, ps, other]

Envy-free House Allocation under Uncertain Preferences

Authors: Haris Aziz, Isaiah Iliffe, Bo Li, Angus Ritossa, Ankang Sun, Mashbat Suzuki

Abstract: We study the envy-free house allocation problem when agents have uncertain preferences over items and consider several well-studied preference uncertainty models. The central problem that we focus on is computing an allocation that has the highest probability of being envy-free. We show that each model leads to a distinct set of algorithmic and complexity results, including detailed results on (in… ▽ More We study the envy-free house allocation problem when agents have uncertain preferences over items and consider several well-studied preference uncertainty models. The central problem that we focus on is computing an allocation that has the highest probability of being envy-free. We show that each model leads to a distinct set of algorithmic and complexity results, including detailed results on (in-)approximability. En route, we consider two related problems of checking whether there exists an allocation that is possibly or necessarily envy-free. We give a complete picture of the computational complexity of these two problems for all the uncertainty models we consider. △ Less

Submitted 18 December, 2023; originally announced December 2023.

Comments: To appear in the proceeding of AAAI2024

arXiv:2312.05187 [pdf, other]

Seamless: Multilingual Expressive and Streaming Speech Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek , et al. (40 additional authors not shown)

Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4… ▽ More Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2312.04515 [pdf, other]

Efficient Monotonic Multihead Attention

Authors: Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello

Abstract: We introduce the Efficient Monotonic Multihead Attention (EMMA), a state-of-the-art simultaneous translation model with numerically-stable and unbiased monotonic alignment estimation. In addition, we present improved training and inference strategies, including simultaneous fine-tuning from an offline translation model and reduction of monotonic alignment variance. The experimental results demonst… ▽ More We introduce the Efficient Monotonic Multihead Attention (EMMA), a state-of-the-art simultaneous translation model with numerically-stable and unbiased monotonic alignment estimation. In addition, we present improved training and inference strategies, including simultaneous fine-tuning from an offline translation model and reduction of monotonic alignment variance. The experimental results demonstrate that the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task. △ Less

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2310.10570 [pdf, other]

On Context Utilization in Summarization with Large Language Models

Authors: Mathieu Ravaut, Aixin Sun, Nancy F. Chen, Shafiq Joty

Abstract: Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. However, in question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped perfo… ▽ More Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. However, in question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern concerning where the answer is located within the input. This bias raises concerns, particularly in summarization where crucial content may be dispersed throughout the source document(s). Besides, in summarization, mapping facts from the source to the summary is not trivial as salient content is usually re-phrased. In this paper, we conduct the first comprehensive study on context utilization and position bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5 evaluation metrics. We introduce a new evaluation benchmark called MiddleSum on the which we benchmark two alternative inference methods to alleviate position bias: hierarchical summarization and incremental summarization. Our code and data can be found here: https://github.com/ntunlp/MiddleSum. △ Less

Submitted 14 June, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: ACL 2024. 9 pages, 7 figures, 3 tables

arXiv:2310.05634 [pdf, other]

Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution

Authors: Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, Aixin Sun

Abstract: Although achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. Although language attribution can be a potential solution, there are no suitable benchmarks and evaluation metrics to attribute LLMs to structured knowledge. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns wit… ▽ More Although achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. Although language attribution can be a potential solution, there are no suitable benchmarks and evaluation metrics to attribute LLMs to structured knowledge. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns with conventional attributed LMs. First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new ``Conscious Incompetence" setting considering the incomplete knowledge repository, where the model identifies the need for supporting knowledge beyond the provided KG. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment. To implement the above innovations, we build a dataset in biography domain BioKaLMA via evolutionary question generation strategy, to control the question complexity and necessary knowledge to the answer. For evaluation, we develop a baseline solution and demonstrate the room for improvement in LLMs' citation generation, emphasizing the importance of incorporating the "Conscious Incompetence" setting, and the critical role of retrieval accuracy. △ Less

Submitted 23 May, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: acl findings 2024

arXiv:2310.02720 [pdf, other]

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Authors: Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun

Abstract: Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce a SSL model that l… ▽ More Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce a SSL model that leverages a hierarchical Transformer architecture, complemented by HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also exhibits superior or comparable performance to the original HuBERT model over various tasks. Specifically, significant performance improvements over the original HuBERT have been observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark as well as in evaluations using the Speech Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB). △ Less

Submitted 30 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted at ICLR2024 as spotlight

arXiv:2309.10989 [pdf, other]

COSE: A Consistency-Sensitivity Metric for Saliency on Image Classification

Authors: Rangel Daroya, Aaron Sun, Subhransu Maji

Abstract: We present a set of metrics that utilize vision priors to effectively assess the performance of saliency methods on image classification tasks. To understand behavior in deep learning models, many methods provide visual saliency maps emphasizing image regions that most contribute to a model prediction. However, there is limited work on analyzing the reliability of saliency methods in explaining mo… ▽ More We present a set of metrics that utilize vision priors to effectively assess the performance of saliency methods on image classification tasks. To understand behavior in deep learning models, many methods provide visual saliency maps emphasizing image regions that most contribute to a model prediction. However, there is limited work on analyzing the reliability of saliency methods in explaining model decisions. We propose the metric COnsistency-SEnsitivity (COSE) that quantifies the equivariant and invariant properties of visual model explanations using simple data augmentations. Through our metrics, we show that although saliency methods are thought to be architecture-independent, most methods could better explain transformer-based models over convolutional-based models. In addition, GradCAM was found to outperform other methods in terms of COSE but was shown to have limitations such as lack of variability for fine-grained datasets. The duality between consistency and sensitivity allow the analysis of saliency methods from different angles. Ultimately, we find that it is important to balance these two metrics for a saliency map to faithfully show model behavior. △ Less

Submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.08837 [pdf, other]

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Authors: Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao

Abstract: This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and i… ▽ More This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is experimented on two languages, English and Mandarin, using single-speaker, few samples of target speakers, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency performance between input text and generated audio, and also get higher scores in the subjective prosodic evaluation, and show the ability of voice conversion. Besides, the efficiency of the model is largely boosted through the design of the AI chip operator with 5x acceleration. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

arXiv:2309.07001 [pdf, other]

Modeling the Evolutionary Trends in Corporate ESG Reporting: A Study based on Knowledge Management Model

Authors: Ziyuan Xia, Anchen Sun, Xiaodong Cai, Saixing Zeng

Abstract: Environmental, social, and governance (ESG) reports are globally recognized as a keystone in sustainable enterprise development. However, current literature has not concluded the development of topics and trends in ESG contexts in the twenty-first century. Therefore, We selected 1114 ESG reports from firms in the technology industry to analyze the evolutionary trends of ESG topics by text mining.… ▽ More Environmental, social, and governance (ESG) reports are globally recognized as a keystone in sustainable enterprise development. However, current literature has not concluded the development of topics and trends in ESG contexts in the twenty-first century. Therefore, We selected 1114 ESG reports from firms in the technology industry to analyze the evolutionary trends of ESG topics by text mining. We discovered the homogenization effect towards low environmental, medium governance, and high social features in the evolution. We also designed a strategic framework to look closer into the dynamic changes of firms' within-industry scores and across-domain importances. We found that companies are gradually converging towards the third quadrant, which indicates that firms contribute less to industrial outstanding and professional distinctiveness in ESG reporting. Firms choose to imitate ESG reports from each other to mitigate uncertainty and enhance behavioral legitimacy. △ Less

Submitted 25 May, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: 29 pages, 10 figures, 3 tables

arXiv:2309.06789 [pdf, other]

An Image Dataset for Benchmarking Recommender Systems with Raw Pixels

Authors: Yu Cheng, Yunzhu Pan, Jiaqi Zhang, Yongxin Ni, Aixin Sun, Fajie Yuan

Abstract: Recommender systems (RS) have achieved significant success by leveraging explicit identification (ID) features. However, the full potential of content features, especially the pure image pixel features, remains relatively unexplored. The limited availability of large, diverse, and content-driven image recommendation datasets has hindered the use of raw images as item representations. In this regar… ▽ More Recommender systems (RS) have achieved significant success by leveraging explicit identification (ID) features. However, the full potential of content features, especially the pure image pixel features, remains relatively unexplored. The limited availability of large, diverse, and content-driven image recommendation datasets has hindered the use of raw images as item representations. In this regard, we present PixelRec, a massive image-centric recommendation dataset that includes approximately 200 million user-image interactions, 30 million users, and 400,000 high-quality cover images. By providing direct access to raw image pixels, PixelRec enables recommendation models to learn item representation directly from them. To demonstrate its utility, we begin by presenting the results of several classical pure ID-based baseline models, termed IDNet, trained on PixelRec. Then, to show the effectiveness of the dataset's image features, we substitute the itemID embeddings (from IDNet) with a powerful vision encoder that represents items using their raw image pixels. This new model is dubbed PixelNet.Our findings indicate that even in standard, non-cold start recommendation settings where IDNet is recognized as highly effective, PixelNet can already perform equally well or even better than IDNet. Moreover, PixelNet has several other notable advantages over IDNet, such as being more effective in cold-start and cross-domain recommendation scenarios. These results underscore the importance of visual features in PixelRec. We believe that PixelRec can serve as a critical resource and testing ground for research on recommendation models that emphasize image pixel content. The dataset, code, and leaderboard will be available at https://github.com/westlake-repl/PixelRec. △ Less

Submitted 17 September, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2309.03852 [pdf, other]

FLM-101B: An Open LLM and How to Train It with $100K Budget

Authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, Zheng Zhang, Aixin Sun, Yequan Wang

Abstract: Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others. Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations. In this paper, we report a solution to significantly reduce LLM training cost through a growth strategy. We demonstrate that a 101B-parameter LLM with 0.… ▽ More Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others. Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations. In this paper, we report a solution to significantly reduce LLM training cost through a growth strategy. We demonstrate that a 101B-parameter LLM with 0.31T tokens can be trained with a budget of 100K US dollars. Inspired by IQ tests, we also consolidate an additional range of evaluations on top of existing evaluations that focus on knowledge-oriented abilities. These IQ evaluations include symbolic mapping, rule understanding, pattern mining, and anti-interference. Such evaluations minimize the potential impact of memorization. Experimental results show that our model, named FLM-101B, trained with a budget of 100K US dollars, achieves performance comparable to powerful and well-known models, e.g., GPT-3 and GLM-130B, especially on the additional range of IQ evaluations. The checkpoint of FLM-101B is released at https://huggingface.co/CofeAI/FLM-101B. △ Less

Submitted 17 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

arXiv:2308.13561 [pdf, other]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Authors: Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasad Somasundaram, Gustavo Solaira , et al. (49 additional authors not shown)

Abstract: Egocentric, multi-modal data as available on future augmented reality (AR) devices provides unique challenges and opportunities for machine perception. These future devices will need to be all-day wearable in a socially acceptable form-factor to support always available, context-aware and personalized AI applications. Our team at Meta Reality Labs Research built the Aria device, an egocentric, mul… ▽ More Egocentric, multi-modal data as available on future augmented reality (AR) devices provides unique challenges and opportunities for machine perception. These future devices will need to be all-day wearable in a socially acceptable form-factor to support always available, context-aware and personalized AI applications. Our team at Meta Reality Labs Research built the Aria device, an egocentric, multi-modal data recording and streaming device with the goal to foster and accelerate research in this area. In this paper, we describe the Aria device hardware including its sensor configuration and the corresponding software tools that enable recording and processing of such data. △ Less

Submitted 1 October, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

arXiv:2308.11596 [pdf, other]

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim , et al. (43 additional authors not shown)

Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded s… ▽ More What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication △ Less

Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

ACM Class: I.2.7

arXiv:2308.04622 [pdf, other]

Rendering Humans from Object-Occluded Monocular Videos

Authors: Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, Li Fei-Fei

Abstract: 3D understanding and rendering of moving humans from monocular videos is a challenging task. Despite recent progress, the task remains difficult in real-world scenarios, where obstacles may block the camera view and cause partial occlusions in the captured videos. Existing methods cannot handle such defects due to two reasons. First, the standard rendering strategy relies on point-point mapping, w… ▽ More 3D understanding and rendering of moving humans from monocular videos is a challenging task. Despite recent progress, the task remains difficult in real-world scenarios, where obstacles may block the camera view and cause partial occlusions in the captured videos. Existing methods cannot handle such defects due to two reasons. First, the standard rendering strategy relies on point-point mapping, which could lead to dramatic disparities between the visible and occluded areas of the body. Second, the naive direct regression approach does not consider any feasibility criteria (ie, prior information) for rendering under occlusions. To tackle the above drawbacks, we present OccNeRF, a neural rendering method that achieves better rendering of humans in severely occluded scenes. As direct solutions to the two drawbacks, we propose surface-based rendering by integrating geometry and visibility priors. We validate our method on both simulated and real-world occlusions and demonstrate our method's superiority. △ Less

Submitted 8 August, 2023; originally announced August 2023.

Comments: ICCV 2023, project page: https://cs.stanford.edu/~xtiange/projects/occnerf/

arXiv:2307.16382 [pdf, other]

Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information?

Authors: Albert Yu Sun, Eliott Zemour, Arushi Saxena, Udith Vaidyanathan, Eric Lin, Christian Lau, Vaikkunth Mugunthan

Abstract: Machine learning practitioners often fine-tune generative pre-trained models like GPT-3 to improve model performance at specific tasks. Previous works, however, suggest that fine-tuned machine learning models memorize and emit sensitive information from the original fine-tuning dataset. Companies such as OpenAI offer fine-tuning services for their models, but no prior work has conducted a memoriza… ▽ More Machine learning practitioners often fine-tune generative pre-trained models like GPT-3 to improve model performance at specific tasks. Previous works, however, suggest that fine-tuned machine learning models memorize and emit sensitive information from the original fine-tuning dataset. Companies such as OpenAI offer fine-tuning services for their models, but no prior work has conducted a memorization attack on any closed-source models. In this work, we simulate a privacy attack on GPT-3 using OpenAI's fine-tuning API. Our objective is to determine if personally identifiable information (PII) can be extracted from this model. We (1) explore the use of naive prompting methods on a GPT-3 fine-tuned classification model, and (2) we design a practical word generation task called Autocomplete to investigate the extent of PII memorization in fine-tuned GPT-3 within a real-world context. Our findings reveal that fine-tuning GPT3 for both tasks led to the model memorizing and disclosing critical personally identifiable information (PII) obtained from the underlying fine-tuning dataset. To encourage further research, we have made our codes and datasets publicly available on GitHub at: https://github.com/albertsun1/gpt3-pii-attacks △ Less

Submitted 15 April, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.16090 [pdf, other]

Rapid Flood Inundation Forecast Using Fourier Neural Operator

Authors: Alexander Y. Sun, Zhi Li, Wonhyun Lee, Qixing Huang, Bridget R. Scanlon, Clint Dawson

Abstract: Flood inundation forecast provides critical information for emergency planning before and during flood events. Real time flood inundation forecast tools are still lacking. High-resolution hydrodynamic modeling has become more accessible in recent years, however, predicting flood extents at the street and building levels in real-time is still computationally demanding. Here we present a hybrid proc… ▽ More Flood inundation forecast provides critical information for emergency planning before and during flood events. Real time flood inundation forecast tools are still lacking. High-resolution hydrodynamic modeling has become more accessible in recent years, however, predicting flood extents at the street and building levels in real-time is still computationally demanding. Here we present a hybrid process-based and data-driven machine learning (ML) approach for flood extent and inundation depth prediction. We used the Fourier neural operator (FNO), a highly efficient ML method, for surrogate modeling. The FNO model is demonstrated over an urban area in Houston (Texas, U.S.) by training using simulated water depths (in 15-min intervals) from six historical storm events and then tested over two holdout events. Results show FNO outperforms the baseline U-Net model. It maintains high predictability at all lead times tested (up to 3 hrs) and performs well when applying to new sites, suggesting strong generalization skill. △ Less

Submitted 29 July, 2023; originally announced July 2023.

Comments: Artificial Intelligence (AI) and Humanitarian Assistance and Disaster Recovery (HADR) workshop, ICCV 2023 in Paris, France

arXiv:2307.15061 [pdf, other]

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

Authors: Lingdong Kong, Yaru Niu, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Ding Zhao, Liangjun Zhang, Hesheng Wang, Wei Tsang Ooi, Ruijie Zhu, Ziyang Song, Li Liu, Tianzhu Zhang, Jun Yu, Mohan Jing, Pengwei Li, Xiaohua Qi, Cheng Jin, Yingfeng Chen, Jie Hou, Jie Zhang, Zhen Kan, Qiang Ling, Liang Peng , et al. (18 additional authors not shown)

Abstract: Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summari… ▽ More Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website. △ Less

Submitted 27 July, 2023; originally announced July 2023.

Comments: Technical Report; 65 pages, 34 figures, 24 tables; Code at https://github.com/ldkong1205/RoboDepth

arXiv:2307.09985 [pdf, other]

Our Model Achieves Excellent Performance on MovieLens: What Does it Mean?

Authors: Yu-chen Fan, Yitong Ji, Jie Zhang, Aixin Sun

Abstract: A typical benchmark dataset for recommender system (RecSys) evaluation consists of user-item interactions generated on a platform within a time period. The interaction generation mechanism partially explains why a user interacts with (e.g., like, purchase, rate) an item, and the context of when a particular interaction happened. In this study, we conduct a meticulous analysis of the MovieLens data… ▽ More A typical benchmark dataset for recommender system (RecSys) evaluation consists of user-item interactions generated on a platform within a time period. The interaction generation mechanism partially explains why a user interacts with (e.g., like, purchase, rate) an item, and the context of when a particular interaction happened. In this study, we conduct a meticulous analysis of the MovieLens dataset and explain the potential impact of using the dataset for evaluating recommendation algorithms. We make a few main findings from our analysis. First, there are significant differences in user interactions at the different stages when a user interacts with the MovieLens platform. The early interactions largely define the user portrait which affects the subsequent interactions. Second, user interactions are highly affected by the candidate movies that are recommended by the platform's internal recommendation algorithm(s). Third, changing the order of user interactions makes it more difficult for sequential algorithms to capture the progressive interaction process. We further discuss the discrepancy between the interaction generation mechanism that is employed by the MovieLens system and that of typical real-world recommendation scenarios. In summary, the MovieLens platform demonstrates an efficient and effective way of collecting user preferences to address cold-starts. However, models that achieve excellent recommendation accuracy on the MovieLens dataset may not demonstrate superior performance in practice, for at least two kinds of differences: (i) the differences in the contexts of user-item interaction generation, and (ii) the differences in user knowledge about the item collections. While results on MovieLens can be useful as a reference, they should not be solely relied upon as the primary justification for the effectiveness of a recommendation system model. △ Less

Submitted 24 March, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.06506 [pdf, other]

Research Explosion: More Effort to Climb onto Shoulders of the Giant

Authors: Guoxiu He, Aixin Sun, Wei Lu

Abstract: Fast-growing scientific publications present challenges to the scientific community. In this paper, we describe their implications to researchers. As references form explicit foundations for researchers to conduct a study, we investigate the evolution in reference patterns based on 60.8 million papers published from 1960 to 2015. The results demonstrate that recent papers contain more references t… ▽ More Fast-growing scientific publications present challenges to the scientific community. In this paper, we describe their implications to researchers. As references form explicit foundations for researchers to conduct a study, we investigate the evolution in reference patterns based on 60.8 million papers published from 1960 to 2015. The results demonstrate that recent papers contain more references than older ones, especially the well-cited papers compared with other papers. Well-cited papers receive 10 or more citations within 5 years of publication. Their references cover a longer period from classic research to very recent studies. Authors of well-cited papers are also farsighted to discover the reference papers with good potential to receive high citation numbers in near future. We also discover that the number of accumulative publications has a negative impact on next-5-year citation count for most fields except Chemistry, Materials science, Environmental science, Biology, and Engineering. Our findings suggest that researchers are expected to devote more effort to producing impactful research. Based on all these findings, we strongly advise against judging researchers simply based on the number of their publications. On the other hand, authors and reviewers should ensure that published papers contain adequate contributions. The code for our analysis is on GitHub: https://github.com/ECNU-Text-Computing/Research-Explosion. △ Less

Submitted 12 July, 2023; originally announced July 2023.

Comments: 21 pages, 24 figures

Showing 1–50 of 154 results for author: Sun, A