Search | arXiv e-print repository

Showing 1–50 of 175 results for author: Kautz, J

Searching in archive cs.
  1. arXiv:2408.16426  [pdf, other]

    cs.CV cs.AI

    COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation

    Authors: Jiefeng Li, Ye Yuan, Davis Rempe, Haotian Zhang, Pavlo Molchanov, Cewu Lu, Jan Kautz, Umar Iqbal

    Abstract: Estimating global human motion from moving cameras is challenging due to the entanglement of human and camera motions. To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained contr…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: ECCV 2024

  2. arXiv:2408.15998  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

    Authors: Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu

    Abstract: The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vis…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Github: https://github.com/NVlabs/Eagle, HuggingFace: https://huggingface.co/NVEagle

  3. arXiv:2408.11796  [pdf, other]

    cs.CL cs.AI cs.LG

    LLM Pruning and Distillation in Practice: The Minitron Approach

    Authors: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

    Abstract: We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Align…

    Submitted 26 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: v2: Added missing references. Cleaned up runtime performance section

  4. arXiv:2408.10188  [pdf, other]

    cs.CV cs.CL

    LongVILA: Scaling Long-Context Visual Language Models for Long Videos

    Authors: Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

    Abstract: Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long su…

    Submitted 21 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: Code and models are available at https://github.com/NVlabs/VILA/blob/main/LongVILA.md

  5. arXiv:2407.16286  [pdf, other]

    cs.LG cs.AI

    A deeper look at depth pruning of LLMs

    Authors: Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov

    Abstract: Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore d…

    Submitted 23 July, 2024; originally announced July 2024.

  6. arXiv:2407.14679  [pdf, other]

    cs.CL cs.AI cs.LG

    Compact Language Models via Pruning and Knowledge Distillation

    Authors: Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

    Abstract: Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set o…

    Submitted 19 July, 2024; originally announced July 2024.

  7. arXiv:2407.08083  [pdf, other]

    cs.CV

    MambaVision: A Hybrid Mamba-Transformer Vision Backbone

    Authors: Ali Hatamizadeh, Jan Kautz

    Abstract: We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our r…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Tech. report

  8. arXiv:2406.10260  [pdf, other]

    cs.CL cs.LG

    Flextron: Many-in-One Flexible Large Language Model

    Authors: Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov

    Abstract: Training modern LLMs is extremely resource intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical. In this paper, we introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment. The Flextron architecture utilizes a nested elasti…

    Submitted 28 August, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  9. arXiv:2406.07887  [pdf, other]

    cs.LG cs.CL

    An Empirical Study of Mamba-based Language Models

    Authors: Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a contr…

    Submitted 12 June, 2024; originally announced June 2024.

  10. arXiv:2406.06978  [pdf, other]

    cs.CV

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Authors: Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, Yu-Gang Jiang, Jose M. Alvarez

    Abstract: We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment…

    Submitted 29 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand Challenge

  11. arXiv:2406.02509  [pdf, other]

    cs.CV

    CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    Authors: Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat

    Abstract: Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Contro…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Project page: https://ir1d.github.io/CamCo/

  12. arXiv:2406.01584  [pdf, other]

    cs.CV

    SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model

    Authors: An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

    Abstract: Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curati…

    Submitted 18 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Project Page: https://www.anjiecheng.me/SpatialRGPT

  13. arXiv:2405.19335  [pdf, other]

    cs.CV cs.CL cs.LG

    X-VILA: Cross-Modality Alignment for Large Language Model

    Authors: Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin

    Abstract: We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effectiv…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Technical Report

  14. arXiv:2405.01533  [pdf, other]

    cs.CV

    OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

    Authors: Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez

    Abstract: The advances in multimodal large language models (MLLMs) have led to growing interests in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work propos…

    Submitted 2 May, 2024; originally announced May 2024.

  15. arXiv:2403.19046  [pdf, other]

    cs.CV cs.AI

    LITA: Language Instructed Temporal-Localization Assistant

    Authors: De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz

    Abstract: There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time…

    Submitted 27 March, 2024; originally announced March 2024.

  16. arXiv:2401.13786  [pdf, other]

    cs.CV

    FoVA-Depth: Field-of-View Agnostic Depth Estimation for Cross-Dataset Generalization

    Authors: Daniel Lichy, Hang Su, Abhishek Badki, Jan Kautz, Orazio Gallo

    Abstract: Wide field-of-view (FoV) cameras efficiently capture large portions of the scene, which makes them attractive in multiple domains, such as automotive and robotics. For such applications, estimating depth from multiple images is a critical task, and therefore, a large amount of ground truth (GT) data is available. Unfortunately, most of the GT data is for pinhole cameras, making it impossible to pr…

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: 3DV 2024 (Oral); Project Website: https://research.nvidia.com/labs/lpr/fova-depth/

  17. arXiv:2312.11461  [pdf, other]

    cs.CV cs.GR cs.LG

    GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

    Authors: Ye Yuan, Xueting Li, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, Umar Iqbal

    Abstract: Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions, addressing the limitations (e.g., flexibility and efficiency) imposed by mesh or NeRF-based representations. However, a n…

    Submitted 29 March, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Project website: https://nvlabs.github.io/GAvatar

  18. arXiv:2312.08344  [pdf, other]

    cs.CV cs.AI cs.RO

    FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

    Authors: Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield

    Abstract: We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given, or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit represen…

    Submitted 26 March, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  19. arXiv:2312.07533  [pdf, other]

    cs.CV

    VILA: On Pre-training for Visual Language Models

    Authors: Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

    Abstract: Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-trai…

    Submitted 16 May, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  20. arXiv:2312.07504  [pdf, other]

    cs.CV

    COLMAP-Free 3D Gaussian Splatting

    Authors: Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang

    Abstract: While neural rendering has led to impressive advances in scene reconstruction and novel view synthesis, it relies heavily on accurately pre-computed camera poses. To relax this constraint, multiple efforts have been made to train Neural Radiance Fields (NeRFs) without pre-processed camera poses. However, the implicit representations of NeRFs provide extra challenges to optimize the 3D structure an…

    Submitted 30 July, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: Project Page: https://oasisyang.github.io/colmap-free-3dgs

  21. arXiv:2312.06709  [pdf, other]

    cs.CV

    AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One

    Authors: Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov

    Abstract: A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are trained with distinct objectives, exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name…

    Submitted 30 April, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Version 3: CVPR Camera Ready, reconfigured full paper, table 1 is now more comprehensive Version 2: Added more acknowledgements and updated table 7 with more recent results. Ensured that the link in the abstract to our code is working properly Version 3: Fix broken hyperlinks

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12490-12500

  22. arXiv:2312.03031  [pdf, other]

    cs.CV

    Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

    Authors: Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez

    Abstract: End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observ…

    Submitted 2 June, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  23. arXiv:2312.02139  [pdf, other]

    cs.CV cs.AI cs.LG

    DiffiT: Diffusion Vision Transformers for Image Generation

    Authors: Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat

    Abstract: Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new m…

    Submitted 28 August, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Accepted to ECCV'24

  24. arXiv:2310.19731  [pdf, other]

    cs.CV cs.AI cs.LG

    ViR: Towards Efficient Vision Retention Backbones

    Authors: Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz

    Abstract: Vision Transformers (ViTs) have attracted a lot of popularity in recent years, due to their exceptional capabilities in modeling long-range spatial dependencies and scalability for large scale training. Although the training parallelism of self-attention mechanism plays an important role in retaining great performance, its quadratic complexity baffles the application of ViTs in many scenarios whic…

    Submitted 26 January, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Introduction of Vision Retention Networks (ViR) for Efficient Visual Modeling

  25. arXiv:2310.19694  [pdf, other]

    cs.LG

    Convolutional State Space Models for Long-Range Spatiotemporal Modeling

    Authors: Jimmy T. H. Smith, Shalini De Mello, Jan Kautz, Scott W. Linderman, Wonmin Byeon

    Abstract: Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states with recurrent neural networks, but their sequential computation makes them slow to train. In contrast, Transformers can process an entire spatiotemporal sequen…

    Submitted 30 October, 2023; originally announced October 2023.

  26. arXiv:2310.13768  [pdf, other]

    cs.CV

    PACE: Human and Camera Motion Estimation from in-the-wild Videos

    Authors: Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, Umar Iqbal

    Abstract: We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: 3DV 2024. Project page: https://nvlabs.github.io/PACE/

  27. arXiv:2310.01799  [pdf, other]

    eess.IV cs.CV

    SMRD: SURE-based Robust MRI Reconstruction with Diffusion Models

    Authors: Batu Ozturkler, Chao Liu, Benjamin Eckart, Morteza Mardani, Jiaming Song, Jan Kautz

    Abstract: Diffusion models have recently gained popularity for accelerated MRI reconstruction due to their high sample quality. They can effectively serve as rich data priors while incorporating the forward model flexibly at inference time, and they have been shown to be more robust than unrolled methods under distribution shifts. However, diffusion models require careful tuning of inference hyperparameters…

    Submitted 18 October, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: MICCAI 2023

  28. arXiv:2309.15214  [pdf, other]

    cs.LG physics.ao-ph

    Residual Corrective Diffusion Modeling for Km-scale Atmospheric Downscaling

    Authors: Morteza Mardani, Noah Brenowitz, Yair Cohen, Jaideep Pathak, Chieh-Yu Chen, Cheng-Chin Liu, Arash Vahdat, Mohammad Amin Nabian, Tao Ge, Akshay Subramaniam, Karthik Kashinath, Jan Kautz, Mike Pritchard

    Abstract: The state of the art for physical hazard prediction from weather and climate requires expensive km-scale numerical simulations driven by coarser resolution global inputs. Here, a generative diffusion architecture is explored for downscaling such global inputs to km-scale, as a cost-effective machine learning alternative. The model is trained to predict 2km data from a regional weather model over T…

    Submitted 11 August, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

  29. arXiv:2309.15164  [pdf, other]

    cs.CV cs.AI

    3D Reconstruction with Generalizable Neural Fields using Scene Priors

    Authors: Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu

    Abstract: High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This is not scalable, inefficient, and unable to yield good results given limited views. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting…

    Submitted 28 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Project Page: https://oasisyang.github.io/neural-prior

  30. arXiv:2308.15462  [pdf, other]

    cs.CV

    Online Overexposed Pixels Hallucination in Videos with Adaptive Reference Frame Selection

    Authors: Yazhou Xing, Amrita Mazumdar, Anjul Patney, Chao Liu, Hongxu Yin, Qifeng Chen, Jan Kautz, Iuri Frosio

    Abstract: Low dynamic range (LDR) cameras cannot deal with wide dynamic range inputs, frequently leading to local overexposure issues. We present a learning-based system to reduce these artifacts without resorting to complex acquisition mechanisms like alternating exposures or costly processing that are typical of high dynamic range (HDR) imaging. We propose a transformer-based deep neural network (DNN) to…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: The demo video can be found at https://drive.google.com/file/d/1-r12BKImLOYCLUoPzdebnMyNjJ4Rk360/view

  31. arXiv:2307.01492  [pdf, other]

    cs.CV cs.RO

    FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

    Authors: Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, Jose M. Alvarez

    Abstract: This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and CVPR 23 Workshop on Vision-Centric Autonomous Driving Workshop. Our proposed solution FB-OCC builds upon FB-BEV, a cutting-edge camera-based bird's-eye view perception design using forward-backward projection.…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Outstanding Champion and Innovation Award in the 3D Occupancy Prediction Challenge (CVPR23)

  32. arXiv:2306.08768  [pdf, other]

    cs.CV

    Generalizable One-shot Neural Head Avatar

    Authors: Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, Jan Kautz

    Abstract: We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based o…

    Submitted 14 June, 2023; originally announced June 2023.

  33. arXiv:2306.08593  [pdf, other]

    cs.CV cs.LG

    Heterogeneous Continual Learning

    Authors: Divyam Madaan, Hongxu Yin, Wonmin Byeon, Jan Kautz, Pavlo Molchanov

    Abstract: We propose a novel framework and a solution to tackle the continual learning (CL) problem with changing network architectures. Most CL methods focus on adapting a single architecture to a new task/class by modifying its weights. However, with rapid progress in architecture design, the problem of adapting existing solutions to novel architectures becomes relevant. To address this limitation, we pro…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted to CVPR 2023

  34. arXiv:2306.06189  [pdf, other]

    cs.CV cs.AI cs.LG

    FasterViT: Fast Vision Transformers with Hierarchical Attention

    Authors: Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

    Abstract: We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-…

    Submitted 1 April, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

    Comments: ICLR'24 Accepted Paper

  35. arXiv:2306.00200  [pdf, other]

    cs.CV

    Zero-shot Pose Transfer for Unrigged Stylized 3D Characters

    Authors: Jiashun Wang, Xueting Li, Sifei Liu, Shalini De Mello, Orazio Gallo, Xiaolong Wang, Jan Kautz

    Abstract: Transferring the pose of a reference avatar to stylized 3D characters of various shapes is a fundamental task in computer graphics. Existing methods either require the stylized characters to be rigged, or they use the stylized character in the desired pose as ground truth at training. We present a zero-shot approach that requires only the widely available deformed non-stylized avatars in training,…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: CVPR 2023

  36. arXiv:2305.14188  [pdf, other]

    cs.LG cs.CR cs.CV

    The Best Defense is a Good Offense: Adversarial Augmentation against Adversarial Attacks

    Authors: Iuri Frosio, Jan Kautz

    Abstract: Many defenses against adversarial attacks (e.g., robust classifiers, randomization, or image purification) use countermeasures put to work only after the attack has been crafted. We adopt a different perspective to introduce $A^5$ (Adversarial Augmentation Against Adversarial Attacks), a novel framework including the first certified preemptive defense against adversarial attacks. The main idea is to…

    Submitted 23 May, 2023; originally announced May 2023.

    Journal ref: CVPR 2023

  37. arXiv:2305.04391  [pdf, other]

    cs.LG cs.CV math.NA stat.ML

    A Variational Perspective on Solving Inverse Problems with Diffusion Models

    Authors: Morteza Mardani, Jiaming Song, Jan Kautz, Arash Vahdat

    Abstract: Diffusion models have emerged as a key pillar of foundation models in visual domains. One of their critical applications is to universally solve different downstream inverse tasks via a single diffusion prior without re-training for each task. Most inverse tasks can be formulated as inferring a posterior distribution over data (e.g., a full image) given a measurement (e.g., a masked image). This i…

    Submitted 29 September, 2023; v1 submitted 7 May, 2023; originally announced May 2023.

  38. Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization

    Authors: Connor Z. Lin, Koki Nagano, Jan Kautz, Eric R. Chan, Umar Iqbal, Leonidas Guibas, Gordon Wetzstein, Sameh Khamis

    Abstract: There is a growing demand for the accessible creation of high-quality 3D avatars that are animatable and customizable. Although 3D morphable models provide intuitive control for editing and animation, and robustness for single-view face reconstruction, they cannot easily capture geometric and appearance details. Methods based on neural implicit representations, such as signed distance functions (S…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: SIGGRAPH 2023, Project Page: https://research.nvidia.com/labs/toronto-ai/ssif

  39. arXiv:2304.00600  [pdf, other]

    cs.CV cs.LG

    Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models

    Authors: Paul Micaelli, Arash Vahdat, Hongxu Yin, Jan Kautz, Pavlo Molchanov

    Abstract: Cascaded computation, whereby predictions are recurrently refined over several stages, has been a persistent theme throughout the development of landmark detection models. In this work, we show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation. Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the challenging WFLW facial lan…

    Submitted 2 April, 2023; originally announced April 2023.

  40. arXiv:2303.14158  [pdf, other]

    cs.CV cs.AI cs.GR cs.RO

    BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

    Authors: Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan Kautz, Stan Birchfield

    Abstract: We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence, while simultaneously performing neural 3D reconstruction of the object. Our method works for arbitrary rigid objects, even when visual texture is largely absent. The object is assumed to be segmented in the first frame only. No additional information is required, and no assumption is ma…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  41. arXiv:2302.07400  [pdf, other]

    cs.LG math.FA stat.ML

    Score-based Diffusion Models in Function Space

    Authors: Jae Hyun Lim, Nikola B. Kovachki, Ricardo Baptista, Christopher Beckham, Kamyar Azizzadenesheli, Jean Kossaifi, Vikram Voleti, Jiaming Song, Karsten Kreis, Jan Kautz, Christopher Pal, Arash Vahdat, Anima Anandkumar

    Abstract: Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many…

    Submitted 22 November, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: 52 pages

    MSC Class: 46B09 (Primary); 60J22 (Secondary)

    ACM Class: I.2.6; J.2

  42. arXiv:2212.03237  [pdf, other]

    cs.CV

    RANA: Relightable Articulated Neural Avatars

    Authors: Umar Iqbal, Akin Caliskan, Koki Nagano, Sameh Khamis, Pavlo Molchanov, Jan Kautz

    Abstract: We propose RANA, a relightable and articulated neural avatar for the photorealistic synthesis of humans under arbitrary viewpoints, body poses, and lighting. We only require a short video clip of the person to create the avatar and assume no knowledge about the lighting environment. We present a novel framework to model humans while disentangling their geometry, texture, and also lighting environm…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: project page: https://nvlabs.github.io/RANA/

  43. arXiv:2212.02500  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    PhysDiff: Physics-Guided Human Motion Diffusion Model

    Authors: Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, Jan Kautz

    Abstract: Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically-implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limit…

    Submitted 18 August, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

    Comments: ICCV 2023 (Oral). Project page: https://nvlabs.github.io/PhysDiff

  44. arXiv:2209.10510  [pdf, other]

    cs.CV cs.GR cs.LG

    Learning to Relight Portrait Images via a Virtual Light Stage and Synthetic-to-Real Adaptation

    Authors: Yu-Ying Yeh, Koki Nagano, Sameh Khamis, Jan Kautz, Ming-Yu Liu, Ting-Chun Wang

    Abstract: Given a portrait image of a person and an environment map of the target lighting, portrait relighting aims to re-illuminate the person in the image as if the person appeared in an environment with the target lighting. To achieve high-quality results, recent methods rely on deep learning. An effective approach is to supervise the training of deep neural networks with a high-fidelity dataset of desi…

    Submitted 10 August, 2023; v1 submitted 21 September, 2022; originally announced September 2022.

    Comments: To appear in ACM Transactions on Graphics (SIGGRAPH Asia 2022). 21 pages, 25 figures, 7 tables. Project page: https://research.nvidia.com/labs/dir/lumos/

    Journal ref: ACM Trans. Graph. 41, 6, Article 231 (December 2022), 21 pages

  45. arXiv:2208.09480  [pdf, other]

    cs.CV

    Neural Light Field Estimation for Street Scenes with Differentiable Virtual Object Insertion

    Authors: Zian Wang, Wenzheng Chen, David Acuna, Jan Kautz, Sanja Fidler

    Abstract: We consider the challenging problem of outdoor lighting estimation for the goal of photorealistic virtual object insertion into photographs. Existing works on outdoor lighting estimation typically simplify the scene lighting into an environment map which cannot capture the spatially-varying lighting effects in outdoor scenes. In this work, we propose a neural approach that estimates the 5D HDR lig…

    Submitted 19 August, 2022; originally announced August 2022.

    Comments: Webpage: https://nv-tlabs.github.io/outdoor-ar/

    Journal ref: ECCV 2022

  46. arXiv:2206.09959  [pdf, other]

    cs.CV cs.AI cs.LG

    Global Context Vision Transformers

    Authors: Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

    Abstract: We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, joint with standard local self-attention, to effectively and efficiently model both long and short-range spatial interactions, without the need for expensive operations such as computing attentio…

    Submitted 6 June, 2023; v1 submitted 20 June, 2022; originally announced June 2022.

    Comments: Accepted to ICML 2023

  47. arXiv:2205.07058  [pdf, other]

    cs.CV

    RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis

    Authors: Jonathan Tremblay, Moustafa Meshry, Alex Evans, Jan Kautz, Alexander Keller, Sameh Khamis, Thomas Müller, Charles Loop, Nathan Morrical, Koki Nagano, Towaki Takikawa, Stan Birchfield

    Abstract: We present a large-scale synthetic dataset for novel view synthesis consisting of ~300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 x 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark for both training and evaluation. Using 4 distinct…

    Submitted 24 October, 2022; v1 submitted 14 May, 2022; originally announced May 2022.

    Comments: ECCV 2022 Workshop on Learning to Generate 3D Shapes and Scenes. Project page at http://www.cs.umd.edu/~mmeshry/projects/rtmv

  48. arXiv:2203.16521  [pdf, other]

    cs.CV

    CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs

    Authors: Jiteng Mu, Shalini De Mello, Zhiding Yu, Nuno Vasconcelos, Xiaolong Wang, Jan Kautz, Sifei Liu

    Abstract: Recent advances show that Generative Adversarial Networks (GANs) can synthesize images with smooth variations along semantically meaningful latent directions, such as pose, expression, layout, etc. While this indicates that GANs implicitly learn pixel-level correspondences across images, few studies explored how to extract them explicitly. In this work, we introduce Coordinate GAN (CoordGAN), a st…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Project page: https://jitengmu.github.io/CoordGAN/

  49. arXiv:2203.15798  [pdf, other]

    cs.CV

    DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars

    Authors: Amit Raj, Umar Iqbal, Koki Nagano, Sameh Khamis, Pavlo Molchanov, James Hays, Jan Kautz

    Abstract: Acquisition and creation of digital human avatars is an important problem with applications to virtual telepresence, gaming, and human modeling. Most contemporary approaches for avatar generation can be viewed either as 3D-based methods, which use multi-view data to learn a 3D representation with appearance (such as a mesh, implicit surface, or volume), or 2D-based methods which learn photo-realis…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Project page at https://dracon-avatars.github.io/

  50. arXiv:2203.11894  [pdf, other]

    cs.CV cs.AI cs.CR cs.DC cs.LG

    GradViT: Gradient Inversion of Vision Transformers

    Authors: Ali Hatamizadeh, Hongxu Yin, Holger Roth, Wenqi Li, Jan Kautz, Daguang Xu, Pavlo Molchanov

    Abstract: In this work we demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks. During this attack, the original data batch is reconstructed given model weights and the corresponding gradients. We introduce a method, named GradViT, that optimizes random noise into naturally looking images via an iterative process. The optimization objective consists of (i) a loss o…

    Submitted 27 March, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: CVPR'22 Accepted Paper