Search | arXiv e-print repository

arXiv:2407.18309 [pdf]

Adaptive Terminal Sliding Mode Control Using Deep Reinforcement Learning for Zero-Force Control of Exoskeleton Robot Systems

Authors: Morteza Mirzaee, Reza Kazemi

Abstract: This paper introduces a novel zero-force control method for upper-limb exoskeleton robots, which are used in a variety of applications including rehabilitation, assistance, and human physical capability enhancement. The proposed control method employs an Adaptive Integral Terminal Sliding Mode (AITSM) controller, combined with an exponential reaching law and Proximal Policy Optimization (PPO), a t… ▽ More This paper introduces a novel zero-force control method for upper-limb exoskeleton robots, which are used in a variety of applications including rehabilitation, assistance, and human physical capability enhancement. The proposed control method employs an Adaptive Integral Terminal Sliding Mode (AITSM) controller, combined with an exponential reaching law and Proximal Policy Optimization (PPO), a type of Deep Reinforcement Learning (DRL). The PPO system incorporates an attention mechanism and Long Short-Term Memory (LSTM) neural networks, enabling the controller to selectively focus on relevant system states, adapt to changing behavior, and capture long-term dependencies. This controller is designed to manage a 5-DOF upper-limb exoskeleton robot with zero force, even amidst system uncertainties. The controller uses an integral terminal sliding surface to ensure finite-time convergence to the desired state, a crucial feature for applications requiring quick responses. It also includes an exponential switching control term to reduce chattering and improve system accuracy. The controller's adaptability, facilitated by the PPO system, allows real-time parameter adjustments based on system feedback, making the controller robust and capable of dealing with uncertainties and disturbances that could affect the performance of the exoskeleton. The proposed control method's effectiveness and superiority are confirmed through numerical simulations and comparisons with existing control methods. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.06315 [pdf, other]

Shedding More Light on Robust Classifiers under the lens of Energy-based Models

Authors: Mujtaba Hussain Mirza, Maria Rosaria Briglia, Senad Beadini, Iacopo Masi

Abstract: By reinterpreting a robust discriminative classifier as Energy-based Model (EBM), we offer a new take on the dynamics of adversarial training (AT). Our analysis of the energy landscape during AT reveals that untargeted attacks generate adversarial images much more in-distribution (lower energy) than the original data from the point of view of the model. Conversely, we observe the opposite for targ… ▽ More By reinterpreting a robust discriminative classifier as Energy-based Model (EBM), we offer a new take on the dynamics of adversarial training (AT). Our analysis of the energy landscape during AT reveals that untargeted attacks generate adversarial images much more in-distribution (lower energy) than the original data from the point of view of the model. Conversely, we observe the opposite for targeted attacks. On the ground of our thorough analysis, we present new theoretical and practical results that show how interpreting AT energy dynamics unlocks a better understanding: (1) AT dynamic is governed by three phases and robust overfitting occurs in the third phase with a drastic divergence between natural and adversarial energies (2) by rewriting the loss of TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES) in terms of energies, we show that TRADES implicitly alleviates overfitting by means of aligning the natural energy with the adversarial one (3) we empirically show that all recent state-of-the-art robust classifiers are smoothing the energy landscape and we reconcile a variety of studies about understanding AT and weighting the loss function under the umbrella of EBMs. Motivated by rigorous evidence, we propose Weighted Energy Adversarial Training (WEAT), a novel sample weighting scheme that yields robust accuracy matching the state-of-the-art on multiple benchmarks such as CIFAR-10 and SVHN and going beyond in CIFAR-100 and Tiny-ImageNet. We further show that robust classifiers vary in the intensity and quality of their generative capabilities, and offer a simple method to push this capability, reaching a remarkable Inception Score (IS) and FID using a robust classifier without training for generative modeling. The code to reproduce our results is available at http://github.com/OmnAI-Lab/Robust-Classifiers-under-the-lens-of-EBM/ . △ Less

Submitted 11 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted at European Conference on Computer Vision (ECCV) 2024

arXiv:2406.09240 [pdf, other]

Comparison Visual Instruction Tuning

Authors: Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky

Abstract: Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attent… ▽ More Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, allowing automatic targeted refinement of those resources increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: Project page: https://wlin-at.github.io/cad_vi ; Huggingface dataset repo: https://huggingface.co/datasets/wlin21at/CaD-Inst

arXiv:2406.08164 [pdf, other]

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Authors: Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuhene, Trevor Darrel, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky

Abstract: Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmark… ▽ More Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: The first three authors contributed equally

arXiv:2406.06638 [pdf, other]

Particle Multi-Axis Transformer for Jet Tagging

Authors: Muhammad Usman, M Husnain Shahid, Maheen Ejaz, Ummay Hani, Nayab Fatima, Abdul Rehman Khan, Asifullah Khan, Nasir Majid Mirza

Abstract: Jet tagging is an essential categorization problem in high energy physics. In recent times, Deep Learning has not only risen to the challenge of jet tagging but also significantly improved its performance. In this article, we proposed an idea of a new architecture, Particle Multi-Axis transformer (ParMAT) which is a modified version of Particle transformer (ParT). ParMAT contains local and global… ▽ More Jet tagging is an essential categorization problem in high energy physics. In recent times, Deep Learning has not only risen to the challenge of jet tagging but also significantly improved its performance. In this article, we proposed an idea of a new architecture, Particle Multi-Axis transformer (ParMAT) which is a modified version of Particle transformer (ParT). ParMAT contains local and global spatial interactions within a single unit which improves its ability to handle various input lengths. We trained our model on JETCLASS, a publicly available large dataset that contains 100M jets of 10 different classes of particles. By integrating a parallel attention mechanism and pairwise interactions of particles in the attention mechanism, ParMAT achieves robustness and higher accuracy over the ParT and ParticleNet. The scalability of the model to huge datasets and its ability to automatically extract essential features demonstrate its potential for enhancing jet tagging. △ Less

Submitted 16 July, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

arXiv:2404.10534 [pdf, other]

Into the Fog: Evaluating Multiple Object Tracking Robustness

Authors: Nadezda Kirillova, M. Jehanzeb Mirza, Horst Possegger, Horst Bischof

Abstract: State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a… ▽ More State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a pipeline for physic-based volumetric fog simulation in arbitrary real-world MOT dataset utilizing frame-by-frame monocular depth estimation and a fog formation optical model. Moreover, we enhance our simulation by rendering of both homogeneous and heterogeneous fog effects. We propose to use the dark channel prior method to estimate fog (smoke) color, which shows promising results even in night and indoor scenes. We present the leading tracking benchmark MOTChallenge (MOT17 dataset) overlaid by fog (smoke for indoor scenes) of various intensity levels and conduct a comprehensive evaluation of SOTA MOT methods, revealing their limitations under fog and fog-similar challenges. △ Less

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2403.12736 [pdf, other]

Towards Multimodal In-Context Learning for Vision & Language Models

Authors: Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

Abstract: State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (eg image captioning, question answers, etc), still little emphasis… ▽ More State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (eg image captioning, question answers, etc), still little emphasis has been put on transferring one of the core LLM capability of In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task with a few examples demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that the state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of the present VLM, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines and a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art. △ Less

Submitted 17 July, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.11755 [pdf, other]

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing th… ▽ More Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively △ Less

Submitted 7 August, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: ECCV Camera Ready. Code & Data: https://jmiemirza.github.io/Meta-Prompting/

arXiv:2403.11691 [pdf, other]

TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models

Authors: Lisa Weijler, Muhammad Jehanzeb Mirza, Leon Sick, Can Ekkazan, Pedro Hermosilla

Abstract: Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g. DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D)… ▽ More Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g. DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone for the main task of semantic segmentation using the pointclouds and the task of 2D $\to$ 3D KD by using an off-the-shelf 2D pre-trained foundation model. At test-time, our TTT-KD updates the 3D segmentation backbone for each test sample, by using the self-supervised task of knowledge distillation, before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (ODO) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar and up to 45% (20% on average) when adapting to OOD test samples. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.09193 [pdf, other]

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper

Abstract: Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over to image captioning, and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models en… ▽ More Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over to image captioning, and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs. △ Less

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2309.06809 [pdf, other]

TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof

Abstract: Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been… ▽ More Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be significantly further improved taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines. △ Less

Submitted 13 September, 2023; originally announced September 2023.

Comments: Code is available at: https://github.com/jmiemirza/TAP

arXiv:2305.18953 [pdf, other]

Sit Back and Relax: Learning to Drive Incrementally in All Weather Conditions

Authors: Stefan Leitner, M. Jehanzeb Mirza, Wei Lin, Jakub Micorek, Marc Masana, Mateusz Kozinski, Horst Possegger, Horst Bischof

Abstract: In autonomous driving scenarios, current object detection models show strong performance when tested in clear weather. However, their performance deteriorates significantly when tested in degrading weather conditions. In addition, even when adapted to perform robustly in a sequence of different weather conditions, they are often unable to perform well in all of them and suffer from catastrophic fo… ▽ More In autonomous driving scenarios, current object detection models show strong performance when tested in clear weather. However, their performance deteriorates significantly when tested in degrading weather conditions. In addition, even when adapted to perform robustly in a sequence of different weather conditions, they are often unable to perform well in all of them and suffer from catastrophic forgetting. To efficiently mitigate forgetting, we propose Domain-Incremental Learning through Activation Matching (DILAM), which employs unsupervised feature alignment to adapt only the affine parameters of a clear weather pre-trained network to different weather conditions. We propose to store these affine parameters as a memory bank for each weather condition and plug-in their weather-specific parameters during driving (i.e. test time) when the respective weather conditions are encountered. Our memory bank is extremely lightweight, since affine parameters account for less than 2% of a typical object detector. Furthermore, contrary to previous domain-incremental learning approaches, we do not require the weather label when testing and propose to automatically infer the weather condition by a majority voting linear classifier. △ Less

Submitted 30 May, 2023; originally announced May 2023.

Comments: Intelligent Vehicle Conference (oral presentation)

arXiv:2305.18287 [pdf, other]

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Mateusz Kozinski, Horst Possegger, Rogerio Feris, Horst Bischof

Abstract: Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification enabling open-vocabulary recognition of potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zeroshot classifiers still falls short of the results of dedicated (closed categ… ▽ More Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification enabling open-vocabulary recognition of potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zeroshot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision. △ Less

Submitted 23 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023 (Camera Ready) - Project Page: https://jmiemirza.github.io/LaFTer/

arXiv:2211.15393 [pdf, other]

Video Test-Time Adaptation for Action Recognition

Authors: Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof

Abstract: Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal mode… ▽ More Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists in a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance on both, the state of the art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts. Code will be available at \url{https://github.com/wlin-at/ViTTA}. △ Less

Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted at CVPR 2023

arXiv:2211.12870 [pdf, other]

ActMAD: Activation Matching to Align Distributions for Test-Time-Training

Authors: Muhammad Jehanzeb Mirza, Pol Jané Soneira, Wei Lin, Mateusz Kozinski, Horst Possegger, Horst Bischof

Abstract: Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the… ▽ More Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in a more fine-grained supervision and makes ActMAD attain state of the art performance on CIFAR-100C and Imagenet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification, and score 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance. △ Less

Submitted 23 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

Comments: CVPR 2023 - Project Page: https://jmiemirza.github.io/ActMAD/

arXiv:2211.11432 [pdf, other]

MATE: Masked Autoencoders are Online 3D Test-Time Learners

Authors: M. Jehanzeb Mirza, Inkyu Shin, Wei Lin, Andreas Schriebl, Kunyang Sun, Jaesung Choe, Horst Possegger, Mateusz Kozinski, In So Kweon, Kun-Jin Yoon, Horst Bischof

Abstract: Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is remove… ▽ More Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, making it extremely lightweight. Our experiments show that MATE also achieves competitive performance by adapting sparsely on the test data, which further reduces its computational overhead, making it ideal for real-time applications. △ Less

Submitted 20 March, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: Code is available at this repository: https://github.com/jmiemirza/MATE

arXiv:2211.05854 [pdf, other]

Test-time adversarial detection and robustness for localizing humans using ultra wide band channel impulse responses

Authors: Abhiram Kolli, Muhammad Jehanzeb Mirza, Horst Possegger, Horst Bischof

Abstract: Keyless entry systems in cars are adopting neural networks for localizing its operators. Using test-time adversarial defences equip such systems with the ability to defend against adversarial attacks without prior training on adversarial samples. We propose a test-time adversarial example detector which detects the input adversarial example through quantifying the localized intermediate responses… ▽ More Keyless entry systems in cars are adopting neural networks for localizing its operators. Using test-time adversarial defences equip such systems with the ability to defend against adversarial attacks without prior training on adversarial samples. We propose a test-time adversarial example detector which detects the input adversarial example through quantifying the localized intermediate responses of a pre-trained neural network and confidence scores of an auxiliary softmax layer. Furthermore, in order to make the network robust, we extenuate the non-relevant features by non-iterative input sample clipping. Using our approach, mean performance over 15 levels of adversarial perturbations is increased by 55.33% for the fast gradient sign method (FGSM) and 6.3% for both the basic iterative method (BIM) and the projected gradient method (PGD). △ Less

Submitted 10 November, 2022; originally announced November 2022.

Comments: 5 pages, 4 figures, ICASSP Conference

arXiv:2204.08817 [pdf, other]

An Efficient Domain-Incremental Learning Approach to Drive in All Weather Conditions

Authors: M. Jehanzeb Mirza, Marc Masana, Horst Possegger, Horst Bischof

Abstract: Although deep neural networks enable impressive visual perception performance for autonomous driving, their robustness to varying weather conditions still requires attention. When adapting these models for changed environments, such as different weather conditions, they are prone to forgetting previously learned information. This catastrophic forgetting is typically addressed via incremental learn… ▽ More Although deep neural networks enable impressive visual perception performance for autonomous driving, their robustness to varying weather conditions still requires attention. When adapting these models for changed environments, such as different weather conditions, they are prone to forgetting previously learned information. This catastrophic forgetting is typically addressed via incremental learning approaches which usually re-train the model by either keeping a memory bank of training samples or keeping a copy of the entire model or model parameters for each scenario. While these approaches show impressive results, they can be prone to scalability issues and their applicability for autonomous driving in all weather conditions has not been shown. In this paper we propose DISC -- Domain Incremental through Statistical Correction -- a simple online zero-forgetting approach which can incrementally learn new tasks (i.e weather conditions) without requiring re-training or expensive memory banks. The only information we store for each task are the statistical parameters as we categorize each domain by the change in first and second order statistics. Thus, as each task arrives, we simply 'plug and play' the statistical vectors for the corresponding task into the model and it immediately starts to perform well on that task. We show the efficacy of our approach by testing it for object detection in a challenging domain-incremental autonomous driving scenario where we encounter different adverse weather conditions, such as heavy rain, fog, and snow. △ Less

Submitted 21 April, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Accepted to CVPR Workshops - Camera Ready Version

arXiv:2202.08417 [pdf, other]

Retrieval-Augmented Reinforcement Learning

Authors: Anirudh Goyal, Abram L. Friesen, Andrea Banino, Theophane Weber, Nan Rosemary Ke, Adria Puigdomenech Badia, Arthur Guez, Mehdi Mirza, Peter C. Humphreys, Ksenia Konyushkova, Laurent Sifre, Michal Valko, Simon Osindero, Timothy Lillicrap, Nicolas Heess, Charles Blundell

Abstract: Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the… ▽ More Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. he proposed method facilitates learning agents that at test-time can condition their behavior on the entire dataset and not only the current state, or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method. △ Less

Submitted 24 May, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

arXiv:2112.00463 [pdf, other]

The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization

Authors: M. Jehanzeb Mirza, Jakub Micorek, Horst Possegger, Horst Bischof

Abstract: Domain adaptation is crucial to adapt a learned model to new scenarios, such as domain shifts or changing data distributions. Current approaches usually require a large amount of labeled or unlabeled data from the shifted domain. This can be a hurdle in fields which require continuous dynamic adaptation or suffer from scarcity of data, e.g. autonomous driving in challenging weather conditions. To… ▽ More Domain adaptation is crucial to adapt a learned model to new scenarios, such as domain shifts or changing data distributions. Current approaches usually require a large amount of labeled or unlabeled data from the shifted domain. This can be a hurdle in fields which require continuous dynamic adaptation or suffer from scarcity of data, e.g. autonomous driving in challenging weather conditions. To address this problem of continuous adaptation to distribution shifts, we propose Dynamic Unsupervised Adaptation (DUA). By continuously adapting the statistics of the batch normalization layers we modify the feature representations of the model. We show that by sequentially adapting a model with only a fraction of unlabeled data, a strong performance gain can be achieved. With even less than 1% of unlabeled data from the target domain, DUA already achieves competitive results to strong baselines. In addition, the computational overhead is minimal in contrast to previous approaches. Our approach is simple, yet effective and can be applied to any architecture which uses batch normalization as one of its components. We show the utility of DUA by evaluating it on a variety of domain adaptation datasets and tasks including object recognition, digit recognition and object detection. △ Less

Submitted 4 April, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: Accepted to CVPR 2022 - Camera Ready Version - Code: https://github.com/jmiemirza/DUA

arXiv:2110.03363 [pdf, other]

Evaluating model-based planning and planner amortization for continuous control

Authors: Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller

Abstract: There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC.… ▽ More There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We find that well-tuned model-free agents are strong baselines even for high DoF control problems but MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency in hard multi-task/multi-goal settings. Finally, we show that it is possible to distil a model-based planner into a policy that amortizes the planning computation without any loss of performance. Videos of agents performing different tasks can be seen at https://sites.google.com/view/mbrl-amortization/home. △ Less

Submitted 7 October, 2021; originally announced October 2021.

Comments: 9 pages main text, 30 pages with references and appendix including several ablations and additional experiments. Submitted to ICLR 2022

arXiv:2106.08795 [pdf, other]

Robustness of Object Detectors in Degrading Weather Conditions

Authors: Muhammad Jehanzeb Mirza, Cornelius Buerkle, Julio Jarquin, Michael Opitz, Fabian Oboril, Kay-Ulrich Scholl, Horst Bischof

Abstract: State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions. However, such autonomous safety critical systems also need to work in degrading weather conditions, such as rain, fog and snow. Unfortunately, most approaches evaluate only on the KITTI dataset, which consists only of clear weather scenes. In this paper we address this issue and… ▽ More State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions. However, such autonomous safety critical systems also need to work in degrading weather conditions, such as rain, fog and snow. Unfortunately, most approaches evaluate only on the KITTI dataset, which consists only of clear weather scenes. In this paper we address this issue and perform one of the most detailed evaluation on single and dual modality architectures on data captured in real weather conditions. We analyse the performance degradation of these architectures in degrading weather conditions. We demonstrate that an object detection architecture performing good in clear weather might not be able to handle degrading weather conditions. We also perform ablation studies on the dual modality architectures and show their limitations. △ Less

Submitted 16 June, 2021; originally announced June 2021.

Comments: Accepted for publication at ITSC 2021

arXiv:2010.01298 [pdf, other]

Beyond Tabula-Rasa: a Modular Reinforcement Learning Approach for Physically Embedded 3D Sokoban

Authors: Peter Karkus, Mehdi Mirza, Arthur Guez, Andrew Jaegle, Timothy Lillicrap, Lars Buesing, Nicolas Heess, Theophane Weber

Abstract: Intelligent robots need to achieve abstract objectives using concrete, spatiotemporally complex sensory information and motor control. Tabula rasa deep reinforcement learning (RL) has tackled demanding tasks in terms of either visual, abstract, or physical reasoning, but solving these jointly remains a formidable challenge. One recent, unsolved benchmark task that integrates these challenges is Mu… ▽ More Intelligent robots need to achieve abstract objectives using concrete, spatiotemporally complex sensory information and motor control. Tabula rasa deep reinforcement learning (RL) has tackled demanding tasks in terms of either visual, abstract, or physical reasoning, but solving these jointly remains a formidable challenge. One recent, unsolved benchmark task that integrates these challenges is Mujoban, where a robot needs to arrange 3D warehouses generated from 2D Sokoban puzzles. We explore whether integrated tasks like Mujoban can be solved by composing RL modules together in a sense-plan-act hierarchy, where modules have well-defined roles similarly to classic robot architectures. Unlike classic architectures that are typically model-based, we use only model-free modules trained with RL or supervised learning. We find that our modular RL approach dramatically outperforms the state-of-the-art monolithic RL agent on Mujoban. Further, learned modules can be reused when, e.g., using a different robot platform to solve the same task. Together our results give strong evidence for the importance of research into modular RL designs. Project website: https://sites.google.com/view/modular-rl/ △ Less

Submitted 3 October, 2020; originally announced October 2020.

arXiv:2009.05524 [pdf, other]

Physically Embedded Planning Problems: New Challenges for Reinforcement Learning

Authors: Mehdi Mirza, Andrew Jaegle, Jonathan J. Hunt, Arthur Guez, Saran Tunyasuvunakool, Alistair Muldal, Théophane Weber, Peter Karkus, Sébastien Racanière, Lars Buesing, Timothy Lillicrap, Nicolas Heess

Abstract: Recent work in deep reinforcement learning (RL) has produced algorithms capable of mastering challenging games such as Go, chess, or shogi. In these works the RL agent directly observes the natural state of the game and controls that state directly with its actions. However, when humans play such games, they do not just reason about the moves but also interact with their physical environment. They… ▽ More Recent work in deep reinforcement learning (RL) has produced algorithms capable of mastering challenging games such as Go, chess, or shogi. In these works the RL agent directly observes the natural state of the game and controls that state directly with its actions. However, when humans play such games, they do not just reason about the moves but also interact with their physical environment. They understand the state of the game by looking at the physical board in front of them and modify it by manipulating pieces using touch and fine-grained motor control. Mastering complicated physical systems with abstract goals is a central challenge for artificial intelligence, but it remains out of reach for existing RL algorithms. To encourage progress towards this goal we introduce a set of physically embedded planning problems and make them publicly available. We embed challenging symbolic tasks (Sokoban, tic-tac-toe, and Go) in a physics engine to produce a set of tasks that require perception, reasoning, and motor control over long time horizons. Although existing RL algorithms can tackle the symbolic versions of these tasks, we find that they struggle to master even the simplest of their physically embedded counterparts. As a first step towards characterizing the space of solution to these tasks, we introduce a strong baseline that uses a pre-trained expert game player to provide hints in the abstract space to an RL agent's policy while training it on the full sensorimotor control task. The resulting agent solves many of the tasks, underlining the need for methods that bridge the gap between abstract planning and embodied control. See illustrating video at https://youtu.be/RwHiHlym_1k. △ Less

Submitted 29 October, 2020; v1 submitted 11 September, 2020; originally announced September 2020.

Comments: 17 pages + appendix. Updated text and references

arXiv:2006.03531 [pdf, other]

A Meta-Bayesian Model of Intentional Visual Search

Authors: Maell Cullen, Jonathan Monney, M. Berk Mirza, Rosalyn Moran

Abstract: We propose a computational model of visual search that incorporates Bayesian interpretations of the neural mechanisms that underlie categorical perception and saccade planning. To enable meaningful comparisons between simulated and human behaviours, we employ a gaze-contingent paradigm that required participants to classify occluded MNIST digits through a window that followed their gaze. The condi… ▽ More We propose a computational model of visual search that incorporates Bayesian interpretations of the neural mechanisms that underlie categorical perception and saccade planning. To enable meaningful comparisons between simulated and human behaviours, we employ a gaze-contingent paradigm that required participants to classify occluded MNIST digits through a window that followed their gaze. The conditional independencies imposed by a separation of time scales in this task are embodied by constraints on the hierarchical structure of our model; planning and decision making are cast as a partially observable Markov Decision Process while proprioceptive and exteroceptive signals are integrated by a dynamic model that facilitates approximate inference on visual information and its latent causes. Our model is able to recapitulate human behavioural metrics such as classification accuracy while retaining a high degree of interpretability, which we demonstrate by recovering subject-specific parameters from observed human behaviour. △ Less

Submitted 5 June, 2020; originally announced June 2020.

Comments: Submitted to NeurIPS 2020

arXiv:2002.01324 [pdf]

doi 10.1016/j.physleta.2020.126287

Mean Field Theory for the Quantum Rabi Model, Inconsistency to the Rotating Wave Approximation

Authors: Ghasem Asadi Cordshooli, Mehdi Mirzaee

Abstract: Considering well localized atom, the mean field theory (MFT) was applied to replace the operators by equivalent expectation values. The Rabi model was reduced to a fourth orders NDE describing atoms position. Solution by the harmonic balance method (HBM) showed good accuracy and consistency to the numerical results, which introduces it as a useful tool in the quantum dynamics studies. Considering well localized atom, the mean field theory (MFT) was applied to replace the operators by equivalent expectation values. The Rabi model was reduced to a fourth orders NDE describing atoms position. Solution by the harmonic balance method (HBM) showed good accuracy and consistency to the numerical results, which introduces it as a useful tool in the quantum dynamics studies. △ Less

Submitted 2 February, 2020; originally announced February 2020.

arXiv:1910.10958 [pdf]

Malware Classification using Deep Learning based Feature Extraction and Wrapper based Feature Selection Technique

Authors: Muhammad Furqan Rafique, Muhammad Ali, Aqsa Saeed Qureshi, Asifullah Khan, Anwar Majid Mirza

Abstract: In the case of malware analysis, categorization of malicious files is an essential part after malware detection. Numerous static and dynamic techniques have been reported so far for categorizing malware. This research presents a deep learning-based malware detection (DLMD) technique based on static methods for classifying different malware families. The proposed DLMD technique uses both the byte a… ▽ More In the case of malware analysis, categorization of malicious files is an essential part after malware detection. Numerous static and dynamic techniques have been reported so far for categorizing malware. This research presents a deep learning-based malware detection (DLMD) technique based on static methods for classifying different malware families. The proposed DLMD technique uses both the byte and ASM files for feature engineering, thus classifying malware families. First, features are extracted from byte files using two different Deep Convolutional Neural Networks (CNN). After that, essential and discriminative opcode features are selected using a wrapper-based mechanism, where Support Vector Machine (SVM) is used as a classifier. The idea is to construct a hybrid feature space by combining the different feature spaces to overcome the shortcoming of particular feature space and thus, reduce the chances of missing a malware. Finally, the hybrid feature space is used to train a Multilayer Perceptron, which classifies all nine different malware families. Experimental results show that proposed DLMD technique achieves log-loss of 0.09 for ten independent runs. Moreover, the proposed DLMD technique's performance is compared against different classifiers and shows its effectiveness in categorizing malware. The relevant code and database can be found at https://github.com/cyberhunters/Malware-Detection-Using-Machine-Learning. △ Less

Submitted 26 December, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

Comments: 21 pages, 8 figures, 11 tables

arXiv:1901.03559 [pdf, other]

An investigation of model-free planning

Authors: Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap

Abstract: The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods have been propos… ▽ More The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods have been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning. △ Less

Submitted 20 May, 2019; v1 submitted 11 January, 2019; originally announced January 2019.

arXiv:1810.06721 [pdf, other]

Optimizing Agent Behavior over Long Time Scales by Transporting Value

Authors: Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, Greg Wayne

Abstract: Humans spend a remarkable fraction of waking life engaged in acts of "mental time travel". We dwell on our actions in the past and experience satisfaction or regret. More than merely autobiographical storytelling, we use these event recollections to change how we will act in similar scenarios in the future. This process endows us with a computationally important ability to link actions and consequ… ▽ More Humans spend a remarkable fraction of waking life engaged in acts of "mental time travel". We dwell on our actions in the past and experience satisfaction or regret. More than merely autobiographical storytelling, we use these event recollections to change how we will act in similar scenarios in the future. This process endows us with a computationally important ability to link actions and consequences across long spans of time, which figures prominently in addressing the problem of long-term temporal credit assignment; in artificial intelligence (AI) this is the question of how to evaluate the utility of the actions within a long-duration behavioral sequence leading to success or failure in a task. Existing approaches to shorter-term credit assignment in AI cannot solve tasks with long delays between actions and consequences. Here, we introduce a new paradigm for reinforcement learning where agents use recall of specific memories to credit actions from the past, allowing them to solve problems that are intractable for existing algorithms. This paradigm broadens the scope of problems that can be investigated in AI and offers a mechanistic account of behaviors that may inspire computational models in neuroscience, psychology, and behavioral economics. △ Less

Submitted 21 December, 2018; v1 submitted 15 October, 2018; originally announced October 2018.

arXiv:1804.01128 [pdf, other]

Probing Physics Knowledge Using Tools from Developmental Psychology

Authors: Luis Piloto, Ari Weinstein, Dhruva TB, Arun Ahuja, Mehdi Mirza, Greg Wayne, David Amos, Chia-chun Hung, Matt Botvinick

Abstract: In order to build agents with a rich understanding of their environment, one key objective is to endow them with a grasp of intuitive physics; an ability to reason about three-dimensional objects, their dynamic interactions, and responses to forces. While some work on this problem has taken the approach of building in components such as ready-made physics engines, other research aims to extract ge… ▽ More In order to build agents with a rich understanding of their environment, one key objective is to endow them with a grasp of intuitive physics; an ability to reason about three-dimensional objects, their dynamic interactions, and responses to forces. While some work on this problem has taken the approach of building in components such as ready-made physics engines, other research aims to extract general physical concepts directly from sensory data. In the latter case, one challenge that arises is evaluating the learning system. Research on intuitive physics knowledge in children has long employed a violation of expectations (VOE) method to assess children's mastery of specific physical concepts. We take the novel step of applying this method to artificial learning systems. In addition to introducing the VOE technique, we describe a set of probe datasets inspired by classic test stimuli from developmental psychology. We test a baseline deep learning system on this battery, as well as on a physics learning dataset ("IntPhys") recently posed by another research group. Our results show how the VOE technique may provide a useful tool for tracking physics knowledge in future research. △ Less

Submitted 3 April, 2018; originally announced April 2018.

arXiv:1803.10760 [pdf, other]

Unsupervised Predictive Memory in a Goal-Directed Agent

Authors: Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z. Leibo, Adam Santoro, Mevlana Gemici, Malcolm Reynolds, Tim Harley, Josh Abramson, Shakir Mohamed, Danilo Rezende, David Saxton, Adam Cain, Chloe Hillier, David Silver, Koray Kavukcuoglu, Matt Botvinick, Demis Hassabis, Timothy Lillicrap

Abstract: Animals execute goal-directed behaviours despite the limited range and scope of their sensors. To cope, they explore environments and store memories maintaining estimates of important information that is not presently available. Recently, progress has been made with artificial intelligence (AI) agents that learn to perform tasks from sensory input, even at a human level, by merging reinforcement l… ▽ More Animals execute goal-directed behaviours despite the limited range and scope of their sensors. To cope, they explore environments and store memories maintaining estimates of important information that is not presently available. Recently, progress has been made with artificial intelligence (AI) agents that learn to perform tasks from sensory input, even at a human level, by merging reinforcement learning (RL) algorithms with deep neural networks, and the excitement surrounding these results has led to the pursuit of related ideas as explanations of non-human animal learning. However, we demonstrate that contemporary RL algorithms struggle to solve simple tasks when enough information is concealed from the sensors of the agent, a property called "partial observability". An obvious requirement for handling partially observed tasks is access to extensive memory, but we show memory is not enough; it is critical that the right information be stored in the right format. We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences. △ Less

Submitted 28 March, 2018; originally announced March 2018.

arXiv:1711.07913 [pdf, ps, other]

doi 10.1109/KBEI.2017.8324889

Asymptotic Close To Optimal Joint Resource Allocation and Power Control in the Uplink of Two-cell Networks

Authors: Ata Khalili, Soroush Akhlaghi, Meysam Mirzaee

Abstract: In this paper, we investigate joint resource allocation and power control mechanisms for two-cell networks, where each cell has some sub-channels which should be allocated to some users. The main goal persuaded in the current work is finding the best power and sub-channel assignment strategies so that the associated sum-rate of network is maximized, while a minimum rate constraint is maintained by… ▽ More In this paper, we investigate joint resource allocation and power control mechanisms for two-cell networks, where each cell has some sub-channels which should be allocated to some users. The main goal persuaded in the current work is finding the best power and sub-channel assignment strategies so that the associated sum-rate of network is maximized, while a minimum rate constraint is maintained by each user. The underlying optimization problem is a highly non-convex mixed integer and non-linear problem which does not yield a trivial solution. In this regard, to tackle the problem, using an approximate function which is quite tight at moderate to high signal to interference plus noise ratio (SINR) region, the problem is divided into two disjoint sub-channel assignment and power allocation problems. It is shown that having fixed the allocated power of each user, the subchannel assignment can be thought as a well-known assignment problem which can be effectively solved using the so-called Hungarian method. Then, the power allocation is analytically derived. Furthermore, it is shown that the power can be chosen from two extremal points of the maximum available power or the minimum power satisfying the rate constraint. Numerical results demonstrate the superiority of the proposed approach over the random selection strategy as well as the method proposed in [3] which is regarded as the best known method addressed in the literature. △ Less

Submitted 21 November, 2017; originally announced November 2017.

arXiv:1612.03809 [pdf, other]

Generalizable Features From Unsupervised Learning

Authors: Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: Humans learn a predictive model of the world and use this model to reason about future events and the consequences of actions. In contrast to most machine predictors, we exhibit an impressive ability to generalize to unseen scenarios and reason intelligently in these settings. One important aspect of this ability is physical intuition(Lake et al., 2016). In this work, we explore the potential of u… ▽ More Humans learn a predictive model of the world and use this model to reason about future events and the consequences of actions. In contrast to most machine predictors, we exhibit an impressive ability to generalize to unseen scenarios and reason intelligently in these settings. One important aspect of this ability is physical intuition(Lake et al., 2016). In this work, we explore the potential of unsupervised learning to find features that promote better generalization to settings outside the supervised training distribution. Our task is predicting the stability of towers of square blocks. We demonstrate that an unsupervised model, trained to predict future frames of a video sequence of stable and unstable block configurations, can yield features that support extrapolating stability prediction to blocks configurations outside the training set distribution △ Less

Submitted 12 December, 2016; originally announced December 2016.

arXiv:1611.05839 [pdf, ps, other]

doi 10.1049/iet-com.2016.0470

Maximizing the minimum achievable secrecy rate of two-way relay networks using the null space beamforming method

Authors: Erfan khordad, Soroush Akhlaghi, Meysam Mirzaee

Abstract: This paper concerns maximizing the minimum achievable secrecy rate of a two-way relay network in the presence of an eavesdropper, in which two nodes aim to exchange messages in two hops, using a multi-antenna relay. Throughout the first hop, the two nodes simultaneously transmit their messages to the relay. In the second hop, the relay broadcasts a combination of the received information to the us… ▽ More This paper concerns maximizing the minimum achievable secrecy rate of a two-way relay network in the presence of an eavesdropper, in which two nodes aim to exchange messages in two hops, using a multi-antenna relay. Throughout the first hop, the two nodes simultaneously transmit their messages to the relay. In the second hop, the relay broadcasts a combination of the received information to the users such that the transmitted signal lies in the null space of the eavesdropper's channel; this is called null space beamforming (NSBF). The best NSBF matrix for maximizing the minimum achievable secrecy rate is studied, showing that the problem is not convex in general. To address this issue, the problem is divided into three sub-problems: a close-to-optimal solution is derived by using the semi-definite relaxation (SDR) technique. Simulation results demonstrate the superiority of the proposed method w.r.t. the most well-known method addressed in the literature. △ Less

Submitted 17 November, 2016; originally announced November 2016.

arXiv:1605.02688 [pdf, other]

Theano: A Python framework for fast computation of mathematical expressions

Authors: The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano , et al. (88 additional authors not shown)

Abstract: Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, mu… ▽ More Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it. △ Less

Submitted 9 May, 2016; originally announced May 2016.

Comments: 19 pages, 5 figures

arXiv:1602.01783 [pdf, other]

Asynchronous Methods for Deep Reinforcement Learning

Authors: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Abstract: We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural n… ▽ More We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input. △ Less

Submitted 16 June, 2016; v1 submitted 4 February, 2016; originally announced February 2016.

Journal ref: ICML 2016

arXiv:1503.01800 [pdf, other]

EmoNets: Multimodal deep learning approaches for emotion recognition in video

Authors: Samira Ebrahimi Kahou, Xavier Bouthillier, Pascal Lamblin, Caglar Gulcehre, Vincent Michalski, Kishore Konda, Sébastien Jean, Pierre Froumenty, Yann Dauphin, Nicolas Boulanger-Lewandowski, Raul Chandias Ferrari, Mehdi Mirza, David Warde-Farley, Aaron Courville, Pascal Vincent, Roland Memisevic, Christopher Pal, Yoshua Bengio

Abstract: The task of the emotion recognition in the wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple… ▽ More The task of the emotion recognition in the wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset. △ Less

Submitted 29 March, 2015; v1 submitted 5 March, 2015; originally announced March 2015.

arXiv:1411.1784 [pdf, other]

Conditional Generative Adversarial Nets

Authors: Mehdi Mirza, Simon Osindero

Abstract: Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate… ▽ More Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels. △ Less

Submitted 6 November, 2014; originally announced November 2014.

arXiv:1406.2661 [pdf, other]

Generative Adversarial Networks

Authors: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This fram… ▽ More We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples. △ Less

Submitted 10 June, 2014; originally announced June 2014.

arXiv:1312.6211 [pdf, other]

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Authors: Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, Yoshua Bengio

Abstract: Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, co… ▽ More Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated. △ Less

Submitted 3 March, 2015; v1 submitted 21 December, 2013; originally announced December 2013.

arXiv:1308.4214 [pdf, ps, other]

Pylearn2: a machine learning research library

Authors: Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien, Yoshua Bengio

Abstract: Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summa… ▽ More Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially. △ Less

Submitted 19 August, 2013; originally announced August 2013.

Comments: 9 pages

arXiv:1307.0414 [pdf, other]

Challenges in Representation Learning: A report on three machine learning contests

Authors: Ian J. Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Jingjing Xie, Lukasz Romaszko , et al. (3 additional authors not shown)

Abstract: The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kin… ▽ More The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kind of knowledge can be gained from machine learning competitions. △ Less

Submitted 1 July, 2013; originally announced July 2013.

Comments: 8 pages, 2 figures

arXiv:1302.4389 [pdf, other]

Maxout Networks

Authors: Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio

Abstract: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model av… ▽ More We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. △ Less

Submitted 20 September, 2013; v1 submitted 18 February, 2013; originally announced February 2013.

Comments: This is the version of the paper that appears in ICML 2013

Journal ref: JMLR WCP 28 (3): 1319-1327, 2013

arXiv:1301.1701 [pdf, ps, other]

Secrecy Capacity of Two-Hop Relay Assisted Wiretap Channels

Authors: Meysam Mirzaee, Soroush Akhlaghi

Abstract: Incorporating the physical layer characteristics to secure communications has received considerable attention in recent years. Moreover, cooperation with some nodes of network can give benefits of multiple-antenna systems, increasing the secrecy capacity of such channels. In this paper, we consider cooperative wiretap channel with the help of an Amplify and Forward (AF) relay to transmit confident… ▽ More Incorporating the physical layer characteristics to secure communications has received considerable attention in recent years. Moreover, cooperation with some nodes of network can give benefits of multiple-antenna systems, increasing the secrecy capacity of such channels. In this paper, we consider cooperative wiretap channel with the help of an Amplify and Forward (AF) relay to transmit confidential messages from source to legitimate receiver in the presence of an eavesdropper. In this regard, the secrecy capacity of AF relying is derived, assuming the relay is subject to a peak power constraint. To this end, an achievable secrecy rate for Gaussian input is evaluated through solving a non-convex optimization problem. Then, it is proved that any rates greater than this secrecy rate is not achievable. To do this, the capacity of a genie-aided channel as an upper bound for the secrecy capacity of the underlying channel is derived, showing this upper bound is equal to the computed achievable secrecy rate with Gaussian input. Accordingly, the corresponding secrecy capacity is compared to the Decode and Forward (DF) strategy which is served as the benchmark in the current work. △ Less

Submitted 8 January, 2013; originally announced January 2013.

arXiv:1003.1796 [pdf]

Content based Zero-Watermarking Algorithm for Authentication of Text Documents

Authors: Zunera Jalil, Anwar M. Mirza, Maria Sabir

Abstract: Copyright protection and authentication of digital contents has become a significant issue in the current digital epoch with efficient communication mediums such as internet. Plain text is the rampantly used medium used over the internet for information exchange and it is very crucial to verify the authenticity of information. There are very limited techniques available for plain text watermarking… ▽ More Copyright protection and authentication of digital contents has become a significant issue in the current digital epoch with efficient communication mediums such as internet. Plain text is the rampantly used medium used over the internet for information exchange and it is very crucial to verify the authenticity of information. There are very limited techniques available for plain text watermarking and authentication. This paper presents a novel zero-watermarking algorithm for authentication of plain text. The algorithm generates a watermark based on the text contents and this watermark can later be extracted using extraction algorithm to prove the authenticity of text document. Experimental results demonstrate the effectiveness of the algorithm against tampering attacks identifying watermark accuracy and distortion rate on 10 different text samples of varying length and attacks. △ Less

Submitted 9 March, 2010; originally announced March 2010.

Comments: Pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN 1947 5500, http://sites.google.com/site/ijcsis/

Showing 1–45 of 45 results for author: Mirza, M