(Translated by https://www.hiragana.jp/)
Search | arXiv e-print repository
Skip to main content

Showing 1–4 of 4 results for author: Pantazopoulos, G

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.19297  [pdf, other

    cs.CV

    Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

    Authors: Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia

    Abstract: Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that e… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2405.04403  [pdf, other

    cs.CV cs.CL

    Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

    Authors: Georgios Pantazopoulos, Amit Parekh, Malvina Nikandrou, Alessandro Suglia

    Abstract: Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs). While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinc… ▽ More

    Submitted 7 May, 2024; originally announced May 2024.

  3. arXiv:2404.13594  [pdf, other

    cs.CV cs.AI

    Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

    Authors: Georgios Pantazopoulos, Alessandro Suglia, Oliver Lemon, Arash Eshghi

    Abstract: An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a `visual prompt' which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial under… ▽ More

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: NAACL 2024

  4. arXiv:2311.04067  [pdf, other

    cs.LG cs.AI cs.CV

    Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

    Authors: Georgios Pantazopoulos, Malvina Nikandrou, Amit Parekh, Bhathiya Hemanthage, Arash Eshghi, Ioannis Konstas, Verena Rieser, Oliver Lemon, Alessandro Suglia

    Abstract: Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action predi… ▽ More

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023