Multimedia
- [1] arXiv:2407.02773 [pdf, html, other]
Title: OpenVNA: A Framework for Analyzing the Behavior of Multimodal Language Understanding System under Noisy Scenarios
Comments: 10 pages, 4 figures, to be published in ACL 2024 System Demonstration Track
Subjects: Multimedia (cs.MM)
We present OpenVNA, an open-source framework designed for analyzing the behavior of multimodal language understanding systems under noisy conditions. OpenVNA serves as an intuitive toolkit tailored for researchers, facilitating convenient batch-level robustness evaluation and on-the-fly instance-level demonstration. It primarily features a benchmark Python library for assessing global model robustness, offering high flexibility and extensibility, thereby enabling customization with user-defined noise types and models. Additionally, a GUI-based interface has been developed to intuitively analyze local model behavior. In this paper, we delineate the design principles and utilization of the created library and GUI-based web platform. Currently, OpenVNA is publicly accessible at this https URL, with a demonstration video available at this https URL.
- [2] arXiv:2407.02867 [pdf, html, other]
Title: Contrast then Memorize: Semantic Neighbor Retrieval-Enhanced Inductive Multimodal Knowledge Graph Completion
Comments: Accepted by SIGIR 2024
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL)
A large number of studies have emerged for Multimodal Knowledge Graph Completion (MKGC) to predict the missing links in MKGs. However, fewer studies have addressed inductive MKGC (IMKGC), which involves emerging entities unseen during training. Existing inductive approaches focus on learning textual entity representations, neglecting the rich semantic information in the visual modality. Moreover, they focus on aggregating structural neighbors from existing KGs, which are usually limited for emerging entities. In contrast, semantic neighbors are decoupled from the topological linkage and usually imply the true target entity. In this paper, we propose the IMKGC task and a semantic neighbor retrieval-enhanced IMKGC framework, CMR, in which the contrast step brings helpful semantic neighbors close and the memorize step supports semantic neighbor retrieval to enhance inference. Specifically, we first propose a unified cross-modal contrastive learning method to simultaneously capture the textual-visual and textual-textual correlations of query-entity pairs in a unified representation space. The contrastive learning increases the similarity of positive query-entity pairs, thereby bringing the representations of helpful semantic neighbors close. Then, we explicitly memorize the knowledge representations to support semantic neighbor retrieval. At test time, we retrieve the nearest semantic neighbors and interpolate them with the query-entity similarity distribution to augment the final prediction. Extensive experiments validate the effectiveness of CMR on three inductive MKGC datasets. Code is available at this https URL.
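The test-time step described in this abstract (retrieve the nearest memorized neighbors, then interpolate them with the query-entity similarity distribution) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity, the softmax weighting, the mixing weight `lam`, the neighbor count `k`, and the function name `interpolate_with_neighbors` are all assumptions chosen for illustration, and the toy vectors stand in for learned representations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def interpolate_with_neighbors(query, entity_embs, memory, k=2, lam=0.5):
    """Mix a model distribution with a retrieval distribution.

    memory is a list of (stored_representation, target_entity_index) pairs.
    lam and k are hypothetical hyperparameters, not values from the paper.
    """
    # Distribution over candidate entities from query-entity similarity.
    p_model = softmax([cosine(query, e) for e in entity_embs])
    # Retrieve the k memorized representations closest to the query.
    nearest = sorted(memory, key=lambda item: cosine(query, item[0]),
                     reverse=True)[:k]
    weights = softmax([cosine(query, key) for key, _ in nearest])
    # Each retrieved neighbor votes for the entity it was stored with.
    p_knn = [0.0] * len(entity_embs)
    for w, (_, target) in zip(weights, nearest):
        p_knn[target] += w
    return [lam * pk + (1 - lam) * pm for pk, pm in zip(p_knn, p_model)]

entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = [([1.0, 0.0], 2), ([0.9, 0.2], 2), ([0.0, 1.0], 1)]
probs = interpolate_with_neighbors([1.0, 0.1], entities, memory)
```

In this toy setup the similarity scores alone favor entity 0, while the two retrieved neighbors both point to entity 2, so the interpolated distribution shifts toward entity 2.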
- [3] arXiv:2407.03027 [pdf, other]
Title: Differentially Processed Optimized Collaborative Rich Text Editor
Authors: Nishtha Jatana, Mansehej Singh, Charu Gupta, Geetika Dhand, Shaily Malik, Pankaj Dadheech, Nagender Aneja, Sandhya Aneja
Journal-ref: Multimedia Tools and Applications (2024)
Subjects: Multimedia (cs.MM)
A collaborative real-time text editor is an application that allows multiple users to edit a document simultaneously and merge their contributions automatically. It can be made collaborative by implementing a conflict resolution algorithm either on the client side (in peer-to-peer collaboration) or on the server side (when using web sockets and a central server to monitor state changes). Although web sockets are ideal for real-time text editors, using multiple collaborative editors on one connection can create problems. This is because a single web connection cannot monitor which user is collaborating on which application state, leading to unnecessary network queries and data being delivered to the wrong state. To address this issue, the current solution is to open multiple web socket connections, with one web socket per collaboration application. However, this can add significant overhead proportional to the number of apps utilized. In this study, we demonstrate an algorithm that enables using a single web socket for multiple collaborative applications in a collaborative editor. Our method involves modifying the socket's code to track which application's shared state is being worked on and by whom. This allows for the simultaneous collaboration of multiple states in real-time, with infinite users, without opening a different socket for each application. Our optimized editor showed an efficiency improvement of over 96% in access time duration. This approach can be implemented in other collaborative editors and web applications with similar architecture to improve performance and eliminate issues arising from network overload.
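The idea of tracking, on one connection, which application's shared state is being edited and by whom can be sketched as a small in-memory router. This is a hypothetical illustration rather than the algorithm from the paper: the class name `MultiplexedChannel`, the message fields `app`, `user`, and `text`, and the append-only state update are assumptions, and no real socket or conflict resolution is involved.

```python
from collections import defaultdict

class MultiplexedChannel:
    """Routes edits for several applications over one logical connection."""

    def __init__(self):
        self.states = {}                  # app_id -> shared document state
        self.members = defaultdict(set)   # app_id -> users editing that app

    def join(self, app_id, user_id):
        """Register a user as a collaborator on one application's state."""
        self.states.setdefault(app_id, "")
        self.members[app_id].add(user_id)

    def handle(self, message):
        """Apply an edit and return the users who must be notified.

        Only collaborators on the same application are returned, so no
        data is delivered to an unrelated application's state.
        """
        app_id, user_id = message["app"], message["user"]
        if user_id not in self.members.get(app_id, set()):
            raise PermissionError(f"{user_id} has not joined {app_id}")
        # Placeholder update: append text (no conflict resolution here).
        self.states[app_id] += message["text"]
        return sorted(self.members[app_id] - {user_id})

channel = MultiplexedChannel()
channel.join("notes", "alice")
channel.join("notes", "bob")
channel.join("todo", "carol")
recipients = channel.handle({"app": "notes", "user": "alice", "text": "hi"})
```

Tagging every message with an application identifier is what lets a single connection serve several shared states; a production editor would replace the append step with its conflict resolution algorithm.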
- [4] arXiv:2407.03178 [pdf, html, other]
Title: Relating CNN-Transformer Fusion Network for Change Detection
Comments: accepted by IEEE Conference on Multimedia Expo
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
While deep learning, particularly convolutional neural networks (CNNs), has revolutionized remote sensing (RS) change detection (CD), existing approaches often miss crucial features due to neglecting global context and incomplete change learning. Additionally, transformer networks struggle with low-level details. RCTNet addresses these limitations by introducing (1) an early fusion backbone to exploit both spatial and temporal features early on, (2) a Cross-Stage Aggregation (CSA) module for enhanced temporal representation, (3) a Multi-Scale Feature Fusion (MSF) module for enriched feature extraction in the decoder, and (4) an Efficient Self-deciphering Attention (ESA) module utilizing transformers to capture global information and fine-grained details for accurate change detection. Extensive experiments demonstrate RCTNet's clear superiority over traditional RS image CD methods, showing significant improvement and an optimal balance between accuracy and computational cost.
New submissions for Thursday, 4 July 2024 (showing 4 of 4 entries)
- [5] arXiv:2407.02798 (cross-list from cs.HC) [pdf, html, other]
Title: Game-Based Discovery: Harnessing Mini-Games within Primary Games for Scientific Data Collection and Problem Solving
Comments: 6 pages, 4 figures
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
In the popular video game Batman: Arkham Knight, produced by Rocksteady Studios and released in 2015, the primary protagonist of the game is Batman, a vigilante dressed as a bat, fighting crime from the shadows in the fictitious city of Gotham. The game involves a real-world player who takes up the role of Batman to solve a peculiar side mission wherein they have to reconstruct the clean DNA sequence of a human and separate it from mutant DNA to manufacture an antidote to cure the villain. Although this is undoubtedly a fascinating part of the game, one that was absent in previous Batman games, it showcases an interesting notion of using mini-games embedded within primary games to achieve a particular real-world research objective. Although the DNA data used in this case was not real, there are multiple such instances in video games where mini-games have been used for an underlying motive besides entertainment. Based on popular case studies incorporating a similar method, this study characterizes the methodology of designing mini-games within primary games for research purposes into a descriptive framework, highlighting the process's advantages and limitations. It is concluded that these mini-games not only facilitate a deeper understanding of complex scientific concepts but also accelerate data processing and analysis by leveraging crowd-sourced human intuition and pattern recognition capabilities. This paper argues for strategically incorporating miniaturized, gamified elements into established video games that are mainly intended for recreational purposes.
- [6] arXiv:2407.03104 (cross-list from cs.CV) [pdf, html, other]
Title: KeyVideoLLM: Towards Large-scale Video Keyframe Selection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.
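Keyframe selection based on text-video frame similarity, as named in this abstract, can be sketched as ranking frames by their similarity to a text embedding and keeping the top few. The sketch below is an assumption-laden illustration, not the KeyVideoLLM method: the abstract does not specify the encoder or the selection rule, so the cosine scoring, the fixed `k`, and the function name `select_keyframes` are hypothetical, and the toy vectors stand in for embeddings from some text and image encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_keyframes(text_emb, frame_embs, k):
    """Return indices of the k frames most similar to the text.

    Indices are returned in temporal order so the selected frames
    still form a coherent, shortened clip.
    """
    scores = [cosine(text_emb, f) for f in frame_embs]
    top = sorted(range(len(frame_embs)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return sorted(top)

text = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [1.0, 0.0]]
keyframes = select_keyframes(text, frames, k=2)
```

Keeping only the selected frames is what yields the storage savings: here four frames are reduced to two, and the same ranking could be applied per video across a large dataset.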
- [7] arXiv:2407.03107 (cross-list from cs.HC) [pdf, other]
Title: Design of a UE5-based digital twin platform
Subjects: Human-Computer Interaction (cs.HC); Graphics (cs.GR); Multimedia (cs.MM)
To address the high learning and construction costs of current mainstream 3D scene engines, this thesis proposes a design scheme for a digital twin platform based on Unreal Engine 5 (UE5). It aims to provide a general-purpose design process for platform construction that effectively reduces the learning cost of building large-scale scenes. Taking an actual project as an example, it explains the full cycle of platform building and analyzes the digital twin and data visualization technologies and applications based on UE5. Summarizing the project implementation as a process approach improves the standardization and operability of the process pathway.
- [8] arXiv:2407.03188 (cross-list from cs.SD) [pdf, html, other]
Title: MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation
Comments: 19 pages, 5 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets pre-set with expert annotations or auto-generated ones with inherent biases, CaiMD caters more sufficiently to our purpose of aligning AI-generated music with widespread user-desired results. Moreover, we propose an innovative single-stage framework called MuDiT/MuSiT for enabling effective human-machine alignment in song creation. This framework not only achieves cross-modal comprehension between colloquial language and auditory music perceptions but also ensures generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components like melody, harmony, rhythm, vocals, and instrumentation. The approach ensures harmonious sonic cohesiveness amongst all generated musical components, facilitating better resonance with human auditory expectations.
Cross submissions for Thursday, 4 July 2024 (showing 4 of 4 entries)
- [9] arXiv:2310.04673 (replaced) [pdf, html, other]
Title: LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
Authors: Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang
Comments: 10 pages, work in progress
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared with models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves performance comparable or superior to that of strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding.
- [10] arXiv:2402.18844 (replaced) [pdf, html, other]
Title: Deep learning for 3D human pose estimation and mesh recovery: A survey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
3D human pose estimation and mesh recovery have attracted widespread research interest in many areas, such as computer vision, autonomous driving, and robotics. Deep learning on 3D human pose estimation and mesh recovery has recently thrived, with numerous methods proposed to address different problems in this area. In this paper, to stimulate future research, we present a comprehensive review of recent progress over the past five years in deep learning methods for this area by delving into over 200 references. To the best of our knowledge, this survey is arguably the first to comprehensively cover deep learning methods for 3D human pose estimation, including both single-person and multi-person approaches, as well as human mesh recovery, encompassing methods based on explicit models and implicit representations. We also present comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions. A regularly updated project page can be found at this https URL.
- [11] arXiv:2404.00621 (replaced) [pdf, html, other]
Title: Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey
Authors: Qijiong Liu, Jieming Zhu, Yanting Yang, Quanyu Dai, Zhaocheng Du, Xiao-Ming Wu, Zhou Zhao, Rui Zhang, Zhenhua Dong
Comments: Accepted by KDD 2024. See our tutorial materials at this https URL
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
Personalized recommendation serves as a ubiquitous channel for users to discover information tailored to their interests. However, traditional recommendation models primarily rely on unique IDs and categorical features for user-item matching, potentially overlooking the nuanced essence of raw item contents across multiple modalities such as text, image, audio, and video. This underutilization of multimodal data poses a limitation to recommender systems, especially in multimedia services like news, music, and short-video platforms. The recent advancements in large multimodal models offer new opportunities and challenges in developing content-aware recommender systems. This survey seeks to provide a comprehensive exploration of the latest advancements and future trajectories in multimodal pretraining, adaptation, and generation techniques, as well as their applications in enhancing recommender systems. Furthermore, we discuss current open challenges and opportunities for future research in this dynamic domain. We believe that this survey, alongside the curated resources, will provide valuable insights to inspire further advancements in this evolving landscape.
- [12] arXiv:2407.02411 (replaced) [pdf, html, other]
Title: Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs
Comments: arXiv admin note: substantial text overlap with arXiv:2403.13507
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)
The advent of video-based Large Language Models (LLMs) has significantly enhanced video understanding. However, it has also raised some safety concerns regarding data protection, as videos can be more easily annotated, even without authorization. This paper introduces Video Watermarking, a novel technique to protect videos from unauthorized annotations by such video-based LLMs, especially concerning the video content and description, in response to specific queries. By imperceptibly embedding watermarks into key video frames with multi-modal flow-based losses, our method preserves the viewing experience while preventing misuse by video-based LLMs. Extensive experiments show that Video Watermarking significantly reduces the comprehensibility of videos with various video-based LLMs, demonstrating both stealth and robustness. In essence, our method provides a solution for securing video content, ensuring its integrity and confidentiality in the face of evolving video-based LLM technologies.