Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning

Yang Liu1, Xun Zhang1, Jiale Du1, Xinbo Gao1,2, Jungong Han3
1Xidian University, Xi’an, China
2Chongqing University of Posts and Telecommunications, Chongqing, China
3Tsinghua University, Beijing, China
yangl@xidian.edu.cn, xunz724@gmail.com, 23011211070@stu.xidian.edu.cn,
xbgao@mail.xidian.edu.cn, jungonghan77@gmail.com
Abstract

Zero-shot Learning (ZSL) transfers knowledge from seen classes to unseen classes by exploiting auxiliary category information, which makes it a promising yet difficult research topic. Within this field, Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) has attracted great interest: the intricate relations among its three modalities (audio, video, and natural language) make the task challenging but well worth studying. However, existing embedding-based and generative AV-GZSL methods both suffer heavily from the domain shift problem. We therefore propose an extremely simple Out-of-distribution (OOD) detection based AV-GZSL method (EZ-AVOOD) that further mitigates this bias by separating seen and unseen samples at the very first stage. EZ-AVOOD achieves effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and a class-agnostic feature subspace, without training an extra OOD detector network. Following the seen-unseen binary classification, two expert models classify seen and unseen samples separately. Compared with existing state-of-the-art methods, our model achieves superior ZSL and GZSL performance on three audio-visual datasets and sets a new state of the art, which demonstrates the effectiveness of the proposed EZ-AVOOD.

1 Introduction

Multimodal learning has become one of the most active research topics. It covers vision-language tasks such as visual question answering [7], image captioning [34], cross-modal retrieval [39], and visual entailment [37], as well as audio-language tasks such as audio captioning [22], emotion recognition [9], and speech translation [31], since vision, audio, and language are the most common signal sources in the real world. However, collecting and labeling large quantities of videos or images, audio signals, and natural language corpora for these tasks is an overwhelming burden. Zero-shot Learning (ZSL) offers a practical way to avoid much of this data collection: it mines auxiliary information such as semantic attributes and word embeddings to transfer knowledge without ever seeing samples of the unseen classes. For video-audio multimodal tasks, AV-GZSL, which relies on fused audio and visual input together with natural language descriptions, is the methodology of choice for classification, retrieval, and related problems.

Figure 1: Harmonic mean (%) measuring the GZSL performance of our EZ-AVOOD model and the comparison methods on three datasets. EZ-AVOOD (the red bar) consistently outperforms the other methods, with a lead of up to 5% on the UCF-GZSL benchmark.

CJME [28] sets out to apply GZSL to audio-visual features. AVGZSLNet [21] leverages late fusion to integrate information from the two modalities. Most subsequent works focus on efficient fusion of audio-visual feature representations: AVCA [25] first introduces a cross-attention [33] block, which is inherited by [24, 12, 27]. In addition, the generative approach AVFS [40] facilitates contrastive learning with synthesized negative unseen samples. However, the domain shift problem, i.e., the bias towards seen classes, is unavoidable in ZSL, and AV-GZSL behaves the same way, especially for embedding-based approaches. Generative solutions aim to reduce the bias with synthesized unseen samples, while Calibrated Stacking [4] searches for extra hyper-parameters to suppress the model’s tendency towards seen samples. Considering that generative models are unstable to train and that the effect of Calibrated Stacking is limited, in this paper we propose an extremely simple OOD detection based AV-GZSL method named EZ-AVOOD that alleviates the bias problem explicitly. Figure 1 shows that our model outperforms all compared methods on three audio-visual benchmarks.

Typically, OOD-based approaches in ZSL share a common three-part framework: an OOD detector, a seen classifier, and an unseen classifier, each of which plays a vital role in the final GZSL performance. These three components work largely independently, so each can be replaced when a stronger component becomes available. Previous work uses a generative method such as WGAN-GP [1, 10] to first synthesize unseen sample features and then jointly trains a completely new OOD detector on seen samples and synthesized unseen samples [35]. In contrast, the proposed EZ-AVOOD model performs OOD detection without training an OOD detector, relying instead on the class-specific logits produced by the supervised seen classifier and the class-agnostic information hidden in the feature subspace. More specifically, our method only needs to train a seen classifier consisting of one MLP (Multilayer Perceptron) and to fine-tune an existing embedding-based method [14] as the unseen expert classifier, which considerably reduces the complexity of the entire model. Moreover, the unseen classifier of EZ-AVOOD can be replaced by any stronger AV-GZSL method, allowing future researchers to achieve higher overall GZSL performance. The main contributions of this paper can be summarized in the following three aspects:

  • We propose EZ-AVOOD, an extremely simple OOD detection based model that addresses the AV-GZSL problem with an OOD score derived from class-specific logits and a class-agnostic feature subspace, instead of training a completely new OOD detector.

  • We comprehensively demonstrate the effectiveness of EZ-AVOOD model through practical ZSL and GZSL experiments on three different audio-visual benchmarks and observe a substantial enhancement in comparison with the current state-of-the-art methodologies.

  • The proposed effective OOD detection method EZ-OOD possesses strong compatibility with existing AV-GZSL approaches, which indicates that future researchers could effortlessly improve the GZSL performance by simply replacing the unseen expert with more powerful substitutes.

2 Related Work

2.1 Audio-visual Generalized Zero-shot Learning

Embedding approaches, which map videos, audio signals, and text into a shared space to obtain joint feature representations for subsequent classification or retrieval, are widely popular in AV-GZSL. Among them, the early CJME [28] employs a triplet loss to constrain the distance between audio-visual features and class embeddings on the proposed AudioSetZSL [28] dataset. More recently, TCaF [24] proposes a temporal cross-attention framework that extends AVCA [25] by taking temporal information into account. The Hyperbolic [12] method employs a hyperbolic alignment loss and a cross-attention module to further improve the separability of joint feature representations. EZ-AVGZL [27] uses class embedding optimization to make class embeddings more discriminative while maintaining their original semantics. ClipClap [14] enhances feature quality with the pre-trained CLIP [29] and CLAP [23] models. As for generative methods, AVFS [40] (Audio-Visual Feature Synthesis) generates unseen samples to facilitate contrastive training.

2.2 Out-of-distribution Detection and Post-hoc Methods

Out-of-distribution (OOD) detection was proposed to secure the smooth deployment of machine learning models in real-world scenarios, since these models are typically trained and validated in closed-world settings. It assigns each sample a scalar OOD score; because ID (in-distribution) and OOD samples produce different scores, applying a threshold to the score separates the two.

Post-hoc methods for OOD detection typically first train a model or classifier on ID data; the model, with frozen parameters, is then used as an OOD detector at test time, which is cost-effective compared with training an extra detector. Post-hoc methods compute an OOD score from the pre-trained model’s output. One simple solution, MSP [11], uses the maximum softmax probability as the OOD score to distinguish ID and OOD samples. To improve the discriminability of the softmax score, ODIN [18] employs temperature scaling to make the softmax score distribution more uniform and also applies input perturbation. The logits-based Energy score [20, 19] applies the LogSumExp function to the output logits. Among feature-based methods [8, 32, 13], WDiscOOD [6] leverages LDA (linear discriminant analysis) [38] to enlarge the inter-class discrepancy and reduce the intra-class gap for better ID-OOD separation.
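To make the post-hoc idea concrete, the following minimal Python sketch computes an MSP-style score from the logits of a frozen classifier and thresholds it. The function names, the toy logits, and the threshold value are illustrative assumptions and are not taken from [11].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability: higher means more ID-like."""
    return max(softmax(logits))


def is_in_distribution(logits, threshold):
    """Post-hoc decision: no extra detector is trained, only a threshold is applied."""
    return msp_score(logits) >= threshold


# Hypothetical logits: one confident prediction and one nearly flat prediction.
confident = [8.0, 0.5, -1.0]
flat = [0.1, 0.0, -0.1]
```

A confident prediction yields a score close to 1, while a nearly flat prediction yields a score close to 1/C, so a single threshold on the score can separate the two cases.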

3 Proposed Approach

Figure 2: The general framework of EZ-AVOOD. Four key modules, “Feature Extractor”, “OOD Detector”, “Seen Classifier”, and “Unseen Classifier”, make up the complete model. The feature extractor has fixed parameters and simply produces audio-visual features $\boldsymbol{a}\oplus\boldsymbol{v}$ ($\oplus$ denotes concatenation) and text embeddings $\boldsymbol{t}$ without further optimization. The seen classifier and the OOD detector are implemented as two identical MLPs that share the same parameters, so only one of them needs to be trained for both modules to work. The formulation of the OOD score is illustrated in Figure 3. At the evaluation stage, the OOD detector distinguishes seen from unseen samples and feeds them to the trained seen expert and unseen expert classifiers, respectively (red arrows).

3.1 Problem Statement of Audio-visual GZSL

AV-GZSL aims to recognize both previously seen and unseen video-audio combinations, given a set of seen (training) audio-visual events together with human-readable text descriptions. We denote the samples from seen classes by $\boldsymbol{S}=(\boldsymbol{a}_{i}^{s},\boldsymbol{v}_{i}^{s},\boldsymbol{t}_{i}^{s},y_{i}^{s})_{i\in\{1,\cdots,M\}}$, where the $M$ seen samples consist of audio features $\boldsymbol{a}_{i}^{s}$, visual features $\boldsymbol{v}_{i}^{s}$, and textual descriptions $\boldsymbol{t}_{i}^{s}$, together with the corresponding ground-truth class labels $y_{i}^{s}$. Likewise, the unseen dataset is denoted by $\boldsymbol{U}=(\boldsymbol{a}_{j}^{u},\boldsymbol{v}_{j}^{u},\boldsymbol{t}_{j}^{u},y_{j}^{u})_{j\in\{1,\cdots,K\}}$ with $K$ samples, and notably $\boldsymbol{S}\cap\boldsymbol{U}=\varnothing$. The number of seen classes is $C_{s}$, with labels $\boldsymbol{Y}^{s}=(y_{1}^{s},\cdots,y_{M}^{s})\in\{1,\cdots,C_{s}\}$, and the number of unseen classes is $C_{u}$, with labels $\boldsymbol{Y}^{u}=(y_{1}^{u},\cdots,y_{K}^{u})\in\{1,\cdots,C_{u}\}$. Hence, for the ZSL task, AV-GZSL learns a model $f_{ZSL}:\boldsymbol{X}\rightarrow\boldsymbol{Y}^{u}$ that classifies unseen samples only, where $\boldsymbol{X}=(\boldsymbol{a}_{z},\boldsymbol{v}_{z})$ denotes the test dataset. In GZSL, the classifier $f_{GZSL}:\boldsymbol{X}\rightarrow\boldsymbol{Y}^{u}\cup\boldsymbol{Y}^{s}$ aims to classify both seen and unseen examples.

3.2 Model Architecture

The proposed model is depicted in Figure 2 and there are three essential components in our model: the OOD detector, the supervised seen classifier, and the unseen classifier adapted from an existing embedding method. The greatest strength of our method is that there is no need to train a new OOD detector, since our OOD detector shares the same parameters with the seen classifier, which means the supervised classifier trained with seen samples also serves as the OOD detector. Therefore, EZ-AVOOD significantly reduces the complexity of the model and brings a considerable decrease in the computational overhead and training time.

3.2.1 Seen Classifier

To handle seen samples, we adopt a simple and efficient 3-layer MLP, optimized with the cross-entropy loss $\boldsymbol{\mathcal{L}}_{xent}=\textbf{CrossEntropy}(\boldsymbol{x},y(\boldsymbol{x}))$, as the seen expert classifier. Here $\boldsymbol{x}$ denotes the joint audio-visual feature of a seen-class sample, constructed by simple concatenation: $\boldsymbol{x}=\boldsymbol{a}\oplus\boldsymbol{v}$. Once the seen classifier is trained, it can also be used as the OOD detector; the details are given in the next part.
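The seen classifier can be sketched in a few lines of NumPy. This is only a forward pass with a cross-entropy loss, not a training loop, and several details are assumptions made for illustration: "3-layer" is read as three linear layers, and the hidden widths, ReLU activations, initialization, and toy feature dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Create (weight, bias) pairs for consecutive layer sizes (hypothetical init)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp_logits(params, x):
    """Forward pass: ReLU after every layer except the last, which outputs logits."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b


def cross_entropy(logits, labels):
    """Mean cross-entropy computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Toy batch: 4 samples, 6-D audio and 10-D visual features (dimensions are assumptions).
a = rng.normal(size=(4, 6))
v = rng.normal(size=(4, 10))
x = np.concatenate([a, v], axis=1)        # x = a (+) v, simple concatenation
params = init_mlp([16, 32, 32, 5])        # three linear layers, 5 seen classes
labels = np.array([0, 1, 2, 3])
logits = mlp_logits(params, x)            # class-specific logits, reused for OOD scoring
loss = cross_entropy(logits, labels)
```

The same `logits` array is what a post-hoc OOD score would consume at test time, which is why no separate detector needs to be trained.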

Figure 3: The formulation of the EZ-OOD score. During the training phase, the “Residual Subspace” is derived by eigen-decomposition from the feature matrix of all seen samples. At test time (pink arrows), the concatenated audio-visual feature $\boldsymbol{a}\oplus\boldsymbol{v}$ is projected onto the residual subspace to obtain the “Residual Score”, and the “Energy Score” is calculated from the logits that the MLP (in fact the trained seen classifier) produces for the test sample. The final OOD score is defined as the weighted sum of the energy score and the residual score.

3.2.2 Out-of-distribution Detector

We adopt the post-hoc idea to design an OOD algorithm for seen-unseen separation in the AV-GZSL problem. The output of a pre-trained model usually includes high-dimensional features, logits, or softmax probabilities; we choose to exploit the intrinsic information held in the class-specific logits and the feature representation to construct the OOD score. Consequently, the trained seen classifier also serves as the “OOD detector” in our method, producing the class-dependent logits. We name the proposed extremely simple OOD detection method “EZ-OOD”; its formulation pipeline is illustrated in Figure 3. The EZ-OOD score consists of an Energy Score calculated from the logits and a Residual Score derived from the residual subspace.

Class-specific Logits and Energy Score  The seen classifier takes the high-dimensional fused audio-visual features as input and produces logits corresponding to the specific seen class labels. We adopt the widely used LogSumExp-based energy function $\boldsymbol{E}(\boldsymbol{x};\,\boldsymbol{l})$ as one vital part of our OOD score; it maps the logits $\boldsymbol{l}$ of sample $\boldsymbol{x}$ to a scalar:

$$\boldsymbol{E}(\boldsymbol{x};\,\boldsymbol{l})=-\log\sum_{i=1}^{C}e^{l_{i}(\boldsymbol{x})}~, \tag{1}$$

where $C$ is the number of seen classes and $l_{i}(\boldsymbol{x})$ is the logit of class $i$ for sample $\boldsymbol{x}$. In practice we use the negated energy $\boldsymbol{E}(\boldsymbol{x})=-\boldsymbol{E}(\boldsymbol{x};\,\boldsymbol{l})$ as the score, so that seen samples produce higher scores, which is consistent with the convention in OOD detection.
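A minimal Python sketch of Eq. (1) and its negation follows. The max-shift trick used for numerical stability and the function names are implementation choices assumed here, not details stated in the text.

```python
import math


def energy(logits):
    """E(x; l) = -log(sum_i exp(l_i)), evaluated with the max-shift trick."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))


def energy_score(logits):
    """Negated energy E(x) = -E(x; l): higher values indicate seen-like samples."""
    return -energy(logits)


# Hypothetical logits: a peaked (seen-like) vector and a flat (unseen-like) vector.
peaked = [9.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0]
```

Shifting by the maximum logit keeps the exponentials bounded, so the score stays finite even for very large logits such as `[1000.0, 0.0]`.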

Feature Representation and Class-agnostic Residual Score  Let $\boldsymbol{x}\in\mathbb{R}^{D}$ be the $D$-dimensional fused bi-modal sample feature and let $\boldsymbol{X}$ denote the audio-visual feature matrix of all seen samples. The principal subspace $P$ is defined as the $N$-dimensional space spanned by the eigenvectors corresponding to the top-$N$ eigenvalues of the matrix $\boldsymbol{X}^{T}\boldsymbol{X}$, and the residual subspace $P^{\perp}$ is the orthogonal complement of $P$. Thus we have $\boldsymbol{x}=\boldsymbol{x}^{P}+\boldsymbol{x}^{P^{\perp}}$, where $\boldsymbol{x}^{P}$ is the projection of the sample feature $\boldsymbol{x}$ onto the subspace $P$ and $\boldsymbol{x}^{P^{\perp}}$ is its projection onto $P^{\perp}$. Suppose the eigen-decomposition of the matrix $\boldsymbol{X}^{T}\boldsymbol{X}$ is

$$\boldsymbol{X}^{T}\boldsymbol{X}=\boldsymbol{W}\boldsymbol{\Lambda}\boldsymbol{W}^{T}~, \tag{2}$$

where the columns of $\boldsymbol{W}$ form an orthonormal basis, arranged in decreasing order of the eigenvalues on the diagonal of $\boldsymbol{\Lambda}$. The $N$-dimensional principal subspace $P$ is spanned by the first $N$ column vectors of $\boldsymbol{W}$, and the span of the $(N+1)$-th to the $D$-th column vectors of $\boldsymbol{W}$ is the residual subspace $P^{\perp}$. The matrix $\boldsymbol{W}$ can then be split into a matrix $\boldsymbol{Q}\in\mathbb{R}^{D\times N}$ formed by the first $N$ eigenvectors and a matrix $\boldsymbol{O}\in\mathbb{R}^{D\times(D-N)}$ formed by the last $(D-N)$ eigenvectors, and we get

$$\boldsymbol{x}^{P}=\boldsymbol{Q}^{T}\boldsymbol{x};\ \boldsymbol{x}^{P^{\perp}}=\boldsymbol{O}^{T}\boldsymbol{x}~. \tag{3}$$
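The construction of $\boldsymbol{Q}$ and $\boldsymbol{O}$ in Eqs. (2) and (3) can be sketched with NumPy as follows. The toy data, the choice $N=2$, and the function name are assumptions for illustration; the text does not say whether $\boldsymbol{X}$ is centered, so no centering is applied here.

```python
import numpy as np


def split_subspaces(X, n_principal):
    """Split the eigenvectors of X^T X into principal (Q) and residual (O) bases."""
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1]          # re-sort to decreasing eigenvalues
    W = eigvecs[:, order]
    return W[:, :n_principal], W[:, n_principal:]


rng = np.random.default_rng(0)
# Toy "seen" features in R^5 that lie almost entirely in a 2-D subspace.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))

Q, O = split_subspaces(X, n_principal=2)       # Q: (5, 2), O: (5, 3)
x = X[0]
x_P, x_res = Q.T @ x, O.T @ x                  # Eq. (3): coordinates in P and its complement
```

Because the columns of $\boldsymbol{W}$ are orthonormal, mapping both coordinate vectors back with `Q @ x_P + O @ x_res` recovers `x`, which matches the decomposition $\boldsymbol{x}=\boldsymbol{x}^{P}+\boldsymbol{x}^{P^{\perp}}$.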

Since the principal subspace and the residual subspace are constructed from the feature representations of all training samples, they ignore the information specific to any individual seen category; that is, the characteristics captured by these two subspaces are class-agnostic. Moreover, we argue that seen samples lie relatively close to the principal subspace and therefore have only a small component in the residual subspace. We therefore define the Residual Score $\boldsymbol{R}(\boldsymbol{x})$ as the negative norm of $\boldsymbol{x}^{P^{\perp}}$:

$$\boldsymbol{R}(\boldsymbol{x})=-\|\boldsymbol{x}^{P^{\perp}}\|=-\left(\boldsymbol{x}^{T}\boldsymbol{O}\boldsymbol{O}^{T}\boldsymbol{x}\right)^{1/2}~. \tag{4}$$

As with the Energy Score, we negate the norm so that ID samples produce higher OOD scores.
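The residual score of Eq. (4) can be checked on a small example. In this sketch the residual basis `O` is fixed by hand (the third coordinate axis of a 3-D space) rather than learned, and the two sample vectors are hypothetical; both forms of the equation are implemented to show that they agree.

```python
import numpy as np


def residual_score(x, O):
    """R(x) = -||O^T x||: negative norm of the component in the residual subspace."""
    return -np.linalg.norm(O.T @ x)


def residual_score_quadratic(x, O):
    """Equivalent quadratic form from Eq. (4): -(x^T O O^T x)^(1/2)."""
    return -np.sqrt(x @ O @ O.T @ x)


# Hand-picked residual basis: principal subspace = span(e1, e2), residual = span(e3).
O = np.array([[0.0], [0.0], [1.0]])
seen_like = np.array([3.0, 4.0, 0.1])      # almost no residual component
unseen_like = np.array([1.0, 1.0, 2.0])    # large residual component
```

The seen-like vector scores about -0.1 and the unseen-like vector scores -2.0, so the seen-like sample receives the higher score, as intended.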

EZ-OOD Score Formulation  The final OOD score $\boldsymbol{S}(\boldsymbol{x})$ is the weighted sum of the energy score and the residual score, which unifies the class-specific and class-agnostic information for better ID-OOD separation:

$$\boldsymbol{S}(\boldsymbol{x})=\boldsymbol{E}(\boldsymbol{x})+\gamma\boldsymbol{R}(\boldsymbol{x})~, \tag{5}$$

where $\gamma$ is a weight hyper-parameter that balances the scales of the two scores and enhances the overall OOD detection performance.
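Putting the two parts together, a batched version of Eq. (5) might look like the NumPy sketch below. The function name, the toy inputs, and the value $\gamma=0.5$ are assumptions; the residual basis `O` is taken as given.

```python
import numpy as np


def ez_ood_scores(logits, feats, O, gamma):
    """S(x) = E(x) + gamma * R(x) for a batch of samples (one row per sample)."""
    m = logits.max(axis=1, keepdims=True)
    # E(x) = log(sum_i exp(l_i)), i.e. the negated energy of Eq. (1).
    energy = (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()
    # R(x) = -||O^T x||, computed row-wise as the norm of feats @ O.
    residual = -np.linalg.norm(feats @ O, axis=1)
    return energy + gamma * residual


# Toy batch of two samples in a 2-D feature space with a 1-D residual subspace.
logits = np.array([[5.0, 0.0], [0.0, 0.0]])
feats = np.array([[1.0, 0.0], [0.0, 2.0]])
O = np.array([[0.0], [1.0]])
scores = ez_ood_scores(logits, feats, O, gamma=0.5)
```

The first sample has a peaked logit vector and no residual component, so it scores higher than the second, whose logits are flat and whose feature lies entirely in the residual subspace.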

The OOD detection process is defined below, where $A_{\lambda}(\boldsymbol{x})$ is the binary classification outcome:

$$A_{\lambda}(\boldsymbol{x})=\begin{cases}\text{Seen} & \boldsymbol{S}(\boldsymbol{x})\geq\lambda\\ \text{Unseen} & \boldsymbol{S}(\boldsymbol{x})<\lambda~,\end{cases} \tag{6}$$

where $\lambda$ is the threshold, and samples with a higher $\boldsymbol{S}(\boldsymbol{x})$ tend to be treated as seen classes. The threshold is determined solely by the training samples and does not depend on the test data.
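The decision rule of Eq. (6) and the routing to the two experts can be sketched as follows. The text does not state how $\lambda$ is derived from the training samples, so the percentile-style rule in `threshold_from_train` is purely a hypothetical choice, as are the function names and the stand-in expert callables.

```python
import math


def threshold_from_train(train_scores, keep_fraction=0.95):
    """Hypothetical rule: largest lambda that still accepts keep_fraction of the seen training scores."""
    ranked = sorted(train_scores, reverse=True)
    k = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[k - 1]


def detect(score, lam):
    """Eq. (6): scores at or above the threshold are treated as seen."""
    return "Seen" if score >= lam else "Unseen"


def classify(sample, score, lam, seen_expert, unseen_expert):
    """Route the sample to the seen or unseen expert according to the OOD decision."""
    expert = seen_expert if detect(score, lam) == "Seen" else unseen_expert
    return expert(sample)


# Toy OOD scores of eight seen training samples; only these fix the threshold.
train_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
lam = threshold_from_train(train_scores, keep_fraction=0.75)
```

With these toy scores, `lam` equals 3.0, so six of the eight training samples are accepted as seen. Test data never enters the computation of the threshold.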

| Methods | VGGSound-GZSL$^{cls}$ | | | | UCF-GZSL$^{cls}$ | | | | ActivityNet-GZSL$^{cls}$ | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | $acc_{\boldsymbol{S}}$ | $acc_{\boldsymbol{U}}$ | $\boldsymbol{H}$ | $acc_{ZSL}$ | $acc_{\boldsymbol{S}}$ | $acc_{\boldsymbol{U}}$ | $\boldsymbol{H}$ | $acc_{ZSL}$ | $acc_{\boldsymbol{S}}$ | $acc_{\boldsymbol{U}}$ | $\boldsymbol{H}$ | $acc_{ZSL}$ |
| CJME [28] | 11.96 | 5.41 | 7.45 | 6.84 | 48.18 | 17.68 | 25.87 | 20.46 | 16.06 | 9.13 | 11.64 | 9.92 |
| AVGZSLNet [21] | 13.02 | 2.88 | 4.71 | 5.44 | 56.26 | 34.37 | 42.67 | 35.66 | 14.81 | 11.11 | 12.70 | 12.39 |
| AVCA [25] | 32.47 | 6.81 | 11.26 | 8.16 | 34.90 | 38.67 | 36.69 | 38.67 | 24.04 | 19.88 | 21.76 | 20.88 |
| Hyper-multiple [12] | 21.99 | 8.12 | 11.87 | 8.47 | 43.52 | 39.77 | 41.56 | 40.28 | 20.52 | 21.30 | 20.90 | 22.18 |
| ClipClap [14] | 29.68 | 11.12 | 16.18 | 11.53 | 77.14 | 43.91 | 55.97 | 46.96 | 45.98 | 20.06 | 27.93 | 22.76 |
| EZ-AVOOD (Ours) | 39.33 | 11.84 | 18.21 | 13.28 | 83.53 | 48.01 | 60.97 | 50.92 | 41.56 | 21.06 | 27.95 | 25.20 |

Table 1: Comparison with existing state-of-the-art methods on the VGGSound-GZSL$^{cls}$, UCF-GZSL$^{cls}$, and ActivityNet-GZSL$^{cls}$ datasets. GZSL performance ($acc_{\boldsymbol{S}}$ / $acc_{\boldsymbol{U}}$ / $\boldsymbol{H}$) and ZSL performance ($acc_{ZSL}$) are reported in percent. For a fair comparison, the results of all five baseline methods are obtained using audio-visual features and text embeddings extracted by the CLIP and CLAP models. Bold values represent the best results and the second-ranked numbers are underlined.
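The $\boldsymbol{H}$ column can be related to the two accuracy columns with a short sketch. It assumes that $\boldsymbol{H}$ is the usual harmonic mean of $acc_{\boldsymbol{S}}$ and $acc_{\boldsymbol{U}}$, which the Figure 1 caption suggests but which the text does not spell out as a formula; the function name is hypothetical.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen and unseen accuracy, H = 2*S*U / (S + U) (assumed definition)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# EZ-AVOOD rows of Table 1: (acc_S, acc_U) per dataset.
h_ucf = harmonic_mean(83.53, 48.01)
h_activitynet = harmonic_mean(41.56, 21.06)
```

Recomputing from the rounded table entries reproduces the reported UCF and ActivityNet values to two decimals. Small last-digit differences on other rows are possible because the table entries are themselves rounded.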

3.2.3 Unseen Classifier

Here we fine-tune the ClipClap [14] model to raise the average accuracy on unseen classes and thereby improve the final GZSL performance. As illustrated in Figure 2, the general framework consists of two Encoder-Encoder-Decoder branches that process the concatenated audio-visual features and the fused text embeddings, respectively. For feature extraction, the visual features $\boldsymbol{v}$ and one part of the concatenated text embeddings, $\boldsymbol{t^{v}}$, are extracted by the vision-language pre-trained model CLIP [29], while the CLAP [23] model, dedicated to audio-language tasks, produces the audio features $\boldsymbol{a}$ and the other part of the text embeddings, $\boldsymbol{t^{a}}$.

The first encoder block $\boldsymbol{O}_{enc}$ of the audio-visual branch takes the concatenated features $\boldsymbol{x}=\boldsymbol{a}\oplus\boldsymbol{v}$ as input and outputs the multimodal sample features $\boldsymbol{o}$:

$$\boldsymbol{o}=\boldsymbol{O}_{enc}(\boldsymbol{x})~. \tag{7}$$

In the same way, we obtain the unified text embeddings $\boldsymbol{w}$ with the encoder $\boldsymbol{W}_{enc}$:

$$\boldsymbol{w}=\boldsymbol{W}_{enc}(\boldsymbol{t^{a}}\oplus\boldsymbol{t^{v}})~. \tag{8}$$

With multimodal sample features 𝒐𝒐\boldsymbol{o}bold_italic_o and fused text embeddings 𝒘𝒘\boldsymbol{w}bold_italic_w as the input of the rest two simple and effective Encoder-Decoder compound modules, we have the following formulations:

𝜽o=𝑶proj(𝒐);𝝆o=𝑫o(𝜽o),formulae-sequencesubscript𝜽𝑜subscript𝑶𝑝𝑟𝑜𝑗𝒐subscript𝝆𝑜subscript𝑫𝑜subscript𝜽𝑜\boldsymbol{\theta}_{o}=\boldsymbol{O}_{proj}(\boldsymbol{o});\ \boldsymbol{% \rho}_{o}=\boldsymbol{D}_{o}(\boldsymbol{\theta}_{o})~{},bold_italic_θ start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT = bold_italic_O start_POSTSUBSCRIPT italic_p italic_r italic_o italic_j end_POSTSUBSCRIPT ( bold_italic_o ) ; bold_italic_ρ start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT = bold_italic_D start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ) , (9)
𝜽w=𝑾proj(𝒘);𝝆w=𝑫w(𝜽w),formulae-sequencesubscript𝜽𝑤subscript𝑾𝑝𝑟𝑜𝑗𝒘subscript𝝆𝑤subscript𝑫𝑤subscript𝜽𝑤\boldsymbol{\theta}_{w}=\boldsymbol{W}_{proj}(\boldsymbol{w});\ \boldsymbol{% \rho}_{w}=\boldsymbol{D}_{w}(\boldsymbol{\theta}_{w})~{},bold_italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = bold_italic_W start_POSTSUBSCRIPT italic_p italic_r italic_o italic_j end_POSTSUBSCRIPT ( bold_italic_w ) ; bold_italic_ρ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = bold_italic_D start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) , (10)

where $\boldsymbol{\theta}_{o}$ and $\boldsymbol{\theta}_{w}$ represent the projection outcomes, while the reconstruction process produces $\boldsymbol{\rho}_{o}$ and $\boldsymbol{\rho}_{w}$. The training objectives of the unseen classifier include the Cross Entropy Loss $\boldsymbol{\mathcal{L}}_{xe}$:

$$\boldsymbol{\mathcal{L}}_{xe}=-\frac{1}{n}\sum_{i}^{n}\log\left(\frac{\exp\left(\boldsymbol{\theta}_{w_{y_{i}^{s}}}\boldsymbol{\theta}_{o_{i}}\right)}{\sum_{j}^{C_{s}}\exp\left(\boldsymbol{\theta}_{w_{y_{j}^{s}}}\boldsymbol{\theta}_{o_{i}}\right)}\right)~, \tag{11}$$

where $y_{i}^{s}$ is the label of seen sample $i$, and $\boldsymbol{\theta}_{w_{y_{i}^{s}}}$ denotes the $\boldsymbol{\theta}_{w}$-projection of the text embedding belonging to seen class $y_{i}^{s}$. $n$ and $C_{s}$ represent the number of training samples and seen categories, respectively. Another loss function is the Reconstruction Loss $\boldsymbol{\mathcal{L}}_{rec}$, which minimizes the discrepancy between $\boldsymbol{\rho}$ and the text embeddings $\boldsymbol{w}$ with MSE (mean squared error):

$$\boldsymbol{\mathcal{L}}_{rec}=\frac{1}{n}\sum_{i}^{n}\left[{\left(\boldsymbol{\rho}_{o_{i}}-\boldsymbol{w}_{i}\right)}^{2}+{\left(\boldsymbol{\rho}_{w_{i}}-\boldsymbol{w}_{i}\right)}^{2}\right]~. \tag{12}$$

Moreover, a Regression Loss $\boldsymbol{\mathcal{L}}_{reg}$, calculated by the MSE function between $\boldsymbol{\theta}_{o}$ and $\boldsymbol{\theta}_{w}$, is defined as:

$$\boldsymbol{\mathcal{L}}_{reg}=\frac{1}{n}\sum_{i}^{n}{\left(\boldsymbol{\theta}_{o_{i}}-\boldsymbol{\theta}_{w_{i}}\right)}^{2}~. \tag{13}$$

The overall loss for the unseen classifier is defined as:

$$\boldsymbol{\mathcal{L}}_{total}=\boldsymbol{\mathcal{L}}_{xe}+\boldsymbol{\mathcal{L}}_{rec}+\boldsymbol{\mathcal{L}}_{reg}~. \tag{14}$$

Following the original loss function design, no weighting parameters are applied to the individual terms of the final loss $\boldsymbol{\mathcal{L}}_{total}$.
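The three objectives in Eqs. (11)-(14) can be sketched in NumPy as follows. This is only an illustrative reading of the formulas, not the training code: the function name, argument names, and array layouts are assumptions, and the squared differences are summed over the embedding dimension, which is one possible interpretation of the MSE terms (averaging over the dimension would only rescale them).

```python
import numpy as np

def unseen_classifier_losses(theta_o, theta_w_cls, rho_o, rho_w_cls, w_cls, labels):
    """Hypothetical helper returning (L_xe, L_rec, L_reg, L_total).

    theta_o:     (n, d)   projected audio-visual features
    theta_w_cls: (C_s, d) projected text embeddings, one row per seen class
    rho_o:       (n, k)   reconstructions decoded from theta_o
    rho_w_cls:   (C_s, k) reconstructions decoded from theta_w_cls
    w_cls:       (C_s, k) fused text embeddings, one row per seen class
    labels:      (n,)     integer seen-class label of each sample
    """
    # Eq. (11): softmax cross entropy over dot-product similarities.
    logits = theta_o @ theta_w_cls.T                       # (n, C_s)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_xe = -log_prob[np.arange(len(labels)), labels].mean()

    # Gather the class-level text quantities that belong to each sample.
    w = w_cls[labels]
    theta_w = theta_w_cls[labels]
    rho_w = rho_w_cls[labels]

    # Eq. (12): both reconstructions should match the text embedding.
    # Assumption: squared differences are summed over the embedding dimension.
    l_rec = (((rho_o - w) ** 2).sum(axis=1) + ((rho_w - w) ** 2).sum(axis=1)).mean()

    # Eq. (13): the two projections should coincide.
    l_reg = ((theta_o - theta_w) ** 2).sum(axis=1).mean()

    # Eq. (14): unweighted sum of the three terms.
    return l_xe, l_rec, l_reg, l_xe + l_rec + l_reg
```

With perfect reconstructions and perfectly aligned projections, the reconstruction and regression terms vanish and only the cross-entropy term contributes to the total.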

During the test phase, the classification result is determined by the nearest neighbor principle: the class whose text embedding projection is closest to the sample feature projection $\boldsymbol{\theta}_{o}^{i}$ is selected as the predicted label $c_{i}$:

$$c_{i}=\underset{j}{\operatorname{argmin}}\left(\left\|\boldsymbol{\theta}_{w}^{j}-\boldsymbol{\theta}_{o}^{i}\right\|_{2}\right)~, \tag{15}$$

where $\boldsymbol{\theta}_{w}^{j}$ is the projected text embedding corresponding to class $j$.
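The nearest-neighbor rule of Eq. (15) amounts to an argmin over Euclidean distances between each projected sample and every projected class embedding. A minimal sketch, in which the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def predict_nearest_class(theta_o, theta_w_cls):
    """Return, for each sample, the index j of the closest class embedding.

    theta_o:     (n, d) projected sample features
    theta_w_cls: (C, d) projected text embeddings, one row per candidate class
    """
    # Pairwise differences via broadcasting: (n, 1, d) - (1, C, d) -> (n, C, d).
    diff = theta_o[:, None, :] - theta_w_cls[None, :, :]
    dist = np.linalg.norm(diff, axis=2)   # (n, C) Euclidean distances
    return dist.argmin(axis=1)            # Eq. (15)
```

The candidate set passed as `theta_w_cls` determines the task: restricting it to the unseen classes corresponds to classification by the unseen expert.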

| Methods | VGGSound $acc_{\boldsymbol{S}}$ | VGGSound $acc_{\boldsymbol{U}}$ | VGGSound $\boldsymbol{H}$ | VGGSound $acc_{ZSL}$ | UCF $acc_{\boldsymbol{S}}$ | UCF $acc_{\boldsymbol{U}}$ | UCF $\boldsymbol{H}$ | UCF $acc_{ZSL}$ | ActivityNet $acc_{\boldsymbol{S}}$ | ActivityNet $acc_{\boldsymbol{U}}$ | ActivityNet $\boldsymbol{H}$ | ActivityNet $acc_{ZSL}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CJME [28] | 8.69 | 4.78 | 6.17 | 5.16 | 26.04 | 8.21 | 12.48 | 8.29 | 5.55 | 4.75 | 5.12 | 5.84 |
| AVGZSLNet [21] | 18.15 | 3.48 | 5.83 | 5.28 | 52.52 | 10.90 | 18.05 | 13.65 | 8.93 | 5.04 | 6.44 | 5.40 |
| TCaF [24] | 9.64 | 5.91 | 7.33 | 6.06 | 58.60 | 21.74 | 31.72 | 24.81 | 18.70 | 7.50 | 10.71 | 7.91 |
| VIB-GZSL [17] | 18.42 | 6.00 | 9.05 | 6.41 | 90.35 | 21.41 | 34.62 | 22.49 | 22.12 | 8.94 | 12.73 | 9.29 |
| AVMST [15] | 14.14 | 5.28 | 7.68 | 6.61 | 44.08 | 22.63 | 29.91 | 28.19 | 17.75 | 9.90 | 12.71 | 10.37 |
| MDFT [16] | 16.14 | 5.97 | 8.72 | 7.13 | 48.79 | 23.11 | 31.36 | 31.53 | 18.32 | 10.55 | 13.39 | 12.55 |
| Hyper-multiple [12] | 15.02 | 6.75 | 9.32 | 7.97 | 63.08 | 19.10 | 29.32 | 22.24 | 23.38 | 8.67 | 12.65 | 9.50 |
| AVCA [25] | 14.90 | 4.00 | 6.31 | 6.00 | 51.53 | 18.43 | 27.15 | 20.01 | 24.86 | 8.02 | 12.13 | 9.13 |
| OOD-entropy+AVCA [35] | 13.31 | 7.01 | 9.19 | 7.48 | 63.94 | 26.99 | 37.96 | 30.56 | 29.84 | 9.54 | 14.46 | 11.41 |
| EZ-OOD+AVCA (Ours) | 24.94 | 6.38 | 10.16 | 7.48 | 79.71 | 27.94 | 41.38 | 30.56 | 30.65 | 9.29 | 14.26 | 11.41 |

Table 2: Compatibility of EZ-OOD with existing method. We compare existing state-of-the-art AV-GZSL methods with our new model on the VGGSound-GZSL$^{main}$, UCF-GZSL$^{main}$, and ActivityNet-GZSL$^{main}$ datasets (column groups VGGSound, UCF, and ActivityNet). GZSL ($acc_{\boldsymbol{S}}$/$acc_{\boldsymbol{U}}$/$\boldsymbol{H}$) and ZSL ($acc_{ZSL}$) performances are reported in percentage. Bold numbers denote the best results and the second highest values are underlined.

4 Experiments and Results Analysis

4.1 Setup for Audio-visual GZSL

4.1.1 Datasets and Evaluation Metrics

Following the experimental setting of AVCA [25], we adopt curated versions of three audio-visual datasets, VGGSound [5], UCF101 [30], and ActivityNet [3], to evaluate our EZ-AVOOD method: VGGSound-GZSL$^{cls}$, UCF-GZSL$^{cls}$, and ActivityNet-GZSL$^{cls}$. The superscript $cls$ denotes the $cls$-$split$ introduced in [24], which is used here instead of the $main$-$split$ utilized in [25].

Consistent with ZSL conventions [36, 25], we adopt the average per-class classification accuracy as the evaluation metric, where $acc_{\boldsymbol{S}}$ and $acc_{\boldsymbol{U}}$ denote the mean class accuracy of seen classes and unseen classes, respectively. To comprehensively evaluate GZSL performance, the harmonic mean $\boldsymbol{H}$ of seen and unseen accuracy is calculated as:

$$\boldsymbol{H}=\frac{2*acc_{\boldsymbol{S}}*acc_{\boldsymbol{U}}}{acc_{\boldsymbol{S}}+acc_{\boldsymbol{U}}}~. \tag{16}$$

For ZSL tasks, which aim to classify unseen samples only, the mean class accuracy $acc_{ZSL}$ is also reported.
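The metrics above can be sketched as two small helpers. The function names are illustrative assumptions; the per-class averaging follows the description in the text and the harmonic mean follows Eq. (16).

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, classes):
    """Average of per-class accuracies over the given classes."""
    per_class = []
    for c in classes:
        mask = y_true == c
        if mask.any():  # skip classes that have no test samples
            per_class.append((y_pred[mask] == c).mean())
    return float(np.mean(per_class))

def harmonic_mean(acc_s, acc_u):
    """Eq. (16): harmonic mean of seen and unseen accuracy."""
    if acc_s + acc_u == 0:
        return 0.0
    return 2 * acc_s * acc_u / (acc_s + acc_u)
```

As a sanity check against Table 2, `harmonic_mean(24.94, 6.38)` evaluates to roughly 10.16, the $\boldsymbol{H}$ reported for EZ-OOD+AVCA on VGGSound-GZSL$^{main}$.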

4.1.2 Implementation Details

OOD Detector  The trained seen classifier is transformed into our OOD detector to produce the logits and the energy score. For the residual score, the dimension $N$ of the principal subspace and the scaling factor $\gamma$ are set to 64/90 for VGGSound-GZSL$^{cls}$, 256/205 for UCF-GZSL$^{cls}$, and 256/285 for ActivityNet-GZSL$^{cls}$.
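To make the roles of $N$ and $\gamma$ concrete, the following NumPy sketch shows one way a combined score could be computed from a trained classifier's logits and features. It is an illustration under stated assumptions, not the exact EZ-OOD definition: it assumes the common log-sum-exp form of the energy score, a residual measured as the norm of the feature component outside an $N$-dimensional PCA subspace of the (mean-centered) training features, and a combination in which $\gamma$ multiplies the residual with a negative sign. All function names are hypothetical.

```python
import numpy as np

def fit_principal_subspace(train_feats, n_dims):
    """Return (mean, basis); basis has shape (d, n_dims) with orthonormal columns."""
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_dims].T

def energy_score(logits):
    """Class-specific score: log-sum-exp of the logits (assumed form)."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

def residual_score(feats, mean, basis):
    """Class-agnostic score: norm of the part of the feature outside the subspace."""
    centered = feats - mean
    projected = centered @ basis @ basis.T
    return np.linalg.norm(centered - projected, axis=1)

def combined_score(logits, feats, mean, basis, gamma):
    # Assumed convention: a higher value means "more likely a seen-class sample".
    return energy_score(logits) - gamma * residual_score(feats, mean, basis)
```

Under this convention, samples from seen classes tend to have high energy scores and small residuals, so both terms push the combined score in the same direction.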

More details about Feature Extractor, Seen Classifier, and Unseen Classifier are provided in supplementary material.

4.2 Experimental Results

4.2.1 Quantitative Results

As shown in Table 1, our EZ-AVOOD model consistently takes the lead on all three benchmarks in terms of both the harmonic mean $\boldsymbol{H}$ for the GZSL task and $acc_{ZSL}$ under the ZSL setup. On the VGGSound-GZSL$^{cls}$ dataset, we achieve the best performance on all metrics; in particular, EZ-AVOOD considerably outperforms the current state-of-the-art ClipClap with leads of 9.65%@$acc_{\boldsymbol{S}}$, 0.72%@$acc_{\boldsymbol{U}}$, 2.03%@$\boldsymbol{H}$, and 1.75%@$acc_{ZSL}$. In addition, our method substantially overtakes ClipClap on the UCF-GZSL$^{cls}$ benchmark with lead margins of 6.39%@$acc_{\boldsymbol{S}}$, 4.10%@$acc_{\boldsymbol{U}}$, 5.00%@$\boldsymbol{H}$, and 3.96%@$acc_{ZSL}$, respectively.
Although the proposed EZ-AVOOD “merely” takes second place on $acc_{\boldsymbol{S}}$ and $acc_{\boldsymbol{U}}$ of ActivityNet-GZSL$^{cls}$, where ClipClap and Hyper-multiple rank first respectively, our method holds first place on the more comprehensive metric $\boldsymbol{H}$ and attains an $acc_{ZSL}$ better than all other baseline methods, with a lead margin of at least 2.44%.

4.3 Compatibility of EZ-OOD with Existing Method

4.3.1 Experimental Setup

Here, we replace the unseen classifier of EZ-AVOOD with AVCA [25] and evaluate the resulting model. The key details of this experiment are as follows: the $main$-$split$ of the three datasets is adopted; audio and visual features are extracted by the self-supervised SeLaVi [2] pre-trained on the VGGSound dataset; and text embeddings are obtained using the word2vec model [26] pre-trained on Wikipedia.
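Swapping the unseen classifier is straightforward because the OOD detector only decides which expert handles each sample. The sketch below illustrates this routing; the thresholding rule, the function name, and the callable interface of the two experts are illustrative assumptions rather than details taken from the method.

```python
import numpy as np

def route_and_classify(ood_scores, threshold, seen_expert, unseen_expert, samples):
    """Send each sample to the seen or unseen expert based on its OOD score.

    ood_scores: (n,) scores where a higher value means "more likely seen" (assumed)
    threshold:  hypothetical decision threshold separating seen from unseen
    seen_expert / unseen_expert: callables mapping a batch of samples to labels
    """
    is_seen = ood_scores >= threshold
    preds = np.empty(len(samples), dtype=int)
    if is_seen.any():
        preds[is_seen] = seen_expert(samples[is_seen])
    if (~is_seen).any():
        # Any unseen classifier (for example AVCA) can be plugged in here.
        preds[~is_seen] = unseen_expert(samples[~is_seen])
    return preds, is_seen
```

Because the two experts are independent callables, replacing one of them leaves the detector and the other expert untouched.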

4.3.2 Baseline Method

AV-OOD [35] is another OOD-based method that takes the AVCA model as the unseen expert and proposes the OOD-entropy method for OOD detection, which makes it well suited for comparison with our method. To ensure a fair comparison, we only train the seen classifier (which also works as the OOD detector) and utilize the same unseen classifier as the AV-OOD method. Notably, we re-run AV-OOD with the provided checkpoint files to obtain its ZSL and GZSL performances. As can be seen in Table 2, the $acc_{ZSL}$ results of our method and OOD-entropy are identical on the three benchmarks, since both use the same unseen classifier.

| Methods | VGGSound AUROC $\uparrow$ | VGGSound FPR95 $\downarrow$ | VGGSound AUPR $\uparrow$ | VGGSound $\boldsymbol{H}$ $\uparrow$ | UCF AUROC $\uparrow$ | UCF FPR95 $\downarrow$ | UCF AUPR $\uparrow$ | UCF $\boldsymbol{H}$ $\uparrow$ | ActivityNet AUROC $\uparrow$ | ActivityNet FPR95 $\downarrow$ | ActivityNet AUPR $\uparrow$ | ActivityNet $\boldsymbol{H}$ $\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Residual Score | 68.73 | 85.74 | 85.51 | 14.91 | 91.30 | 42.09 | 90.96 | 58.65 | 67.30 | 85.28 | 44.73 | 25.34 |
| Energy Score | 81.97 | 67.69 | 92.65 | 17.84 | 80.41 | 74.09 | 86.55 | 47.57 | 70.99 | 86.95 | 54.57 | 26.13 |
| EZ-OOD (full) | 84.33 | 66.24 | 93.82 | 18.21 | 95.35 | 33.87 | 96.01 | 60.97 | 77.57 | 80.61 | 63.62 | 27.95 |

Table 3: Ablation studies on EZ-OOD method. We compare Energy Score, Residual Score, and their $\gamma$-weighted sum, the full EZ-OOD, on the VGGSound-GZSL$^{cls}$, UCF-GZSL$^{cls}$, and ActivityNet-GZSL$^{cls}$ datasets (column groups VGGSound, UCF, and ActivityNet). Out-of-distribution detection metrics (AUROC/FPR95/AUPR) and GZSL (harmonic mean $\boldsymbol{H}$) performance are reported in percentage. $\downarrow$ indicates that lower results are better while $\uparrow$ means the opposite. Bold values denote the best results and the second-best outcomes are underlined.
Figure 4: ROC curves of EZ-OOD, Energy Score, and Residual Score on three datasets. Evidently, the full EZ-OOD consistently outperforms Energy Score and Residual Score with larger AUROC metric.

4.3.3 Quantitative Results

First, compared with the baseline method AVCA, the new model secures a lead on all metrics of the three datasets. More precisely, the greatest lead margins amount to 28.18%@$acc_{\boldsymbol{S}}$, 9.51%@$acc_{\boldsymbol{U}}$, 14.23%@$\boldsymbol{H}$, and 10.55%@$acc_{ZSL}$ on the UCF-GZSL$^{main}$ dataset. In addition, our new model significantly improves the ZSL and GZSL performances over the baseline on the other two benchmarks, which verifies the compatibility of the proposed EZ-OOD method.

Secondly, compared with OOD-entropy+AVCA [35], our EZ-OOD, leveraging the same unseen classifier, attains a clear lead on the harmonic mean of both the VGGSound-GZSL$^{main}$ and UCF-GZSL$^{main}$ datasets (0.97% and 3.42%, respectively), and lags behind by merely 0.20%@$\boldsymbol{H}$ on the ActivityNet-GZSL$^{main}$ benchmark. Notably, the OOD detection performance of our method on the ActivityNet benchmark is actually ahead of OOD-entropy, as illustrated in the supplementary material. AV-GZSL evaluates the average per-class classification accuracy, whereas OOD detection simply considers the ID-OOD separation of all test samples regardless of class labels; as a result, a better but close OOD detection performance does not always bring stronger GZSL performance.

4.4 Ablation Studies

To gain insight into the concrete effect of Energy Score and Residual Score, we conduct additional ablation studies that compare the OOD detection performance and AV-GZSL results of EZ-OOD and its two key components. The experimental setup is consistent with that of the proposed EZ-AVOOD model in Section 4.2. In Table 3 we report the AUROC (Area Under the Receiver Operating Characteristic curve), FPR95 (FPR@TPR95), and AUPR (Area Under the Precision versus Recall curve) to evaluate OOD detection capability, as well as the harmonic mean $\boldsymbol{H}$ of the GZSL task on the three datasets. We also draw the ROC curves of the three methods in Figure 4 to explicitly display their OOD detection performance on each benchmark.
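Two of these detection metrics can be computed directly from the scores of the two groups. The sketch below is an illustration only: it assumes that seen (in-distribution) samples form the positive class and that higher scores indicate seen samples, and the function names are hypothetical. The pairwise AUROC computation uses memory proportional to the product of the two group sizes, so it is intended for small examples.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties contribute one half, which matches the area under the ROC curve.
    """
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float((id_s > ood_s).mean() + 0.5 * (id_s == ood_s).mean())

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when about 95% of ID samples are accepted."""
    # Threshold at the 5th percentile of ID scores keeps roughly 95% of them above it.
    threshold = np.percentile(id_scores, 5)
    return float((np.asarray(ood_scores, dtype=float) >= threshold).mean())
```

A perfectly separating score gives an AUROC of 1.0 and an FPR95 of 0.0, while a score that carries no information gives an AUROC of about 0.5.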

Figure 5: Effect of scaling factor $\gamma$ on AUROC for three datasets. The OOD detection performance of EZ-OOD peaks when the energy score and the residual score are properly balanced in the linear combination by a suitable $\gamma$.

4.4.1 Quantitative Results and Qualitative Results

According to the results in Table 3, the full EZ-OOD takes first place on all three OOD detection metrics, with the highest AUROC and AUPR and the lowest FPR95, and accordingly achieves the best $\boldsymbol{H}$ for GZSL. As for the two components, Energy Score ranks second on the VGGSound-GZSL$^{cls}$ and ActivityNet-GZSL$^{cls}$ datasets in terms of AUROC, AUPR, and $\boldsymbol{H}$, while Residual Score attains a remarkable lead over Energy Score on the UCF-GZSL$^{cls}$ benchmark. Moreover, we observe that the harmonic mean $\boldsymbol{H}$ produced by Energy Score@VGGSound-GZSL$^{cls}$ (17.84%) and Residual Score@UCF-GZSL$^{cls}$ (58.65%) surpasses all the contrasting methods in Table 1. Therefore, we conclude that both Energy Score and Residual Score play a vital role in separating seen and unseen samples to facilitate the subsequent ZSL classification objectives.

Figure 4 depicts the AUROC discrepancy between the EZ-OOD score and its two crucial components on the three benchmarks. In addition, the two individual OOD scores show competitive detection performance, with ROC curves close to the upper-left corner of the graph. To sum up, the proposed OOD score effectively combines the strengths of the two scores through a weighted sum to achieve stronger OOD detection performance and higher GZSL classification accuracy.

4.5 Parameter Sensitivity Studies

4.5.1 Effect of Scaling Factor $\gamma$

Here we test the scaling factor $\gamma$ from the set $\gamma\in\{0.1,1,10,100,250,500,1000\}$ with a fixed $N$ specific to each dataset. Typically, better OOD detection performance brings a higher $\boldsymbol{H}$ for the GZSL task; hence, AUROC is evaluated instead of $\boldsymbol{H}$ to avoid extra computational burden. We follow the same experimental setup as in Section 4.2. Figure 5 illustrates the AUROC on the three datasets with different scaling factors $\gamma$. When the scaling factor is set to 0.1 or 1000, the linearly combined EZ-OOD score effectively reduces to the plain energy score or the individual residual score, respectively, resulting in lower AUROC at both ends of the curve, which is consistent with the outcomes of the ablation studies. Moreover, our method is capable of effectively integrating the discriminative information held by the class-specific energy score and the class-agnostic residual score under a wide range of scaling factors to achieve enhanced OOD detection performance and better audio-visual GZSL results than the individual scores.
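The behaviour at the two ends of the curve can be made explicit under one assumed form of the combined score. Writing $E(x)$ for the energy score and $R(x)$ for the residual score, suppose, purely as an illustration (the sign convention and the term that $\gamma$ multiplies are assumptions here), that

$$S_{\gamma}(x)=E(x)-\gamma R(x)~.$$

Then

$$\lim_{\gamma\to 0}S_{\gamma}(x)=E(x)~,\qquad \lim_{\gamma\to\infty}\frac{S_{\gamma}(x)}{\gamma}=-R(x)~.$$

AUROC depends only on how the samples are ranked, and dividing a score by a positive constant leaves the ranking unchanged. Under this assumed form, a very small $\gamma$ therefore yields the AUROC of the energy score alone, and a very large $\gamma$ approaches the AUROC of the residual score alone, while intermediate values let both terms influence the ranking.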

Figure 6: Effect of principal subspace dimension $N$ on AUROC for three datasets. The OOD detection performance of the EZ-OOD method is quite robust to the principal subspace dimension, since the AUROC changes little over a wide range of $N$ values.

4.5.2 Effect of Principal Subspace Dimension $N$

Different values of $N$ directly influence the OOD detection performance of the residual score, which in turn changes the EZ-OOD score and finally the overall audio-visual GZSL results. Since the concatenated audio-visual feature is 1536-d, we test $N\in\{32,64,128,256,384,512,768\}$ together with a fixed $\gamma$ for each benchmark, and report the AUROC as the evaluation metric. As depicted in Figure 6, the fitted curves peak at $N=64$, $N=256$, and $N=256$ for the VGGSound-GZSL$^{cls}$, UCF-GZSL$^{cls}$, and ActivityNet-GZSL$^{cls}$ benchmarks, respectively, and are generally “flat”, which indicates that the proposed EZ-OOD method is not very sensitive to the dimension $N$. As a result, this hyperparameter can be selected from a wide range of values with little influence on the final GZSL performance, which validates the robustness of our method.

5 Conclusion

In this paper, we propose EZ-AVOOD, an extremely simple OOD detection based model for Audio-Visual Generalized Zero-Shot Learning (AV-GZSL), which integrates the discriminative information held by class-specific logits and the class-agnostic feature subspace. Superior experimental results on three audio-visual datasets demonstrate the effectiveness of our model. Moreover, the compatibility of the proposed OOD detection method EZ-OOD is verified by deploying a different unseen classifier to construct a new model that outperforms the contrasting methods on both OOD detection performance and GZSL classification accuracy. Therefore, we conclude that EZ-AVOOD sets a new state of the art for AV-GZSL.

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223. PMLR, 2017.
  • Asano et al. [2020] Yuki Asano, Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. Labelling unlabelled videos from scratch with multi-modal self-supervision. NeurIPS, 33:4660–4671, 2020.
  • Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.
  • Chao et al. [2016] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52–68. Springer, 2016.
  • Chen et al. [2020] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP, pages 721–725. IEEE, 2020.
  • Chen et al. [2023a] Yiye Chen, Yunzhi Lin, Ruinian Xu, and Patricio A Vela. Wdiscood: Out-of-distribution detection via whitened linear discriminant analysis. In ICCV, pages 5298–5307, 2023a.
  • Chen et al. [2023b] Zailong Chen, Lei Wang, Peng Wang, and Peng Gao. Question-aware global-local video understanding network for audio-visual question answering. IEEE TCSVT, 2023b.
  • Djurisic et al. [2022] Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. arXiv preprint arXiv:2209.09858, 2022.
  • Dong et al. [2022] Guan-Nan Dong, Chi-Man Pun, and Zheng Zhang. Temporal relation inference network for multimodal speech emotion recognition. IEEE TCSVT, 32:6472–6485, 2022.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. NeurIPS, 30, 2017.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • Hong et al. [2023] Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. Hyperbolic audio-visual zero-shot learning. In ICCV, pages 7873–7883, 2023.
  • Huang and Li [2021] Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In CVPR, pages 8710–8719, 2021.
  • Kurzendörfer et al. [2024] David Kurzendörfer, Otniel-Bogdan Mercea, A Koepke, and Zeynep Akata. Audio-visual generalized zero-shot learning using pre-trained large multi-modal models. In CVPR, pages 2627–2638, 2024.
  • Li et al. [2023a] Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Hengyu Man, and Xiaopeng Fan. Modality-fusion spiking transformer network for audio-visual zero-shot learning. In ICME, pages 426–431. IEEE, 2023a.
  • Li et al. [2023b] Wenrui Li, Xi-Le Zhao, Zhengyu Ma, Xingtao Wang, Xiaopeng Fan, and Yonghong Tian. Motion-decoupled spiking transformer for audio-visual zero-shot learning. In ACM Multimedia, pages 3994–4002, 2023b.
  • Li et al. [2023c] Yapeng Li, Yong Luo, and Bo Du. Audio-visual generalized zero-shot learning based on variational information bottleneck. In ICME, pages 450–455. IEEE, 2023c.
  • Liang et al. [2017] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
  • Lin et al. [2021] Ziqian Lin, Sreya Dutta Roy, and Yixuan Li. Mood: Multi-level out-of-distribution detection. In CVPR, pages 15313–15323, 2021.
  • Liu et al. [2020] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. NeurIPS, 33:21464–21475, 2020.
  • Mazumder et al. [2021] Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, and Vinay P Namboodiri. Avgzslnet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings. In WACV, pages 3090–3099, 2021.
  • Mei et al. [2022] Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D Plumbley, and Wenwu Wang. Diverse audio captioning via adversarial training. In ICASSP, pages 8882–8886. IEEE, 2022.
  • Mei et al. [2024] Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
  • Mercea et al. [2022a] Otniel-Bogdan Mercea, Thomas Hummel, A Sophia Koepke, and Zeynep Akata. Temporal and cross-modal attention for audio-visual zero-shot learning. In ECCV, pages 488–505. Springer, 2022a.
  • Mercea et al. [2022b] Otniel-Bogdan Mercea, Lukas Riesch, A Koepke, and Zeynep Akata. Audio-visual generalised zero-shot learning with cross-modal attention and language. In CVPR, pages 10553–10563, 2022b.
  • Mikolov et al. [2017] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405, 2017.
  • Mo and Morgado [2025] Shentong Mo and Pedro Morgado. Audio-visual generalized zero-shot learning the easy way. In ECCV, pages 377–395. Springer, 2025.
  • Parida et al. [2020] Kranti Parida, Neeraj Matiyali, Tanaya Guha, and Gaurav Sharma. Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos. In WACV, pages 3251–3260, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • Soomro [2012] K Soomro. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Sperber et al. [2019] Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313–325, 2019.
  • Sun et al. [2022] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In ICML, pages 20827–20840. PMLR, 2022.
  • Vaswani [2017] A Vaswani. Attention is all you need. NeurIPS, 2017.
  • Wang et al. [2023] Lanxiao Wang, Heqian Qiu, Benliu Qiu, Fanman Meng, Qingbo Wu, and Hongliang Li. Tridentcap: Image-fact-style trident semantic framework for stylized image captioning. IEEE TCSVT, 2023.
  • Wen [2024] Liuyuan Wen. Out-of-distribution detection for audio-visual generalized zero-shot learning: A general framework. arXiv preprint arXiv:2408.01284, 2024.
  • Xian et al. [2018] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE TPAMI, 41(9):2251–2265, 2018.
  • Xie et al. [2019] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
  • Ye et al. [2016] Qiaolin Ye, Jian Yang, Fan Liu, Chunxia Zhao, Ning Ye, and Tongming Yin. L1-norm distance linear discriminant analysis based on an effective iterative algorithm. IEEE TCSVT, 28(1):114–129, 2016.
  • Zhang and Lu [2018] Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In ECCV, pages 686–701, 2018.
  • Zheng et al. [2023] Qichen Zheng, Jie Hong, and Moshiur Farazi. A generative approach to audio-visual generalized zero-shot learning: Combining contrastive and discriminative techniques. In IJCNN, pages 1–8. IEEE, 2023.