Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning

Yang Liu¹, Xun Zhang¹, Jiale Du¹, Xinbo Gao¹,², Jungong Han³
¹Xidian University, Xi’an, China
²Chongqing University of Posts and Telecommunications, Chongqing, China
³Tsinghua University, Beijing, China
yangl@xidian.edu.cn, xunz724@gmail.com, 23011211070@stu.xidian.edu.cn,
xbgao@mail.xidian.edu.cn, jungonghan77@gmail.com

Abstract

Zero-shot Learning (ZSL) transfers knowledge from seen classes to unseen classes by exploiting auxiliary category information, which makes it a promising yet difficult research topic. Within this field, Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) has attracted considerable interest: the intricate relations among its three modalities (audio, video, and natural language) make the task challenging but well worth studying. However, both embedding-based and generative AV-GZSL methods suffer heavily from the domain shift problem. We therefore propose an extremely simple Out-of-distribution (OOD) detection based AV-GZSL method (EZ-AVOOD) that mitigates this bias by separating seen and unseen samples at the very beginning of the pipeline. EZ-AVOOD achieves effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and a class-agnostic feature subspace, without training an extra OOD detector network. After the seen-unseen binary classification, two expert models classify seen samples and unseen samples separately. Compared with existing state-of-the-art methods, our model achieves superior ZSL and GZSL performance on three audio-visual datasets, setting a new state of the art and demonstrating the effectiveness of the proposed EZ-AVOOD.

1 Introduction

Multimodal learning has become one of the most active research topics, since vision, audio, and language are the most common signal sources in the real world. It covers vision-language tasks such as visual question answering [7], image captioning [34], cross-modal retrieval [39], and visual entailment [37], as well as audio-language tasks such as audio captioning [22], emotion recognition [9], and speech translation [31]. However, collecting and labeling large quantities of videos or pictures, audio signals, and natural language corpora for these tasks is an overwhelming burden. Zero-shot Learning (ZSL) emerges as an attainable way to avoid this data collection: it mines auxiliary information such as semantic attributes and word embeddings to achieve knowledge transfer without access to unseen samples. For video-audio multimodal tasks, AV-GZSL, which relies on the fusion of audio and visual inputs together with natural language descriptions, is the methodology of choice for classification, retrieval, and other problems.

Figure 1: Harmonic mean (%) evaluating the GZSL performance of our EZ-AVOOD model and the comparison methods on three datasets. EZ-AVOOD (the red bar) consistently outperforms the other methods, with a lead margin of up to 5% on the UCF-GZSL benchmark.

CJME [28] first applied GZSL to audio-visual features. AVGZSLNet [21] leverages late fusion to integrate information from the two modalities. Most subsequent works have focused on efficient fusion of audio-visual feature representations: AVCA [25] introduces a cross-attention [33] block, which is inherited by [24, 12, 27]. Additionally, the generative approach AVFS [40] facilitates contrastive learning with synthesized negative unseen samples. However, the domain shift problem, with its bias towards seen classes, is unavoidable in ZSL, and AV-GZSL behaves the same way, especially for embedding-based approaches. Generative solutions aim to reduce the bias with synthesized unseen samples, while Calibrated Stacking [4] searches for extra hyper-parameters to suppress the model’s tendency towards seen samples. Considering that generative models are unstable to train and that the effect of Calibrated Stacking is limited, in this paper we propose an extremely simple OOD detection based AV-GZSL method named EZ-AVOOD to alleviate the bias problem explicitly. Figure 1 shows that our model outperforms all the compared methods on three audio-visual benchmarks.

Typically, OOD approaches in ZSL share a framework of three parts: an OOD detector, a seen classifier, and an unseen classifier, each of which plays a vital role in the final GZSL performance. These three components work largely independently, so each can be replaced when a stronger component is developed. Previous work uses a generative method such as WGAN-GP [1, 10] to first synthesize unseen sample features and then jointly trains a completely new OOD detector with seen samples and synthesized unseen samples [35]. In contrast, the proposed EZ-AVOOD model accomplishes OOD detection without training an OOD detector, relying instead on the class-specific logits produced by the supervised seen classifier and the class-agnostic information hidden in the feature subspace. More specifically, our method only needs to train the seen classifier, a single MLP (Multilayer Perceptron), and fine-tune an existing embedding-based method [14] as the unseen expert classifier, which considerably reduces the complexity of the entire model. Moreover, the unseen classifier of EZ-AVOOD can be replaced by any stronger AV-GZSL method to achieve higher overall GZSL performance. The main contributions of this paper can be summarized in the following three aspects:

• We propose an extremely simple OOD detection based model, EZ-AVOOD, to address the AV-GZSL problem with an OOD score derived from class-specific logits and a class-agnostic feature subspace instead of training a completely new OOD detector.

• We comprehensively demonstrate the effectiveness of the EZ-AVOOD model through ZSL and GZSL experiments on three different audio-visual benchmarks and observe a substantial improvement over current state-of-the-art methods.

• The proposed OOD detection method, EZ-OOD, is highly compatible with existing AV-GZSL approaches, which indicates that future researchers could improve GZSL performance simply by replacing the unseen expert with a more powerful substitute.

2 Related Work

2.1 Audio-visual Generalized Zero-shot Learning

Embedding approaches, which map videos, audio signals, and text information into a shared space to obtain joint feature representations for subsequent classification or retrieval tasks, are widely popular in AV-GZSL. Among them, the early CJME [28] employs a triplet loss to constrain the distance between audio-visual features and class embeddings on the proposed AudioSetZSL [28] dataset. More recently, TCaF [24] proposes a temporal cross-attention framework that extends AVCA [25] by factoring in temporal information. The Hyperbolic [12] method employs a hyperbolic alignment loss and a cross-attention module to further improve the separability of the joint feature representations. EZ-AVGZL [27] utilizes class embedding optimization to make class embeddings more discriminative while maintaining their original semantics. ClipClap [14] enhances the feature quality with pre-trained CLIP [29] and CLAP [23] models. As for generative methods, AVFS [40] (Audio-Visual Feature Synthesis) generates unseen samples to facilitate contrastive training.

2.2 Out-of-distribution Detection and Post-hoc Methods

Out-of-distribution (OOD) detection is proposed to secure the smooth deployment of machine learning models in real-world scenarios, since these models are typically trained and validated in closed-world settings. OOD detection relies on OOD scores (scalars) computed for ID (in-distribution) and OOD samples, to which a threshold is applied to separate ID from OOD.

Post-hoc methods for OOD detection typically train a model or classifier on ID data first; the model, with frozen parameters, is then converted into an OOD detector at test time, which is cost-effective compared with training an extra detector. Post-hoc methods compute an OOD score from the pre-trained model’s output. One simple solution, MSP [11], uses the maximum softmax probability as the OOD score to distinguish ID and OOD samples. To improve the discriminability of the softmax score, ODIN [18] employs temperature scaling, which makes the softmax score distribution more uniform, and also applies input perturbation. The logits-based Energy score [20, 19] applies the LogSumExp function to the output logits. Feature-based methods [8, 32, 13] such as WDiscOOD [6] leverage LDA (linear discriminant analysis) [38] to enlarge the inter-class discrepancy and reduce the intra-class gap for better ID-OOD separation.

3 Proposed Approach

3.1 Problem Statement of Audio-visual GZSL

AV-GZSL aims to recognize both previously seen and unseen video-audio combinations, given a set of seen (training) audio-visual events together with human-readable text descriptions. We denote the samples from seen classes by $\boldsymbol{S}=(\boldsymbol{a}_{i}^{s},\boldsymbol{v}_{i}^{s},\boldsymbol{t}_{i}^{s},y_{i}^{s})_{i\in\{1,\cdots,M\}}$ , where each of the $M$ seen samples consists of audio features $\boldsymbol{a}_{i}^{s}$ , visual features $\boldsymbol{v}_{i}^{s}$ , a textual description $\boldsymbol{t}_{i}^{s}$ , and the corresponding ground-truth class label $y_{i}^{s}$ . Likewise, the unseen dataset is denoted by $\boldsymbol{U}=(\boldsymbol{a}_{j}^{u},\boldsymbol{v}_{j}^{u},\boldsymbol{t}_{j}^{u},y_{j}^{u})_{j\in\{1,\cdots,K\}}$ with $K$ samples, and notably ${\boldsymbol{S}}\cap{\boldsymbol{U}}=\varnothing$ . The number of seen classes is $C_{s}$ : ${\boldsymbol{Y}^{s}}={({y_{1}^{s}},\cdots,{y}_{M}^{s})}\in{\{{1,\cdots,C_{s}}\}}$ , and the number of unseen classes is $C_{u}$ : ${\boldsymbol{Y}^{u}}={({y_{1}^{u}},\cdots,{y}_{K}^{u})}\in{\{{1,\cdots,C_{u}}\}}$ . Hence, for the ZSL task, AV-GZSL learns a model $f_{ZSL}:\boldsymbol{X}\rightarrow{{{\boldsymbol{Y}}^{u}}}$ to classify unseen samples only, where $\boldsymbol{X}=(\boldsymbol{a}_{z},\boldsymbol{v}_{z})$ denotes the test dataset; in GZSL, the classifier $f_{GZSL}:\boldsymbol{X}\rightarrow{\boldsymbol{Y}^{u}\cup\boldsymbol{Y}^{s}}$ aims to classify both seen and unseen examples.

3.2 Model Architecture

The proposed model is depicted in Figure 2 and there are three essential components in our model: the OOD detector, the supervised seen classifier, and the unseen classifier adapted from an existing embedding method. The greatest strength of our method is that there is no need to train a new OOD detector, since our OOD detector shares the same parameters with the seen classifier, which means the supervised classifier trained with seen samples also serves as the OOD detector. Therefore, EZ-AVOOD significantly reduces the complexity of the model and brings a considerable decrease in the computational overhead and training time.
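The routing implied by this three-component design can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `gzsl_predict`, the callables `ood_score`, `seen_expert`, and `unseen_expert`, and the toy values are hypothetical stand-ins, while the scalar score and threshold correspond to the quantities defined in Section 3.2.2.

```python
import numpy as np

def gzsl_predict(x, ood_score, threshold, seen_expert, unseen_expert):
    """Route each sample to the seen or unseen expert based on its OOD score."""
    scores = ood_score(x)                 # higher score = more likely a seen sample
    is_seen = scores >= threshold
    preds = np.empty(len(x), dtype=int)
    if is_seen.any():
        preds[is_seen] = seen_expert(x[is_seen])
    if (~is_seen).any():
        preds[~is_seen] = unseen_expert(x[~is_seen])
    return preds

# Toy stand-ins (assumptions for illustration only).
x = np.array([[3.0], [-1.0], [2.0]])
preds = gzsl_predict(
    x,
    ood_score=lambda f: f[:, 0],                          # hypothetical score function
    threshold=0.0,
    seen_expert=lambda f: np.zeros(len(f), dtype=int),    # always predicts class 0
    unseen_expert=lambda f: np.full(len(f), 100),         # always predicts class 100
)
```

Because the experts are passed in as plain callables, either one can be swapped without touching the detector, which mirrors the substitutability the text describes.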

3.2.1 Seen Classifier

To handle seen samples, a plain and efficient 3-layer MLP optimized with the Cross Entropy Loss $\boldsymbol{\mathcal{L}}_{xent}=\textbf{CrossEntropy}(\boldsymbol{x},y(\boldsymbol{x}))$ is adopted as the seen expert classifier, where $\boldsymbol{x}$ denotes the joint audio-visual features from seen classes constructed through simple concatenation: $\boldsymbol{x}=\boldsymbol{a}\oplus\boldsymbol{v}$ . Moreover, once the seen classifier is trained, it can be used as the OOD detector; the details are elaborated in the next part.

3.2.2 Out-of-distribution Detector

We adopt the post-hoc idea to design an OOD algorithm for seen-unseen separation in the AV-GZSL problem. The output of a pre-trained model usually includes high-dimensional features, logits, or Softmax probabilities; we choose to exploit the intrinsic information held in the class-specific logits and the feature representation to construct the OOD score. Consequently, the trained seen classifier becomes the “OOD detector” in our method and outputs class-dependent logits. The proposed extremely simple OOD detection method is named “EZ-OOD”, and its formulation pipeline is illustrated in Figure 3. The EZ-OOD score consists of an Energy Score calculated from the logits and a Residual Score derived from the residual subspace.

Class-specific Logits and Energy Score. The seen classifier takes the high-dimensional fused audio-visual features as input and produces the logits corresponding to the seen class labels. We adopt the widely used LogSumExp energy function $\boldsymbol{E(x;\,l)}$ as one vital part of our OOD score, mapping the logits $\boldsymbol{l}$ of sample $\boldsymbol{x}$ to a scalar:

$$\boldsymbol{E}(\boldsymbol{x};\,\boldsymbol{l})=-\log\sum_{i=1}^{C}e^{l_{i}(\boldsymbol{x})}, \tag{1}$$

where $C$ is the number of seen classes and $l_{i}(\boldsymbol{x})$ is the logit of class $i$ for sample $\boldsymbol{x}$ . In practice we use the negated value $\boldsymbol{E}(\boldsymbol{x})=-\boldsymbol{E}(\boldsymbol{x};\,\boldsymbol{l})$ as the score, so that seen samples produce higher scores, consistent with the convention in OOD detection.
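As a small numerical illustration of the negated energy $\boldsymbol{E}(\boldsymbol{x})=\log\sum_{i}e^{l_{i}(\boldsymbol{x})}$, the sketch below computes it with NumPy. The function name `energy_score` and the toy logits are assumptions made for this example; the max-subtraction is a standard numerical-stability trick and is not something the text specifies.

```python
import numpy as np

def energy_score(logits: np.ndarray) -> np.ndarray:
    """Return E(x) = log(sum_i exp(l_i(x))) for each row of `logits`."""
    logits = np.atleast_2d(logits)
    m = logits.max(axis=1, keepdims=True)          # subtract the max to avoid overflow
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

# Toy logits (assumed values): one confident prediction, one flat prediction.
confident = np.array([10.0, 0.0, 0.0])
uncertain = np.array([1.0, 1.0, 1.0])
scores = energy_score(np.stack([confident, uncertain]))
```

The confident row receives the larger score, matching the convention that seen (ID) samples should score higher.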

Feature Representation and Class-agnostic Residual Score. Here $\boldsymbol{x}\in{\mathbb{R}}^{D}$ is the $D$ -dimensional fused bi-modal sample feature and $\boldsymbol{X}$ denotes the audio-visual feature matrix of all seen samples. The principal subspace ${P}$ is defined as the $N$ -dimensional space spanned by the eigenvectors corresponding to the top- $N$ eigenvalues of the matrix $\boldsymbol{X}^{T}\boldsymbol{X}$ , and the residual subspace ${P}^{\perp}$ is the orthogonal complement of ${P}$ . Thus we have $\boldsymbol{x}=\boldsymbol{x}^{P}+\boldsymbol{x}^{{P}^{\perp}}$ , where $\boldsymbol{x}^{P}$ is the projection of the sample feature $\boldsymbol{x}$ onto the subspace $P$ and $\boldsymbol{x}^{{P}^{\perp}}$ is its projection onto ${P}^{\perp}$ . Suppose the eigen-decomposition of the matrix $\boldsymbol{X}^{T}\boldsymbol{X}$ is

$$\boldsymbol{X}^{T}\boldsymbol{X}=\boldsymbol{W{\Lambda}W}^{T}, \tag{2}$$

where $\boldsymbol{W}$ refers to a set of orthonormal basis vectors arranged in decreasing order of the eigenvalues in the diagonal matrix $\boldsymbol{\Lambda}$ . The $N$ -dimensional principal subspace $P$ is spanned by the first $N$ column vectors of $\boldsymbol{W}$ , and the span of the $(N+1)$ -th to the $D$ -th column vectors of $\boldsymbol{W}$ is the residual subspace ${P}^{\perp}$ . The matrix $\boldsymbol{W}$ can therefore be split into the matrix $\boldsymbol{Q}\in{\mathbb{R}}^{D\times{N}}$ , formed by the first $N$ eigenvectors, and the matrix $\boldsymbol{O}\in{\mathbb{R}}^{D\times(D-N)}$ , formed by the last $(D-N)$ eigenvectors, and we get

$$\boldsymbol{x}^{P}={\boldsymbol{Q}}^{T}\boldsymbol{x};\ \boldsymbol{x}^{{P}^{\perp}}={\boldsymbol{O}}^{T}\boldsymbol{x}. \tag{3}$$

Because the principal subspace and the residual subspace are constructed from the feature representations of all training samples, they ignore the information specific to any individual seen category; in other words, the characteristics captured by these two subspaces are class-agnostic. Moreover, we argue that seen samples lie relatively close to the principal subspace and deviate considerably from the residual subspace. Therefore we define the Residual Score $\boldsymbol{R}(\boldsymbol{x})$ as the negative norm of $\boldsymbol{x}^{{P}^{\perp}}$ :

$$\boldsymbol{R}(\boldsymbol{x})=-\|\boldsymbol{x}^{{P}^{\perp}}\|=-\left(\boldsymbol{x}^{T}{\boldsymbol{O}}{\boldsymbol{O}}^{T}\boldsymbol{x}\right)^{1/2}. \tag{4}$$

As with the Energy Score, we negate the norm so that ID samples produce higher OOD scores.
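The residual-score construction of Eqs. (2) to (4) can be sketched in NumPy as below. The helper names `fit_residual_basis` and `residual_score`, the 5-dimensional toy features, and the choice $N=2$ are assumptions for illustration. Following the text, $\boldsymbol{X}^{T}\boldsymbol{X}$ is decomposed without centering; `np.linalg.eigh` returns eigenvalues in ascending order, so the columns are re-sorted.

```python
import numpy as np

def fit_residual_basis(X: np.ndarray, n_principal: int) -> np.ndarray:
    """Return O (D x (D-N)): eigenvectors of X^T X outside the top-N eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # re-sort to decreasing eigenvalues
    W = eigvecs[:, order]
    return W[:, n_principal:]                      # drop Q (first N columns), keep O

def residual_score(x: np.ndarray, O: np.ndarray) -> np.ndarray:
    """R(x) = -||O^T x|| for each row of x."""
    return -np.linalg.norm(np.atleast_2d(x) @ O, axis=1)

# Toy seen features (assumed): almost all variance lies in the first 2 of 5 dimensions.
rng = np.random.default_rng(0)
X_seen = np.zeros((200, 5))
X_seen[:, :2] = rng.normal(size=(200, 2))
X_seen += 1e-3 * rng.normal(size=(200, 5))

O = fit_residual_basis(X_seen, n_principal=2)
in_subspace = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
off_subspace = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
r_in, r_off = residual_score(np.stack([in_subspace, off_subspace]), O)
```

A sample lying in the principal subspace gets a score near zero, while one lying mostly in the residual subspace gets a strongly negative score.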

EZ-OOD Score Formulation. The final OOD score $\boldsymbol{S}(\boldsymbol{x})$ is formulated as the weighted sum of the energy score and the residual score, unifying the class-specific and class-agnostic information for better ID-OOD separation:

$$\boldsymbol{S}(\boldsymbol{x})=\boldsymbol{E}(\boldsymbol{x})+\gamma\boldsymbol{R}(\boldsymbol{x}), \tag{5}$$

where $\gamma$ is the weight hyper-parameter to balance the scale of these two different scores and enhance the overall OOD detection performance.

The OOD detection process is defined as follows, where $A_{\lambda}$ is the binary classification outcome:

$$A_{\lambda}(\boldsymbol{x})=\begin{cases}\text{Seen} & \boldsymbol{S}(\boldsymbol{x})\geq\lambda\\ \text{Unseen} & \boldsymbol{S}(\boldsymbol{x})<\lambda,\end{cases} \tag{6}$$

where $\lambda$ is the threshold, and samples with higher $\boldsymbol{S}(\boldsymbol{x})$ tend to be treated as seen classes. The threshold is determined solely by the training samples and is independent of the test data.
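Eqs. (5) and (6) can be illustrated with the short sketch below. The text only states that $\lambda$ is determined from training samples, so the rule used here, keeping roughly 95% of the training (seen) scores above the threshold, is an assumption of this example, as are the names `ez_ood_score`, `pick_threshold`, and `detect` and all numeric values.

```python
import numpy as np

def ez_ood_score(energy: np.ndarray, residual: np.ndarray, gamma: float) -> np.ndarray:
    """S(x) = E(x) + gamma * R(x), as in Eq. (5)."""
    return energy + gamma * residual

def pick_threshold(train_scores: np.ndarray, keep_fraction: float = 0.95) -> float:
    """Hypothetical rule: keep about `keep_fraction` of training scores at or above lambda."""
    return float(np.quantile(train_scores, 1.0 - keep_fraction))

def detect(scores: np.ndarray, lam: float) -> np.ndarray:
    """Eq. (6): 'seen' if S(x) >= lambda, otherwise 'unseen'."""
    return np.where(scores >= lam, "seen", "unseen")

# Toy training scores of seen samples (assumed values).
train_scores = ez_ood_score(
    energy=np.array([6.0, 7.0, 8.0, 9.0]),
    residual=np.array([-0.1, -0.2, -0.1, -0.2]),
    gamma=10.0,
)
lam = pick_threshold(train_scores)

# Two toy test samples: one resembling seen data, one far from it.
test_scores = ez_ood_score(np.array([8.0, 3.0]), np.array([-0.1, -0.5]), gamma=10.0)
labels = detect(test_scores, lam)
```

Only training scores enter `pick_threshold`, so the decision boundary does not depend on the test data, in line with the statement above.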

Methods	VGGSound-GZSL^cls				UCF-GZSL^cls				ActivityNet-GZSL^cls
Methods	$acc_{\boldsymbol{S}}$	$acc_{\boldsymbol{U}}$	$\boldsymbol{H}$	$acc_{ZSL}$	$acc_{\boldsymbol{S}}$	$acc_{\boldsymbol{U}}$	$\boldsymbol{H}$	$acc_{ZSL}$	$acc_{\boldsymbol{S}}$	$acc_{\boldsymbol{U}}$	$\boldsymbol{H}$	$acc_{ZSL}$
CJME [28]	11.96	5.41	7.45	6.84	48.18	17.68	25.87	20.46	16.06	9.13	11.64	9.92
AVGZSLNet [21]	13.02	2.88	4.71	5.44	56.26	34.37	42.67	35.66	14.81	11.11	12.70	12.39
AVCA [25]	32.47	6.81	11.26	8.16	34.90	38.67	36.69	38.67	24.04	19.88	21.76	20.88
Hyper-multiple [12]	21.99	8.12	11.87	8.47	43.52	39.77	41.56	40.28	20.52	21.30	20.90	22.18
ClipClap [14]	29.68	11.12	16.18	11.53	77.14	43.91	55.97	46.96	45.98	20.06	27.93	22.76
EZ-AVOOD (Ours)	39.33	11.84	18.21	13.28	83.53	48.01	60.97	50.92	41.56	21.06	27.95	25.20

Table 1: Comparison with existing state-of-the-art methods on VGGSound-GZSL^cls, UCF-GZSL^cls and ActivityNet-GZSL^cls datasets. GZSL ( $acc_{\boldsymbol{S}}$ / $acc_{\boldsymbol{U}}$ / $\boldsymbol{H}$ ) and ZSL ( $acc_{ZSL}$ ) performance is reported in percentage. For a fair comparison, the results of all five baseline methods are obtained using audio-visual features and text embeddings extracted by the CLIP and CLAP models. Bold values represent the best results and the second-ranked numbers are underlined.

3.2.3 Unseen Classifier

Here we fine-tune the ClipClap [14] model to raise the average accuracy on unseen classes and thereby improve the final GZSL performance. As illustrated in Figure 2, the general framework consists of two Encoder-Encoder-Decoder branches that process the concatenated audio-visual features and the fused text embeddings, respectively. For feature extraction, the visual features $\boldsymbol{v}$ and one part of the concatenated text embeddings, $\boldsymbol{t^{v}}$ , are extracted by the vision-language pre-trained model CLIP [29], while the CLAP [23] model, dedicated to audio-language tasks, produces the audio features $\boldsymbol{a}$ and the other part of the text embeddings, $\boldsymbol{t^{a}}$ .

The first encoder block $\boldsymbol{O}_{enc}$ from audio-visual branch takes concatenated features $\boldsymbol{x}=\boldsymbol{a}\oplus\boldsymbol{v}$ as input and outputs the multimodal sample features $\boldsymbol{o}$ :

$$\boldsymbol{o}=\boldsymbol{O}_{enc}(\boldsymbol{x}). \tag{7}$$

In the same way, we get unified text embeddings $\boldsymbol{w}$ with encoder $\boldsymbol{W}_{enc}$ :

$$\boldsymbol{w}=\boldsymbol{W}_{enc}(\boldsymbol{t^{a}}\oplus\boldsymbol{t^{v}}). \tag{8}$$

Feeding the multimodal sample features $\boldsymbol{o}$ and the fused text embeddings $\boldsymbol{w}$ into the remaining two simple and effective Encoder-Decoder modules, we have the following formulations:

$$\boldsymbol{\theta}_{o}=\boldsymbol{O}_{proj}(\boldsymbol{o});\ \boldsymbol{\rho}_{o}=\boldsymbol{D}_{o}(\boldsymbol{\theta}_{o}), \tag{9}$$

$$\boldsymbol{\theta}_{w}=\boldsymbol{W}_{proj}(\boldsymbol{w});\ \boldsymbol{\rho}_{w}=\boldsymbol{D}_{w}(\boldsymbol{\theta}_{w}), \tag{10}$$

where $\boldsymbol{\theta}_{o}$ and $\boldsymbol{\theta}_{w}$ represent the projection outcomes, while the reconstruction process produces $\boldsymbol{\rho}_{o}$ and $\boldsymbol{\rho}_{w}$ . The training objectives of the unseen classifier include the Cross Entropy Loss $\boldsymbol{\mathcal{L}}_{xe}$ :

$$\boldsymbol{\mathcal{L}}_{xe}=-\frac{1}{n}\sum_{i}^{n}\log\left(\frac{\exp\left(\boldsymbol{\theta}_{w_{y_{i}^{s}}}\boldsymbol{\theta}_{o_{i}}\right)}{\sum_{j}^{C_{s}}\exp\left(\boldsymbol{\theta}_{w_{y_{j}^{s}}}\boldsymbol{\theta}_{o_{i}}\right)}\right), \tag{11}$$

where $y_{i}^{s}$ is the label of seen sample $i$ , and $\boldsymbol{\theta}_{w_{y_{i}^{s}}}$ denotes the $\boldsymbol{\theta}_{w}$ -projection of the text embedding belonging to seen class $y_{i}^{s}$ . $n$ and $C_{s}$ represent the number of training samples and the number of seen categories, respectively. Another loss function is the Reconstruction Loss $\boldsymbol{\mathcal{L}}_{rec}$ , which uses the MSE (mean squared error) to minimize the discrepancy between the reconstructions $\boldsymbol{\rho}_{o}$ , $\boldsymbol{\rho}_{w}$ and the text embeddings $\boldsymbol{w}$ :

$$\boldsymbol{\mathcal{L}}_{rec}=\frac{1}{n}\sum_{i}^{n}\left[{\left(\boldsymbol{\rho}_{o_{i}}-\boldsymbol{w}_{i}\right)}^{2}+{\left(\boldsymbol{\rho}_{w_{i}}-\boldsymbol{w}_{i}\right)}^{2}\right]. \tag{12}$$

Moreover, a Regression Loss $\boldsymbol{\mathcal{L}}_{reg}$ calculated by MSE function between $\boldsymbol{\theta}_{o}$ and $\boldsymbol{\theta}_{w}$ is defined as:

$$\boldsymbol{\mathcal{L}}_{reg}=\frac{1}{n}\sum_{i}^{n}{\left(\boldsymbol{\theta}_{o_{i}}-\boldsymbol{\theta}_{w_{i}}\right)}^{2}. \tag{13}$$

And the overall loss for unseen classifier is defined as:

$$\boldsymbol{\mathcal{L}}_{total}=\boldsymbol{\mathcal{L}}_{xe}+\boldsymbol{\mathcal{L}}_{rec}+\boldsymbol{\mathcal{L}}_{reg}. \tag{14}$$

Following the original loss function design, there is no weight parameter applied on final loss $\boldsymbol{\mathcal{L}}_{total}$ .
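To make Eqs. (11) to (14) concrete, the following NumPy sketch evaluates the three loss terms on already-computed projections and reconstructions. It is a hedged illustration, not the ClipClap or EZ-AVOOD training code: the function name `unseen_expert_losses`, the argument layout, and the toy tensors are assumptions, and the squared terms in Eqs. (12) and (13) are interpreted here as squared L2 norms summed over feature dimensions.

```python
import numpy as np

def unseen_expert_losses(theta_o, theta_w_classes, rho_o, rho_w, w, labels):
    """Evaluate L_xe, L_rec, L_reg and their unweighted sum.

    theta_o:          (n, d) projected audio-visual features
    theta_w_classes:  (C_s, d) projected text embeddings, one row per seen class
    rho_o, rho_w, w:  (n, k) reconstructions and target text embeddings
    labels:           (n,) integer seen-class labels
    """
    logits = theta_o @ theta_w_classes.T                       # dot-product similarities
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_xe = -log_prob[np.arange(len(labels)), labels].mean()    # Eq. (11)

    l_rec = (((rho_o - w) ** 2).sum(axis=1)
             + ((rho_w - w) ** 2).sum(axis=1)).mean()          # Eq. (12)

    theta_w = theta_w_classes[labels]                          # per-sample class projection
    l_reg = ((theta_o - theta_w) ** 2).sum(axis=1).mean()      # Eq. (13)
    return l_xe + l_rec + l_reg, (l_xe, l_rec, l_reg)          # Eq. (14), no weights

# Toy case (assumed values): perfect alignment and perfect reconstruction.
theta_w_classes = 2.0 * np.eye(2)
labels = np.array([0, 1])
theta_o = theta_w_classes[labels]
w = np.ones((2, 3))
total, (l_xe, l_rec, l_reg) = unseen_expert_losses(theta_o, theta_w_classes, w, w, w, labels)
```

In this idealized case the reconstruction and regression terms vanish, and only the cross-entropy term contributes to the total.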

During the test phase, the classification result is determined by the nearest neighbor principle: the class whose text embedding is closest to the sample feature projection $\boldsymbol{\theta}_{o}^{i}$ is selected as the predicted label $c_{i}$ :

$$c_{i}=\underset{j}{\operatorname{argmin}}\left(\left\|\boldsymbol{\theta}_{w}^{j}-\boldsymbol{\theta}_{o}^{i}\right\|_{2}\right), \tag{15}$$

where $\boldsymbol{\theta}_{w}^{j}$ is the encoded text embedding corresponding to class $j$ .
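The nearest-neighbor rule of Eq. (15) amounts to an argmin over Euclidean distances, as in the sketch below. The function name `nearest_class` and the 2-dimensional toy embeddings are assumptions for illustration.

```python
import numpy as np

def nearest_class(theta_o: np.ndarray, theta_w_classes: np.ndarray) -> np.ndarray:
    """For each row of theta_o, return the index j minimizing ||theta_w^j - theta_o^i||_2."""
    diffs = theta_o[:, None, :] - theta_w_classes[None, :, :]   # shape (n, C, d)
    dists = np.linalg.norm(diffs, axis=2)                        # shape (n, C)
    return dists.argmin(axis=1)

# Toy projections (assumed values): three class embeddings, two test samples.
theta_w_classes = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
theta_o = np.array([[0.9, 1.2], [3.5, 0.1]])
pred = nearest_class(theta_o, theta_w_classes)
```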

Methods	VGGSound-GZSL^main				UCF-GZSL^main				ActivityNet-GZSL^main
Methods	$acc_{\boldsymbol{S}}$	$acc_{\boldsymbol{U}}$	$\boldsymbol{H}$	$acc_{ZSL}$	$acc_{\boldsymbol{S}}$	$acc_{\boldsymbol{U}}$	$\boldsymbol{H}$	$acc_{ZSL}$	$acc_{\boldsymbol{S}}$	$acc_{\boldsymbol{U}}$	$\boldsymbol{H}$	$acc_{ZSL}$
CJME [28]	8.69	4.78	6.17	5.16	26.04	8.21	12.48	8.29	5.55	4.75	5.12	5.84
AVGZSLNet [21]	18.15	3.48	5.83	5.28	52.52	10.90	18.05	13.65	8.93	5.04	6.44	5.40
TCaF [24]	9.64	5.91	7.33	6.06	58.60	21.74	31.72	24.81	18.70	7.50	10.71	7.91
VIB-GZSL [17]	18.42	6.00	9.05	6.41	90.35	21.41	34.62	22.49	22.12	8.94	12.73	9.29
AVMST [15]	14.14	5.28	7.68	6.61	44.08	22.63	29.91	28.19	17.75	9.90	12.71	10.37
MDFT [16]	16.14	5.97	8.72	7.13	48.79	23.11	31.36	31.53	18.32	10.55	13.39	12.55
Hyper-multiple [12]	15.02	6.75	9.32	7.97	63.08	19.10	29.32	22.24	23.38	8.67	12.65	9.50
AVCA [25]	14.90	4.00	6.31	6.00	51.53	18.43	27.15	20.01	24.86	8.02	12.13	9.13
OOD-entropy+AVCA [35]	13.31	7.01	9.19	7.48	63.94	26.99	37.96	30.56	29.84	9.54	14.46	11.41
EZ-OOD+AVCA (Ours)	24.94	6.38	10.16	7.48	79.71	27.94	41.38	30.56	30.65	9.29	14.26	11.41

Table 2: Compatibility of EZ-OOD with an existing method. We compare existing state-of-the-art AV-GZSL methods with our new model on VGGSound-GZSL^main, UCF-GZSL^main, and ActivityNet-GZSL^main datasets. GZSL ( $acc_{\boldsymbol{S}}$ / $acc_{\boldsymbol{U}}$ / $\boldsymbol{H}$ ) and ZSL ( $acc_{ZSL}$ ) performance is reported in percentage. Bold numbers denote the best results and the second highest values are underlined.

4 Experiments and Results Analysis

4.1 Setup for Audio-visual GZSL

4.1.1 Datasets and Evaluation Metrics

Following the experimental setting in AVCA [25], we adopt the curated versions of three audio-visual datasets, VGGSound [5], UCF101 [30], and ActivityNet [3], to evaluate our EZ-AVOOD method; they are denoted VGGSound-GZSL^cls, UCF-GZSL^cls and ActivityNet-GZSL^cls. The superscript $cls$ denotes the $cls$ - $split$ introduced in [24], as opposed to the $main$ - $split$ used in [25].

Consistent with ZSL conventions [36, 25], we adopt the average per-class classification accuracy as the evaluation metric, where $acc_{\boldsymbol{S}}$ and $acc_{\boldsymbol{U}}$ denote the mean class accuracy of seen classes and unseen classes, respectively. To comprehensively evaluate GZSL performance, the harmonic mean $\boldsymbol{H}$ of the seen and unseen accuracies is calculated as:

$$\boldsymbol{H}=\frac{2\times acc_{\boldsymbol{S}}\times acc_{\boldsymbol{U}}}{acc_{\boldsymbol{S}}+acc_{\boldsymbol{U}}}. \tag{16}$$

For ZSL tasks aimed to classify unseen samples only, mean class accuracy $acc_{ZSL}$ is also obtained.
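The two evaluation quantities can be sketched as follows. The function names and the toy labels are assumptions for this example; the final call plugs the EZ-AVOOD seen and unseen accuracies reported for UCF-GZSL^cls in Table 1 (83.53 and 48.01) into Eq. (16).

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, classes) -> float:
    """Average of per-class accuracies (every class in `classes` must appear in y_true)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(per_class))

def harmonic_mean(acc_s: float, acc_u: float) -> float:
    """Eq. (16): H = 2 * acc_S * acc_U / (acc_S + acc_U)."""
    if acc_s + acc_u == 0:
        return 0.0
    return 2.0 * acc_s * acc_u / (acc_s + acc_u)

# Toy labels (assumed): class 0 has 3 samples with 2 correct, class 1 has 1 sample, correct.
acc = mean_class_accuracy([0, 0, 0, 1], [0, 0, 1, 1], classes=[0, 1])

# Table 1, UCF-GZSL^cls, EZ-AVOOD row.
h_ucf = harmonic_mean(83.53, 48.01)
```

Rounded to two decimals, `h_ucf` matches the 60.97 reported in Table 1. Note that per-class averaging weights the single-sample class as heavily as the three-sample class, which differs from plain sample-level accuracy.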

4.1.2 Implementation Details

OOD Detector. The trained seen classifier serves as our OOD detector to produce the logits and the energy score. For the residual score, the dimension N of the principal subspace and the scaling factor $\gamma$ are set to (N, $\gamma$ ) = (64, 90) for VGGSound-GZSL^cls, (256, 205) for UCF-GZSL^cls and (256, 285) for ActivityNet-GZSL^cls.

More details about Feature Extractor, Seen Classifier, and Unseen Classifier are provided in supplementary material.

4.2 Experimental Results

4.2.1 Quantitative Results

As shown in Table 1, our EZ-AVOOD model consistently takes the lead on all three benchmarks in terms of both the harmonic mean $\boldsymbol{H}$ for the GZSL task and $acc_{ZSL}$ under the ZSL setup. On the VGGSound-GZSL^cls dataset, we achieve the best performance on all metrics; in particular, EZ-AVOOD outperforms the current state-of-the-art ClipClap by 9.65%@ $acc_{\boldsymbol{S}}$ , 0.72%@ $acc_{\boldsymbol{U}}$ , 2.03%@ $\boldsymbol{H}$ , and 1.75%@ $acc_{ZSL}$ . In addition, our method overtakes ClipClap on the UCF-GZSL^cls benchmark by even larger margins of 6.39%@ $acc_{\boldsymbol{S}}$ , 4.10%@ $acc_{\boldsymbol{U}}$ , 5.00%@ $\boldsymbol{H}$ , and 3.96%@ $acc_{ZSL}$ , respectively. Although the proposed EZ-AVOOD “merely” takes second place on $acc_{\boldsymbol{S}}$ and $acc_{\boldsymbol{U}}$ of ActivityNet-GZSL^cls, where ClipClap and Hyper-multiple rank first respectively, our method holds first place on the more comprehensive metric $\boldsymbol{H}$ and surpasses all baseline methods on $acc_{ZSL}$ by a margin of at least 2.44%.

4.3 Compatibility of EZ-OOD with Existing Method

4.3.1 Experimental Setup

Here, we replace the unseen classifier of EZ-AVOOD with AVCA [25] and explore the new model’s experimental performance. We provide some key details of this experiment: $main$ - $split$ of three datasets is adopted; audio features and visual features are extracted by self-supervised SeLaVi [2] pre-trained on VGGSound dataset; and text embeddings are obtained using word2vec model [26] pre-trained with Wikipedia.

4.3.2 Baseline Method

AV-OOD [35] is another OOD-based method that takes the AVCA model as the unseen expert and proposes the OOD-entropy method for OOD detection, which makes it well suited for comparison with our method. To ensure a fair comparison, we only train the seen classifier (which also works as the OOD detector) and use the same unseen classifier as the AV-OOD method. Notably, we re-run AV-OOD with the provided checkpoint files to obtain its ZSL and GZSL performance. As can be seen in Table 2, the $acc_{ZSL}$ results of our method and OOD-entropy are identical on all three benchmarks, as expected given that both share the same unseen classifier.

Methods	VGGSound-GZSL^cls				UCF-GZSL^cls				ActivityNet-GZSL^cls
	AUROC	FPR95	AUPR	$\boldsymbol{H}$	AUROC	FPR95	AUPR	$\boldsymbol{H}$	AUROC	FPR95	AUPR	$\boldsymbol{H}$
	$\uparrow$	$\downarrow$	$\uparrow$	$\uparrow$	$\uparrow$	$\downarrow$	$\uparrow$	$\uparrow$	$\uparrow$	$\downarrow$	$\uparrow$	$\uparrow$
Residual Score	68.73	85.74	85.51	14.91	91.30	42.09	90.96	58.65	67.30	85.28	44.73	25.34
Energy Score	81.97	67.69	92.65	17.84	80.41	74.09	86.55	47.57	70.99	86.95	54.57	26.13
EZ-OOD (full)	84.33	66.24	93.82	18.21	95.35	33.87	96.01	60.97	77.57	80.61	63.62	27.95

Table 3: Ablation studies on the EZ-OOD method. We compare the Energy Score, the Residual Score, and their $\gamma$ -weighted sum, the full EZ-OOD, on VGGSound-GZSL^cls, UCF-GZSL^cls and ActivityNet-GZSL^cls datasets. Out-of-distribution detection metrics (AUROC/FPR95/AUPR) and GZSL performance (harmonic mean $\boldsymbol{H}$ ) are reported in percentage. $\downarrow$ indicates that lower results are better while $\uparrow$ means the opposite. Bold values denote the best results and the second-best outcomes are underlined.

4.3.3 Quantitative Results

First, compared with the baseline method AVCA, the new model leads on all metrics of the three datasets. The largest margins are obtained on the UCF-GZSL^main dataset: 28.18%@ $acc_{\boldsymbol{S}}$ , 9.51%@ $acc_{\boldsymbol{U}}$ , 14.23%@ $\boldsymbol{H}$ , and 10.55%@ $acc_{ZSL}$ . In addition, our new model also improves the ZSL and GZSL performance over the baseline on the other two benchmarks, which verifies the compatibility of the proposed EZ-OOD method.

Second, compared with OOD-entropy+AVCA [35], our EZ-OOD with the same unseen classifier attains a clear lead on the harmonic mean of both the VGGSound-GZSL^main and UCF-GZSL^main datasets (0.97% and 3.42%, respectively), and lags behind by merely 0.2%@ $\boldsymbol{H}$ on the ActivityNet-GZSL^main benchmark. Notably, the OOD detection performance of our method is actually ahead of OOD-entropy on the ActivityNet benchmark, as illustrated in the supplementary material. AV-GZSL evaluates the average per-class classification accuracy, whereas OOD detection only considers the ID-OOD separation of all test samples regardless of class labels; as a result, a better but close OOD detection performance does not always bring stronger GZSL performance.

4.4 Ablation Studies

To gain insight into the concrete effect of the Energy Score and the Residual Score, we conduct additional ablation studies comparing the OOD detection performance and AV-GZSL results of EZ-OOD and its two key components. The experimental setup is consistent with that of the proposed EZ-AVOOD model in Section 4.2. In Table 3 we report the AUROC (Area Under the Receiver Operating Characteristic curve), FPR95 (FPR@TPR95), and AUPR (Area Under the Precision versus Recall curve) to evaluate OOD detection capability, as well as the harmonic mean $\boldsymbol{H}$ of the GZSL task on three datasets. We also draw the ROC curves of the three methods in Figure 4 to display their OOD detection performance on each benchmark.
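For readers unfamiliar with these detection metrics, the sketch below computes AUROC and FPR at a target TPR from two sets of scores, treating seen (ID) samples as positives. It is an illustrative NumPy version under stated assumptions rather than the evaluation code used for Table 3: the function names and toy scores are hypothetical, AUROC is computed through its pairwise-ranking interpretation, and AUPR is omitted for brevity.

```python
import numpy as np

def auroc(id_scores, ood_scores) -> float:
    """Probability that a random ID sample scores higher than a random OOD sample."""
    a = np.asarray(id_scores, dtype=float)[:, None]
    b = np.asarray(ood_scores, dtype=float)[None, :]
    return float((a > b).mean() + 0.5 * (a == b).mean())   # ties count as half

def fpr_at_tpr(id_scores, ood_scores, tpr: float = 0.95) -> float:
    """Fraction of OOD samples accepted when at least `tpr` of ID samples are accepted."""
    ranked = np.sort(np.asarray(id_scores, dtype=float))[::-1]   # descending ID scores
    k = int(np.ceil(tpr * len(ranked)))                          # ID samples to keep
    lam = ranked[k - 1]                                          # threshold at k-th largest
    return float((np.asarray(ood_scores, dtype=float) >= lam).mean())

# Toy scores (assumed values): higher means "more likely seen".
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.2, 0.1, 0.05]
auc = auroc(id_scores, ood_scores)
fpr_strict = fpr_at_tpr(id_scores, ood_scores, tpr=1.0)
fpr_loose = fpr_at_tpr(id_scores, ood_scores, tpr=0.75)
```

Here 15 of the 16 ID-OOD pairs are ranked correctly, so the AUROC is 0.9375; requiring every ID sample to be accepted lets one OOD sample through, which is what FPR95 measures at a 95% TPR.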

4.4.1 Quantitative Results and Qualitative Results

According to the results in Table 3, the full EZ-OOD takes first place on all three OOD detection metrics, with the highest AUROC and AUPR and the lowest FPR95, and achieves the best $\boldsymbol{H}$ for GZSL. Of the two components, Energy Score ranks second on the VGGSound-GZSL^cls and ActivityNet-GZSL^cls datasets (except for FPR95 on ActivityNet-GZSL^cls, where Residual Score is slightly lower), while Residual Score attains a remarkable lead over Energy Score on the UCF-GZSL^cls benchmark. Moreover, we observe that the harmonic mean $\boldsymbol{H}$ produced by Energy Score@VGGSound-GZSL^cls (17.84%) and Residual Score@UCF-GZSL^cls (58.65%) already exceeds that of all the compared methods in Table 1. Therefore, we conclude that both the Energy Score and the Residual Score play a vital role in separating seen and unseen samples to facilitate the subsequent ZSL classification.

Figure 4 depicts the AUROC discrepancy between EZ-OOD score and its two crucial components on three benchmarks. In addition, two individual OOD scores manifest competitive detection performance whose ROC curves are close to the upper-left in the graph. To sum up, the proposed OOD score effectively combines the strengths of the two powerful scores with the weighted sum to achieve stronger OOD detection performance and higher GZSL classification accuracy.

4.5 Parameter Sensitivity Studies

4.5.1 Effect of Scaling Factor $\gamma$

Here we test the scaling factor $\gamma\in\{0.1,1,10,100,250,500,1000\}$ with a fixed N specific to each dataset. Since better OOD detection performance typically yields a higher $\boldsymbol{H}$ for the GZSL task, we evaluate AUROC instead of $\boldsymbol{H}$ to avoid extra computational burden. We follow the same experimental setup as in Section 4.2. Figure 5 illustrates the AUROC on the three datasets for different scaling factors $\gamma$. When the scaling factor is set to 0.1 or 1000, the linearly combined EZ-OOD score effectively reduces to the plain energy score or the residual score alone, resulting in lower AUROC at both ends of the curve, which is consistent with the outcomes of the ablation studies. Moreover, across a wide range of scaling factors, our method effectively integrates the discriminative information held by the class-specific energy score and the class-agnostic residual score, achieving better OOD detection performance and better audio-visual GZSL results than either individual score.
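The limiting behaviour at the two ends of the $\gamma$ range can be checked on a toy example. The sketch below assumes the combined score has the form energy plus $\gamma$ times residual (an illustrative assumption, as is the mapping of small $\gamma$ to the energy score) and uses made-up score values. Because AUROC depends only on how samples are ranked, it compares the ranking induced by the combined score with the rankings induced by each individual score.

```python
def combine(energy, residual, gamma):
    """Assumed weighted sum of two precomputed per-sample scores."""
    return [e + gamma * r for e, r in zip(energy, residual)]

def ranking(values):
    """Sample indices ordered from lowest to highest score."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical per-sample scores for three samples.
energy = [-5.0, -1.0, -3.0]
residual = [2.0, 1.0, 3.0]

for gamma in (0.1, 1, 10, 100, 250, 500, 1000):
    print(gamma, ranking(combine(energy, residual, gamma)))

# Small gamma: the ranking follows the energy score.
print(ranking(combine(energy, residual, 0.1)) == ranking(energy))
# Large gamma: the ranking follows the residual score.
print(ranking(combine(energy, residual, 1000)) == ranking(residual))
```

How quickly one term dominates depends on the relative scale of the two scores, so the useful range of $\gamma$ is dataset-specific.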

4.5.2 Effect of Principal Subspace Dimension N

Different values of N directly influence the OOD detection performance of the residual score, which in turn changes the EZ-OOD score and ultimately the overall audio-visual GZSL results. Since the concatenated audio-visual feature is 1536-d, we test N in {32, 64, 128, 256, 384, 512, 768} together with a fixed $\gamma$ for each benchmark, and report the AUROC as the evaluation metric. As depicted in Figure 6, the fitted curves peak at $N=64$, $N=256$ and $N=256$ for the VGGSound-GZSL^cls, UCF-GZSL^cls and ActivityNet-GZSL^cls benchmarks respectively and are generally “flat”, which indicates that the proposed EZ-OOD method is largely insensitive to the dimension N. As a result, this hyperparameter can be selected from a wide range of values with little influence on the final GZSL performance, which validates the robustness of our method.
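The role of N can be made concrete with a small NumPy sketch of a subspace residual. It assumes a PCA-style construction: the principal subspace is spanned by the top-N eigenvectors of the covariance of seen-class features, and the residual is the norm of the feature component orthogonal to that subspace. The centering choice, the function names, and the synthetic 32-d data are all assumptions made for illustration and stand in for the 1536-d audio-visual features.

```python
import numpy as np

def fit_principal_subspace(features, n_dims):
    """Return (mean, basis); basis columns are the top-n_dims principal directions."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / len(features)
    _, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :n_dims]       # reorder to descending, keep top n_dims
    return mean, basis

def residual_score(x, mean, basis):
    """Norm of the part of x that lies outside the principal subspace."""
    centered = x - mean
    projected = centered @ basis @ basis.T
    return np.linalg.norm(centered - projected, axis=-1)

# Synthetic data: "seen" features lie near an 8-d subspace of a 32-d space,
# "unseen" features are isotropic and therefore leave that subspace.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 32))
seen_train = rng.normal(size=(500, 8)) @ mixing + 0.01 * rng.normal(size=(500, 32))
seen_test = rng.normal(size=(100, 8)) @ mixing + 0.01 * rng.normal(size=(100, 32))
unseen_test = rng.normal(size=(100, 32))

for n in (4, 8, 16):
    mean, basis = fit_principal_subspace(seen_train, n)
    print(n,
          float(residual_score(seen_test, mean, basis).mean()),
          float(residual_score(unseen_test, mean, basis).mean()))
```

In this toy setting the seen and unseen residuals stay well separated once N reaches the intrinsic dimension of the seen features, which is one plausible reading of why the AUROC curves in Figure 6 are flat over a wide range of N.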

5 Conclusion

In this paper, we propose EZ-AVOOD, an extremely simple OOD-detection-based model for Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) that integrates the discriminative information held by class-specific logits and the class-agnostic feature subspace. Superior experimental results on three audio-visual datasets demonstrate the effectiveness of our model. Moreover, the compatibility of the proposed OOD detection method EZ-OOD is verified by deploying a different unseen classifier to construct a new model that outperforms the compared methods in both OOD detection performance and GZSL classification accuracy. Therefore, we conclude that EZ-AVOOD establishes a new state of the art for AV-GZSL.

References

Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223. PMLR, 2017.
Asano et al. [2020] Yuki Asano, Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. Labelling unlabelled videos from scratch with multi-modal self-supervision. NeurIPS, 33:4660–4671, 2020.
Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.
Chao et al. [2016] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52–68. Springer, 2016.
Chen et al. [2020] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP, pages 721–725. IEEE, 2020.
Chen et al. [2023a] Yiye Chen, Yunzhi Lin, Ruinian Xu, and Patricio A Vela. Wdiscood: Out-of-distribution detection via whitened linear discriminant analysis. In ICCV, pages 5298–5307, 2023a.
Chen et al. [2023b] Zailong Chen, Lei Wang, Peng Wang, and Peng Gao. Question-aware global-local video understanding network for audio-visual question answering. IEEE TCSVT, 2023b.
Djurisic et al. [2022] Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. arXiv preprint arXiv:2209.09858, 2022.
Dong et al. [2022] Guan-Nan Dong, Chi-Man Pun, and Zheng Zhang. Temporal relation inference network for multimodal speech emotion recognition. IEEE TCSVT, 32:6472–6485, 2022.
Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. NeurIPS, 30, 2017.
Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
Hong et al. [2023] Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. Hyperbolic audio-visual zero-shot learning. In ICCV, pages 7873–7883, 2023.
Huang and Li [2021] Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In CVPR, pages 8710–8719, 2021.
Kurzendörfer et al. [2024] David Kurzendörfer, Otniel-Bogdan Mercea, A Koepke, and Zeynep Akata. Audio-visual generalized zero-shot learning using pre-trained large multi-modal models. In CVPR, pages 2627–2638, 2024.
Li et al. [2023a] Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Hengyu Man, and Xiaopeng Fan. Modality-fusion spiking transformer network for audio-visual zero-shot learning. In ICME, pages 426–431. IEEE, 2023a.
Li et al. [2023b] Wenrui Li, Xi-Le Zhao, Zhengyu Ma, Xingtao Wang, Xiaopeng Fan, and Yonghong Tian. Motion-decoupled spiking transformer for audio-visual zero-shot learning. In ACM Multimedia, pages 3994–4002, 2023b.
Li et al. [2023c] Yapeng Li, Yong Luo, and Bo Du. Audio-visual generalized zero-shot learning based on variational information bottleneck. In ICME, pages 450–455. IEEE, 2023c.
Liang et al. [2017] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
Lin et al. [2021] Ziqian Lin, Sreya Dutta Roy, and Yixuan Li. Mood: Multi-level out-of-distribution detection. In CVPR, pages 15313–15323, 2021.
Liu et al. [2020] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. NeurIPS, 33:21464–21475, 2020.
Mazumder et al. [2021] Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, and Vinay P Namboodiri. Avgzslnet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings. In WACV, pages 3090–3099, 2021.
Mei et al. [2022] Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D Plumbley, and Wenwu Wang. Diverse audio captioning via adversarial training. In ICASSP, pages 8882–8886. IEEE, 2022.
Mei et al. [2024] Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
Mercea et al. [2022a] Otniel-Bogdan Mercea, Thomas Hummel, A Sophia Koepke, and Zeynep Akata. Temporal and cross-modal attention for audio-visual zero-shot learning. In ECCV, pages 488–505. Springer, 2022a.
Mercea et al. [2022b] Otniel-Bogdan Mercea, Lukas Riesch, A Koepke, and Zeynep Akata. Audio-visual generalised zero-shot learning with cross-modal attention and language. In CVPR, pages 10553–10563, 2022b.
Mikolov et al. [2017] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405, 2017.
Mo and Morgado [2025] Shentong Mo and Pedro Morgado. Audio-visual generalized zero-shot learning the easy way. In ECCV, pages 377–395. Springer, 2025.
Parida et al. [2020] Kranti Parida, Neeraj Matiyali, Tanaya Guha, and Gaurav Sharma. Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos. In WACV, pages 3251–3260, 2020.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
Soomro [2012] K Soomro. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Sperber et al. [2019] Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313–325, 2019.
Sun et al. [2022] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In ICML, pages 20827–20840. PMLR, 2022.
Vaswani [2017] A Vaswani. Attention is all you need. NeurIPS, 2017.
Wang et al. [2023] Lanxiao Wang, Heqian Qiu, Benliu Qiu, Fanman Meng, Qingbo Wu, and Hongliang Li. Tridentcap: Image-fact-style trident semantic framework for stylized image captioning. IEEE TCSVT, 2023.
Wen [2024] Liuyuan Wen. Out-of-distribution detection for audio-visual generalized zero-shot learning: A general framework. arXiv preprint arXiv:2408.01284, 2024.
Xian et al. [2018] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE TPAMI, 41(9):2251–2265, 2018.
Xie et al. [2019] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
Ye et al. [2016] Qiaolin Ye, Jian Yang, Fan Liu, Chunxia Zhao, Ning Ye, and Tongming Yin. L1-norm distance linear discriminant analysis based on an effective iterative algorithm. IEEE TCSVT, 28(1):114–129, 2016.
Zhang and Lu [2018] Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In ECCV, pages 686–701, 2018.
Zheng et al. [2023] Qichen Zheng, Jie Hong, and Moshiur Farazi. A generative approach to audio-visual generalized zero-shot learning: Combining contrastive and discriminative techniques. In IJCNN, pages 1–8. IEEE, 2023.