DATTA: Towards Diversity Adaptive Test-Time Adaptation in Dynamic Wild World
¹ Shenzhen Technology University, Shenzhen, China
² Tsinghua University, Shenzhen, China
Email: youngyorkye@gmail.com, 20210080214@stumail.sztu.edu.cn

Chuyang Ye*¹, Dongyan Wei*¹, Zhendong Liu¹, Yuanyi Pang¹, Yixi Lin¹, Jiarong Liao¹, Qinting Jiang², Xianghua Fu¹, Qing Li¹, Jingyan Jiang(✉)¹
Abstract

Test-time adaptation (TTA) addresses distribution shifts between training and testing data by adjusting models on test samples, which is crucial for improving model inference in real-world applications. However, traditional TTA methods typically follow a fixed pattern when handling dynamic data patterns (low-diversity or high-diversity patterns), often leading to performance degradation and consequently a decline in Quality of Experience (QoE). The primary issues we observed are: 1) Different scenarios require different normalization methods (e.g., Instance Normalization (IN) is optimal in mixed domains but not in static domains). 2) Model fine-tuning can potentially harm the model and waste time. Hence, it is crucial to design strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance. Based on these observations, this paper proposes a new general method, named Diversity Adaptive Test-Time Adaptation (DATTA), aimed at improving QoE. DATTA dynamically selects the best batch normalization methods and fine-tuning strategies by leveraging the Diversity Score to differentiate between high-diversity and low-diversity batches. It features three key components: Diversity Discrimination (DD) to assess batch diversity, Diversity Adaptive Batch Normalization (DABN) to tailor normalization methods based on DD insights, and Diversity Adaptive Fine-Tuning (DAFT) to selectively fine-tune the model. Experimental results show that our method achieves up to a 21% increase in accuracy compared to state-of-the-art methodologies, indicating that it maintains good model performance while remaining robust. Our code will be released soon.

Keywords:
Quality of Experience · Test-time Adaptation · Test-time Normalization · Domain Generalization · Domain Adaptation
* Equal Contribution
(✉) Corresponding Author

1 Introduction

Despite the considerable progress made with deep neural networks (DNNs), models trained on a source domain often experience a significant drop in performance when tested in a different environment (e.g., target domains) [12, 8, 3, 22]. Such changes in data distribution—caused by factors like different camera sensors, weather conditions, or geographic regions—lead to a decline in inference service performance, resulting in poor Quality of Experience (QoE) for users. This performance degradation can even lead to critical failures, especially in high-stakes applications such as autonomous driving and mobile healthcare [1, 11]. To address this issue, Test-Time Adaptation (TTA) seeks to adapt models online without the source datasets and ground truth labels of test data streams [26].

Existing TTA methods typically involve two steps: 1) (Re-)correcting Batch Normalization Statistics: Various batch normalization techniques are used to adjust batch normalization statistics. Examples include Source Batch Normalization (SBN) [9], Test-Time Batch Normalization (TBN) [26], and methods based on Instance Normalization (IN) [6, 29]. 2) Fine-tuning Model Parameters: This can be done through partial updating optimization (adjusting affine parameters of models using self-supervision losses, such as entropy loss [26, 6, 35]) or fully backward optimization (adjusting all parameters of models [27]).
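To make the three kinds of statistics concrete, the following minimal sketch contrasts them on a single channel. It is an illustration in plain Python, not the implementation of any cited method; the helper names (`tbn_stats`, `in_stats`, `normalize`) and the toy values are assumptions introduced here. SBN reuses statistics stored with the source model, TBN pools statistics over the current test batch, and IN computes them separately for each sample.

```python
from statistics import fmean, pvariance

def tbn_stats(batch):
    """TBN: one (mean, variance) pair pooled over every sample in the batch."""
    pooled = [v for sample in batch for v in sample]
    return fmean(pooled), pvariance(pooled)

def in_stats(batch):
    """IN: a separate (mean, variance) pair for each sample."""
    return [(fmean(sample), pvariance(sample)) for sample in batch]

def normalize(sample, mean, var, eps=1e-5):
    """Standardize one sample's activations with the chosen statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in sample]

# Toy single-channel batch whose two samples come from visibly different domains.
batch = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]]
sbn_mean, sbn_var = 1.0, 1.0  # hypothetical statistics stored with the source model

tbn_mean, tbn_var = tbn_stats(batch)
per_sample = in_stats(batch)

# SBN ignores the test batch, TBN centres the whole batch on its pooled mean,
# and IN centres each sample on its own mean.
sbn_out = [normalize(s, sbn_mean, sbn_var) for s in batch]
tbn_out = [normalize(s, tbn_mean, tbn_var) for s in batch]
in_out = [normalize(s, m, v) for s, (m, v) in zip(batch, per_sample)]
```

In this toy batch the pooled TBN mean sits between the two samples, so neither sample is centred by it, whereas IN centres both; this is the intuition behind preferring instance-level statistics when a batch mixes domains.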

However, previous TTA studies have mainly focused on static data streams, where the test data stream changes only slightly and the test samples within a batch are drawn from one data distribution, referred to as the low-diversity pattern [26, 28, 6, 35]. In real-world applications, the test data stream often exhibits a dynamic nature: within a batch of data, the test samples can come from one or multiple different data sources, referred to as the high-diversity pattern. This pattern poses significant challenges for maintaining the QoE of intelligent services, as traditional TTA methods may not be robust enough to handle such scenarios effectively.

Traditional TTA methods with fixed strategies struggle to address dynamic data streams characterized by high-diversity patterns. As analyzed in Sec. 2.2, our measurements reveal several key insights:

  • Specific Batch Normalization techniques are inadequate for dynamic data patterns. When test samples are of low diversity, the use of TBN to correct SBN statistics can enhance performance. However, when test samples have high diversity, IN proves more effective in handling diverse data distributions.

  • Back-propagation can be a double-edged sword in dynamic data patterns. For test samples with high-diversity, the back-propagation process can significantly decrease accuracy. Conversely, this process can improve accuracy when data are sampled with low-diversity.

Figure 1: DATTA overview. DATTA consists of three modules: DD takes advantage of an Instance-Normalization-guided projection to capture the data features. Based on the discrimination results, DABN and AMF conduct an adaptive BN re-correcting and model fine-tuning strategy.

Therefore, it is crucial to design strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance. Motivated by these observations, we propose a one-size-fits-all approach (see Fig. 1), called Diversity Adaptive Test-Time Adaptation (DATTA). The main idea of DATTA is to distinguish between high-diversity and low-diversity batches by calculating the diversity score of data batches in dynamic scenarios. By adaptively adjusting the batch normalization method and fine-tuning model parameters according to the characteristics of the data batches, DATTA enhances model robustness. DATTA includes three key components: Diversity Discrimination (DD), Diversity Adaptive Batch Normalization (DABN), and Diversity Adaptive Fine-Tuning (DAFT).

In the DD component, we compute a few statistics for the Diversity Score in test batch data to identify high and low diversity batches. In DABN, we introduce a dynamically aggregated batch normalization that considers SBN, TBN, and IN based on the result of DD, enabling the model to obtain a more robust representation. In DAFT, the model dynamically selects data batches for fine-tuning based on the diversity score to prevent error accumulation and crashes. Our contributions can be summarized as follows:

  • Effectiveness. We propose a one-size-fits-all approach that uses a diversity score based on angle variance to differentiate between various scenarios. Our DABN empowers the model to achieve a more robust representation, enabling it to adapt effectively in both low-diversity and high-diversity patterns. Moreover, our method circumvents unnecessary or even harmful model fine-tuning, paving the way for further enhancements. Experiments on benchmark datasets demonstrate robust performance compared to state-of-the-art studies, with up to a 21% increase in accuracy.

  • Efficiency. We introduce a lightweight distribution discriminator module that can be executed within a single forward propagation. Our method can switch to a backward-free mode on high-diversity data patterns, thereby considerably reducing computational expenses.

  • Promising Results. We conduct experiments on mainstream shift datasets and show that our method demonstrates robustness while maintaining good performance of the model. It can effectively respond to data stream patterns, and the selective model Fine-Tuning approach is more lightweight. Empirical evaluations on benchmark datasets indicate a substantial improvement in performance, achieving up to a 21% increase in accuracy compared to state-of-the-art methodologies.

2 Background and Motivation

2.1 Revisiting TTA

Test-time Adaptation. Let $\mathcal{D}_{\mathcal{S}}=\{\mathcal{X}^{\mathcal{S}},\mathcal{Y}\}$ denote the source domain data and $\mathcal{D}_{\mathcal{T}}=\{\mathcal{X}^{\mathcal{T}},\mathcal{Y}\}$ denote the target domain data. Each data instance and corresponding label pair $(\mathbf{x}_i,y_i)\in\mathcal{X}^{\mathcal{S}}\times\mathcal{Y}$ in the source domain follows a distribution $P_{\mathcal{S}}(\mathbf{x},y)$. Similarly, each target test sample and its label at test time $t$, $(\mathbf{x}_t,y_t)\in\mathcal{X}^{\mathcal{T}}\times\mathcal{Y}$, follow a distribution $P_{\mathcal{T}}(\mathbf{x},y)$, with $y_t$ unknown to the learner.
The standard covariate shift assumption in domain adaptation is $P_{\mathcal{S}}(\mathbf{x})\neq P_{\mathcal{T}}(\mathbf{x})$ and $P_{\mathcal{S}}(y\mid\mathbf{x})=P_{\mathcal{T}}(y\mid\mathbf{x})$. Unlike traditional domain adaptation, which uses pre-collected $\mathcal{D}_{\mathcal{S}}$ and $\mathcal{X}^{\mathcal{T}}$, TTA continuously adapts a pre-trained model $f_{\theta}(\cdot)$ from $\mathcal{D}_{\mathcal{S}}$ using only the test sample obtained at time $t$.

TTA on dynamic streams. Previous TTA methods typically assume that at each time $t$, each target sample $(\mathbf{x}_t,y_t)\in\mathcal{X}^{\mathcal{T}}\times\mathcal{Y}$ follows a time-invariant distribution $P_{\mathcal{T}}(\mathbf{x},y)$, denoted as the low-diversity pattern. However, in many real-world scenarios, the data obtained at test time are dynamic and come from multiple sources. Specifically, the data may be drawn from one or multiple distributions $\{P_{\mathcal{T}}^{i}\}_{i=1}^{M}$, denoted as the high-diversity pattern.

2.2 Motivation Observations

In this section, we explore the performance of current TTA methods in dynamic scenarios. Our findings indicate that traditional static Batch Normalization (BN) designs and static fine-tuning methods are inadequate for adapting to these dynamic environments. Additionally, we investigate how increasing diversity in data distributions in dynamic scenarios exacerbates the impact on performance.

Figure 2: (a) No one BN fits all data diversity patterns. We compare the different BN methods in TTA under the different data patterns without a backward process. 0.4-BN, 0.6-BN and 0.8-BN use $\alpha$-BN [34] with $\alpha$ set to 0.4, 0.6 and 0.8, respectively. IABN is introduced by NOTE [6]. (b) Model fine-tuning in different patterns. (c) Increasing the Domain Number in the data stream affects the Accuracy of the method. (d) Impact of the number of domains on diversity score. We analyze the variation of the diversity score for different mixes of domain numbers and show the plausibility of the diversity score.

Observation 1: No one-size-fits-all BN method for different data diversity patterns. We evaluate conventional BN statistics adaptation in TTA methods, including SBN, TBN, $\alpha$-BN [34] (a combination of TBN and SBN where a larger $\alpha$ indicates greater participation of SBN), and Instance Aware Batch Normalization (IABN) [6], under different data diversity patterns without a backward process.

As shown in Fig. 2(a), in the high-diversity pattern, all BN methods experience significant performance drops. Specifically, $\alpha$-BN with $\alpha=0.6$, which performs well in the low-diversity pattern, sees a decrease in accuracy of over 10%, showing that methods that excel in low-diversity settings cannot maintain their effectiveness in high-diversity scenarios. Meanwhile, IABN stands out in high-diversity scenarios, demonstrating superior performance with an accuracy of approximately 66.63%. This suggests that when test samples come from multiple domains, both IN and SBN are needed to correct the BN statistics effectively.
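The interpolation behind the $\alpha$-BN baselines compared above can be sketched as a convex combination of a source statistic and a test-batch statistic. This follows the description in the text (a larger $\alpha$ gives SBN more weight); the function name `alpha_bn` and the numbers are illustrative assumptions, not values from the experiments.

```python
def alpha_bn(source_stat, test_stat, alpha):
    """Blend a source (SBN) statistic with a test-batch (TBN) statistic.

    alpha = 1.0 reproduces SBN, alpha = 0.0 reproduces TBN.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * source_stat + (1.0 - alpha) * test_stat

# Hypothetical per-channel means: source statistic 0.0, test-batch statistic 10.0.
# The three alpha values mirror the 0.4-BN, 0.6-BN and 0.8-BN baselines.
blended = {a: alpha_bn(0.0, 10.0, a) for a in (0.4, 0.6, 0.8)}
```

Because $\alpha$ is fixed in advance, the blend cannot react to how mixed a given batch is, which is the limitation that Observation 1 points to.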

Observation 2: Fine-tuning in high-diversity patterns could potentially harm the model. We compare the accuracy of model fine-tuning for CoTTA [27], TENT [25], NOTE [5], SAR [18], RoTTA [36], and ViDA [13] under low-diversity and high-diversity patterns. As shown in Fig. 2(b), when test samples originate from multiple distributions, model fine-tuning with NOTE can reduce accuracy by over 3% (TENT loses nearly 2%). A potential reason is the accumulation of errors caused by erroneous pseudo-labels, which leads to model collapse.

Challenge: How to measure distribution diversity and mitigate its impact on performance? Distribution diversity is defined as the number of different domains from which test data within a single batch are drawn at a given test time. The more distributions present, the greater the diversity. Fig. 2(c) illustrates that increasing the number of domains generally leads to a decrease in accuracy across all methods. For example, both SAR and TENT exhibit significant performance declines as the number of domains increases: SAR drops from 75% to 45%, and TENT falls from 70% to 50%.

To address this challenge, it is crucial to develop strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance.

3 Proposed Methods

Based on the above analysis, the key challenge lies in distinguishing different diversity patterns. To address this, we introduce a diversity discrimination module, detailed in Sec. 3.1.2, which effectively indicates the degree of diversity, as illustrated in Fig. 2(d). With the diversity score provided by this module, we can adjust the BN statistics in various ways and design adaptive fine-tuning mechanisms accordingly, as detailed in Sec. 3.2 and Sec. 3.3.

3.1 Diversity Discrimination (DD)

3.1.1 Diversity Score.

We evaluate batch data distribution diversity by measuring the angle dispersion between the feature map and the TBN mean statistics, using the SBN mean as the reference point. This approach captures how much batch data deviates from the source domain distribution, with greater angle dispersion indicating higher diversity.

Let $f$ denote the feature map generated by the model's first convolutional layer. Each activation value in $f$ represents the response of a specific filter to a local region of the input image. We denote the average of feature values in the test-time batch normalization as $\mu_{\text{test}}$ and the average of feature values during the training time of the source model as $\mu_{\text{source}}$.

To quantify this, we introduce the following definitions:

Definition 1.

Discrepancy Angle: The data discrepancy angle $\theta$ quantifies the difference between the feature vector $v_f$ and the test distribution vector $v_t$. It is defined as:

$$\theta=\cos^{-1}\left(\frac{v_f\cdot v_t}{\|v_f\|\,\|v_t\|}\right).$$

Here, the feature vector $v_f$ represents the difference between the source domain mean and the feature map $f$: $v_f=\mu_{\text{source}}-f$. Similarly, the test distribution vector $v_t$ is defined as the difference between the source domain mean and the test-time batch mean: $v_t=\mu_{\text{source}}-\mu_{\text{test}}$.

The diversity score $S$ is defined as the variance of the angles $\theta_i$ within each batch. It is calculated as follows:

$$S=\frac{1}{N}\sum_{i=1}^{N}\left(\theta_i-\overline{\theta}\right)^{2}, \tag{1}$$

where $\theta_i$ is the discrepancy angle of the $i$-th sample, $\overline{\theta}$ is the mean of all calculated angles $\theta_i$ within the batch, and $N$ is the number of samples in the batch.

This method allows us to effectively measure the diversity of data distribution in batch processing, providing a robust metric for analysis without significant computational costs.
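The computation of the discrepancy angles and of the diversity score can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: each sample's feature map is assumed to be already reduced to one vector with the same length as the source mean (for example, per-channel averages), $\mu_{\text{test}}$ is taken as the mean of these vectors over the batch, and the names `angle` and `diversity_score` are hypothetical.

```python
import math

def angle(u, v, eps=1e-12):
    """Angle in radians between two vectors, via the clamped cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / max(norm_u * norm_v, eps)
    return math.acos(max(-1.0, min(1.0, cos)))

def diversity_score(features, mu_source):
    """Variance of the discrepancy angles of one batch.

    features: list of N per-sample feature vectors (assumed reduction of f).
    mu_source: source-domain mean vector of the same length.
    """
    n = len(features)
    dim = len(mu_source)
    mu_test = [sum(f[c] for f in features) / n for c in range(dim)]
    v_t = [s - t for s, t in zip(mu_source, mu_test)]            # mu_source - mu_test
    angles = [angle([s - x for s, x in zip(mu_source, f)], v_t)  # v_f = mu_source - f
              for f in features]
    mean_angle = sum(angles) / n
    return sum((a - mean_angle) ** 2 for a in angles) / n

mu_source = [0.0, 0.0]
tight_batch = [[1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]               # samples point the same way
mixed_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]  # samples point in different directions

score_tight = diversity_score(tight_batch, mu_source)
score_mixed = diversity_score(mixed_batch, mu_source)
```

In this toy setting the mixed batch yields the larger score, matching the intended behaviour that greater angle dispersion indicates higher diversity. The cost is one pass over the batch features and needs no backward computation.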

3.1.2 Adaptive Discrimination with Diversity Score.

The adaptive discrimination with diversity scores is designed to dynamically distinguish between high-diversity and low-diversity batches in a data stream using the diversity score. This mechanism includes a module called the Diversity Cache, which collects diversity scores during test time. At each time step $t$, the Diversity Cache stores the diversity score $S_t$ of the current test data samples and calculates the diversity threshold $Q_t$ dynamically.

The diversity threshold $Q_t$ at test time $t$ is calculated as follows:

$$Q_t=P_{\lambda}\left(\{S_1,S_2,\ldots,S_t\}\right), \tag{2}$$

where $P_{\lambda}$ denotes the $\lambda$-th percentile function.

In practice, during the initial stages of test-time adaptation, the diversity cache begins by collecting diversity scores from the data stream over a period denoted as $T_{\text{init}}$. This cold start phase provides a preliminary assessment of the data distribution for the current service. The diversity scores gathered during this period are utilized to compute an initial diversity threshold, which is instrumental in distinguishing between high-diversity and low-diversity batches within the data stream. After $T_{\text{init}}$, the diversity cache continues to collect diversity scores at each step and dynamically updates the diversity threshold using these scores. This continuous update allows the system to flexibly adjust the identification of high-diversity and low-diversity batches based on real-time data.
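A Diversity Cache with a cold start and a percentile threshold can be sketched as follows. Several details are assumptions made for illustration: the class and method names are hypothetical, the percentile uses linear interpolation between ranks, no decision is returned until $T_{\text{init}}$ scores have been collected, a batch with $S_t \geq Q_t$ is treated as high-diversity, and the default values of `lam` and `t_init` are placeholders rather than settings from the paper.

```python
import math

def percentile(values, lam):
    """lam-th percentile (0..100) with linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * lam / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

class DiversityCache:
    """Collects diversity scores and derives the threshold Q_t of Eq. (2)."""

    def __init__(self, lam=50.0, t_init=3):
        self.lam = lam          # percentile used for the threshold (placeholder value)
        self.t_init = t_init    # number of cold-start steps (placeholder value)
        self.scores = []

    def update(self, score):
        """Store S_t and return (Q_t, is_high); both are None during the cold start."""
        self.scores.append(score)
        if len(self.scores) < self.t_init:
            return None, None
        q_t = percentile(self.scores, self.lam)
        return q_t, score >= q_t

cache = DiversityCache(lam=50.0, t_init=3)
decisions = [cache.update(s) for s in (0.1, 0.2, 0.3, 0.0)]
```

This sketch keeps every score, so memory grows with the length of the stream; a bounded window would be a natural variation, although the text does not say whether one is used.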

3.2 Diversity Adaptive Batch Normalization (DABN)

As outlined in Sec. 2.2, high diversity scores indicate significant variability, making it difficult for methods suited to low-diversity data to normalize feature maps using test-time statistics. Conversely, low diversity scores suggest a more concentrated data distribution, where strong corrections using instance normalization statistics can hinder normalization, and over-correction can fail to remove uninformative variations.

To address these issues, we propose DABN, which effectively manages varying diversity scores. DABN reduces excessive corrections of BN statistics in low diversity score batches while maintaining robust performance in high diversity score batches. This method also mitigates the issue of internal covariate shift, where modified BN layer outputs diverge from the model's outputs trained on the source domain. DABN incorporates BN statistics $\mu_{\text{source}}$ and $\sigma^{2}_{\text{source}}$ from extensive source domain training into the prediction process. By applying different correction strategies based on the diversity score of the data batch, DABN minimizes internal covariate shifts, thereby improving prediction accuracy.

Drawing from insights in IABN [6], we assume that the sample mean and sample variance follow a sampling distribution with a sample size of $L$, represented by a normal distribution. The variances of the sample mean $s_{\mu}$ and sample variance $s_{\sigma}$ are given by:

$$s_{\mu}=\frac{\sigma_{\text{source}}^{2}}{L},\quad s_{\sigma}=\frac{2\sigma_{\text{source}}^{4}}{L-1}. \tag{3}$$
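As a quick numerical illustration of Eq. (3), take the hypothetical values $\sigma^{2}_{\text{source}}=4$ and $L=64$ (numbers chosen here for easy arithmetic, not taken from the experiments):

```latex
s_{\mu} = \frac{\sigma_{\text{source}}^{2}}{L} = \frac{4}{64} = 0.0625,
\qquad
s_{\sigma} = \frac{2\sigma_{\text{source}}^{4}}{L-1} = \frac{2 \cdot 16}{63} \approx 0.508 .
```

Both quantities shrink as $L$ grows, so statistics estimated from more elements are expected to lie closer to the source statistics.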

In high-diversity score batches, DABN adjusts the instance normalization statistics $\mu_{\text{instance}}$ and $\sigma^{2}_{\text{instance}}$ to align with the source domain batch normalization statistics $\mu_{\text{source}}$ and $\sigma^{2}_{\text{source}}$. For low-diversity score batches, DABN primarily relies on the current batch's batch normalization statistics $\mu_{\text{test}}$ and $\sigma^{2}_{\text{test}}$ to adapt to the current data distribution while mitigating internal covariate shift. Specifically, we use the following statistics:

$$\mu_{\text{DABN}}=\mu_{\text{source}}+\alpha\cdot\psi\left(\mu_{\text{instance}};\mu_{\text{test}};\mu_{\text{source}};\kappa s_{\mu}\right), \tag{4}$$
$$\sigma_{\text{DABN}}^{2}=\sigma_{\text{source}}^{2}+\alpha\cdot\psi\left(\sigma_{\text{instance}}^{2};\sigma_{\text{test}}^{2};\sigma_{\text{source}}^{2};\kappa s_{\sigma}\right), \tag{5}$$

where the function $\psi$ is used to adjust the alignment between the instance and source statistics based on the diversity score $S_t$.

$$\psi(x;y;z;\kappa)=\begin{cases}0, & \text{if } x-z=\kappa \text{ and } S_t>Q_t,\\ x-z-\kappa, & \text{if } x-z>\kappa \text{ and } S_t\geq Q_t,\\ x-z+\kappa, & \text{if } x-z<\kappa \text{ and } S_t\geq Q_t,\\ y-z, & \text{if } S_t<Q_t.\end{cases} \tag{6}$$

Here, $\alpha$ is a hyperparameter of DABN determining the adjustment level of current batch information, and $\kappa$ determines the confidence level of source domain statistics.

In summary, DABN is described as follows:

$$\textbf{DABN}:=\gamma\cdot\frac{\mathbf{f}-\mu_{\text{DABN}}}{\sqrt{\sigma_{\text{DABN}}^{2}+\epsilon}}+\beta. \tag{7}$$
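Eqs. (3) to (7) can be put together in a short per-channel sketch. It is an interpretation under stated assumptions rather than the authors' implementation: the high-diversity branch of $\psi$ is read as a soft-shrinkage with a dead zone $|x-z|\leq\kappa$, in the style of IABN [6], which is an assumption about the intended behaviour and differs from the literal conditions printed in Eq. (6). The function names, the scalar single-channel interface, and the example values are hypothetical; the defaults `alpha=0.2` and `kappa=4.0` are the values reported in the experimental setup.

```python
import math

def psi(x, y, z, margin, high_diversity):
    """Correction term of Eqs. (4)-(5) for one channel.

    x: instance statistic, y: test-batch statistic, z: source statistic.
    ASSUMPTION: the high-diversity branch is a soft-shrinkage with a dead zone
    |x - z| <= margin, in the style of IABN [6].
    """
    if not high_diversity:      # S_t < Q_t: move toward the test-batch statistic
        return y - z
    d = x - z
    if d > margin:
        return d - margin
    if d < -margin:
        return d + margin
    return 0.0                  # instance statistic lies within the trusted margin

def dabn_stats(inst, test, source, L, high_diversity, alpha=0.2, kappa=4.0):
    """Return (mu_DABN, var_DABN); inst, test and source are (mean, variance) pairs."""
    (mu_i, var_i), (mu_t, var_t), (mu_s, var_s) = inst, test, source
    s_mu = var_s / L                          # Eq. (3)
    s_sigma = 2.0 * var_s ** 2 / (L - 1)      # Eq. (3)
    mu = mu_s + alpha * psi(mu_i, mu_t, mu_s, kappa * s_mu, high_diversity)
    var = var_s + alpha * psi(var_i, var_t, var_s, kappa * s_sigma, high_diversity)
    return mu, var

def dabn(f, mu, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Eq. (7) applied to a single activation f."""
    return gamma * (f - mu) / math.sqrt(var + eps) + beta

# Hypothetical one-channel statistics, given as (mean, variance) pairs.
inst, test, source = (1.0, 1.0), (2.0, 3.0), (0.0, 1.0)
low_mu, low_var = dabn_stats(inst, test, source, L=64, high_diversity=False)
high_mu, high_var = dabn_stats(inst, test, source, L=64, high_diversity=True)
```

With these numbers the low-diversity branch shifts both statistics toward the test batch, while the high-diversity branch shifts only the mean, because the instance variance already lies within the margin around the source variance.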

3.3 Diversity Adaptive Fine-Tuning (DAFT)

After updating the BN layer’s statistical values, the model’s affine parameters must be adjusted accordingly. However, not all updates are effective or reliable. Our experiments and analysis indicate that parameter updates are ineffective and potentially detrimental when data comes from high-diversity score batches. Therefore, updates should be applied only when the batch data has a low diversity score to avoid wasteful and harmful adjustments. Following this principle, the model updates parameters exclusively for low-diversity score batches.

The loss function is defined as follows:

$$
\mathcal{L}=\mathbb{I}_{\left\{S_{t}>Q_{t}\right\}}\operatorname{Ent}_{\boldsymbol{\theta}}(\mathbf{x}),
\tag{8}
$$

where $\operatorname{Ent}_{\boldsymbol{\theta}}(\mathbf{x})$ is the cross-entropy loss, $\mathbf{x}$ is the model input, and $\mathbb{I}_{\left\{S>Q\right\}}$ is the indicator function that equals 1 if the diversity score $S$ is greater than the threshold $Q$, and 0 otherwise. This ensures that parameter updates are only performed when the batch data has a low diversity score, thereby avoiding wasteful and potentially harmful updates when the diversity score is high.
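The gating in Eq. (8) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: $\operatorname{Ent}$ is taken here to be the Shannon entropy of the softmax prediction (as in TENT-style entropy minimization), the loss is averaged over the batch, and the indicator follows Eq. (8) exactly as printed. The helper names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits: List[float]) -> float:
    """Shannon entropy of the softmax prediction (assumed form of Ent)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def gated_entropy_loss(batch_logits: List[List[float]],
                       s_t: float, q_t: float) -> float:
    """Indicator I{S_t > Q_t} times the batch-mean entropy, as in Eq. (8)."""
    if not s_t > q_t:
        # Indicator is 0: the batch contributes no loss, so no update.
        return 0.0
    entropies = [prediction_entropy(logits) for logits in batch_logits]
    return sum(entropies) / len(entropies)


# Toy two-class batch with uniform predictions (hypothetical values).
active = gated_entropy_loss([[0.0, 0.0]], s_t=1.0, q_t=0.5)
skipped = gated_entropy_loss([[0.0, 0.0]], s_t=0.2, q_t=0.5)
```

In a training loop, a zero indicator would normally mean skipping the backward pass altogether, which is where the time saving comes from.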

4 Experiments

4.1 Experimental Setup

We implemented the proposed method DATTA and the baselines on the TTAB framework [38]. Detailed deployment information, including hyperparameter settings for each baseline, the datasets used, and the software and hardware environment, is provided below.

Environment. All experiments were run on an NVIDIA GeForce RTX 4090 GPU. The experimental code was developed using PyTorch 1.10.1 and Python 3.9.7.

Hyperparameter Configurations. The hyperparameters are divided into two categories: those shared by all baselines and those specific to each method. 1) The shared hyperparameters for model adaptation are as follows: the optimizer is SGD, the learning rate (LR) is 0.0001, and the batch size for all test-time adaptations is 64. After the test data is input into the model, all data is first forwarded once to obtain the inference result. 2) The hyperparameters specific to each method follow the original references: TBN follows the settings in [26], IABN follows the settings in [6], and $\alpha$-BN also follows the settings in [26]. For DABN, $\alpha$ determines the adjustment level based on the current batch information and is set to 0.2 in our experiments, while $\kappa$ determines the confidence level of the source domain statistics and is set to 4.
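For reference, the settings listed above can be collected in a small configuration object. This is only a sketch: the class name `DattaConfig` and its field names are hypothetical and do not correspond to the TTAB framework's actual configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DattaConfig:
    """Hyperparameters reported in the experimental setup (names are illustrative)."""
    optimizer: str = "SGD"
    learning_rate: float = 1e-4  # shared by all methods
    batch_size: int = 64         # shared by all methods
    alpha: float = 0.2           # DABN: adjustment level of current-batch information
    kappa: float = 4.0           # DABN: confidence level of source-domain statistics


config = DattaConfig()
```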

Baselines. We compare our method with various cutting-edge TTA methods. Source assesses the model trained on source data directly on target data without adaptation. BN Stats [14] combines TBN and SBN statistics for updated BN layer statistics. TENT [26] minimizes prediction entropy to boost model confidence, estimates normalization statistics, and updates channel-wise affine transformations online. EATA [17] filters high-entropy samples and uses a Fisher regularizer to stabilize updates and prevent catastrophic forgetting. CoTTA [27] uses weight and augmentation averaging to reduce errors and randomly resets neurons to pre-trained weights to retain knowledge. NOTE [6] corrects normalization for out-of-distribution samples and simulates an i.i.d. data stream from a non-i.i.d. stream. SAR [19] selectively minimizes entropy by excluding noisy samples and optimizing entropy and surface sharpness for stability. RoTTA [35] simulates an i.i.d. data stream by constructing a sampling pool and adapting BN layer statistics. ViDA [13] decomposes features into high-rank and low-rank components for knowledge sharing. We assume the source data is inaccessible during TTA. The model is continuously updated online, without modifying BN during training. Following [38, 6], we use a test batch size of 64 and perform a single adaptation epoch, with method-specific hyperparameters as reported in their papers or official implementations [38].

Datasets. We use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets [8] from the TTA benchmark (TTAB) [38] to evaluate model robustness against corruptions. These datasets include 15 corruption types (e.g., Gaussian noise, shot noise, impulse noise, defocus blur) with 5 severity levels each. CIFAR-10-C and CIFAR-100-C are small-scale datasets with 10 and 100 classes respectively, containing 10,000 images per corruption type. ImageNet-C is a large-scale dataset with 1,000 classes and 50,000 images per corruption type.

Scenarios. In our experiments, we utilized four scenarios: Dynamic, Dynamic-S, Non-I.I.D., and Multi-non-I.I.D. The Dynamic scenario involves each batch of input samples being composed of data from several different distributions, leading to a high diversity scenario while being independent and identically distributed (i.i.d.). The Dynamic-S scenario features input samples within each batch that are i.i.d. but either come from multiple domains (resulting in high diversity) or from a single domain (resulting in low diversity). In the Non-I.I.D. scenario, we mix data from 15 domains, and each time the data is input in the form of a mixed domain, causing the domain of the test samples to be unstable and change in real time, representing a low diversity situation within each batch. The Multi-non-I.I.D. scenario, similar to Dynamic, uses batches composed of data from multiple distributions; however, like Non-I.I.D., it mixes data from 15 different domains, resulting in an unstable and dynamically changing domain for the test samples.
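To make the difference between high-diversity and low-diversity batches concrete, the sketch below generates only the domain labels of each batch. It is an illustration under assumptions: the domain names are placeholders, `p_mixed` is a hypothetical parameter controlling how often a batch is mixed, and the actual sampling procedure used in the experiments is not specified here.

```python
import random
from typing import List

# Placeholder labels for the 15 corruption domains (hypothetical names).
DOMAINS = [f"corruption_{i}" for i in range(15)]


def sample_batch_domains(batch_size: int, mixed: bool,
                         rng: random.Random) -> List[str]:
    """Return the domain label of every sample in one batch."""
    if mixed:
        # High diversity: every sample may come from a different domain.
        return [rng.choice(DOMAINS) for _ in range(batch_size)]
    # Low diversity: the whole batch shares a single domain.
    return [rng.choice(DOMAINS)] * batch_size


def dynamic_s_stream(num_batches: int, batch_size: int, p_mixed: float,
                     rng: random.Random) -> List[List[str]]:
    """Dynamic-S-like stream: each batch is mixed with probability p_mixed."""
    return [sample_batch_domains(batch_size, rng.random() < p_mixed, rng)
            for _ in range(num_batches)]


rng = random.Random(0)
stream = dynamic_s_stream(num_batches=4, batch_size=64, p_mixed=0.5, rng=rng)
domains_per_batch = [len(set(batch)) for batch in stream]
```

Setting `p_mixed=1.0` yields a stream in which every batch is mixed, which corresponds to the Dynamic setting as described above; `p_mixed=0.0` yields single-domain batches only.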

Tab. 1: Comparison of state-of-the-art methods on CIFAR10-C, CIFAR100-C, and ImageNet-C at severity level 5 with a Batch Size of 64 under Dynamic and Dynamic-S scenarios, evaluated by Accuracy (%). The bold value signifies the best result, and the second best accuracy is underlined.

| Method | Venue | Dynamic: CIFAR10-C | Dynamic: CIFAR100-C | Dynamic: ImageNet-C | Dynamic: Avg. ↑ | Dynamic-S: CIFAR10-C | Dynamic-S: CIFAR100-C | Dynamic-S: ImageNet-C | Dynamic-S: Avg. ↑ | Avg-All ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Source | CVPR'16 | 57.39 | 28.59 | 25.96 | 37.31 | 57.38 | 28.58 | 25.77 | 37.24 | 37.28 |
| BN Stats | ICLR'21 | 62.41 | 33.23 | 22.37 | 39.33 | 69.86 | 41.62 | 25.42 | 45.63 | 42.48 |
| TENT | CVPR'21 | 63.05 | 32.48 | 18.47 | 38.00 | 71.58 | 41.13 | 23.71 | 45.47 | 41.73 |
| EATA | ICML'22 | 59.97 | 34.52 | 19.35 | 37.94 | 67.97 | 41.95 | 23.05 | 44.33 | 41.13 |
| NOTE | NIPS'22 | 61.48 | 31.91 | 20.56 | 37.98 | 66.58 | 24.77 | 21.32 | 37.55 | 37.77 |
| CoTTA | CVPR'22 | 48.50 | 20.26 | 5.43 | 24.73 | 50.31 | 22.01 | 25.29 | 32.54 | 28.63 |
| SAR | ICLR'23 | 62.60 | 31.80 | 18.42 | 37.60 | 71.03 | 40.78 | 27.73 | 46.51 | 42.06 |
| RoTTA | CVPR'23 | 48.96 | 21.88 | 22.03 | 30.95 | 49.21 | 24.22 | 23.90 | 32.44 | 31.70 |
| ViDA | ICLR'24 | 59.94 | 30.75 | 18.78 | 36.49 | 67.95 | 39.47 | 10.20 | 39.21 | 37.85 |
| Ours | Proposed | 69.55 | 36.24 | 25.05 | 43.61 | 72.33 | 40.83 | 28.78 | 47.31 | 45.46 |

Tab. 2: Comparison of latency (s) for processing CIFAR-10-C, CIFAR-100-C, and ImageNet-C using a single RTX 4090 GPU on ResNet-50.

| Method | Venue | CIFAR10-C | CIFAR100-C | ImageNet-C |
|---|---|---|---|---|
| Source | CVPR'16 | 0.007 | 0.007 | 0.051 |
| BN Stats | ICLR'21 | 0.018 | 0.018 | 0.068 |
| TENT | CVPR'21 | 0.068 | 0.070 | 0.169 |
| EATA | ICML'22 | 0.060 | 0.070 | 0.170 |
| NOTE | NIPS'22 | 2.142 | 2.190 | 1.896 |
| CoTTA | CVPR'22 | 0.543 | 0.541 | 5.322 |
| SAR | ICLR'23 | 0.094 | 0.094 | 0.295 |
| RoTTA | CVPR'23 | 0.297 | 0.297 | 0.603 |
| ViDA | ICLR'24 | 0.532 | 0.530 | 5.236 |
| Ours | Proposed | 0.029 | 0.029 | 0.074 |

Tab. 3: Comparison of state-of-the-art methods on CIFAR10-C, CIFAR100-C, and ImageNet-C at severity level 5 with a Batch Size of 64 under Non-I.I.D. and Multi-non-I.I.D. scenarios, evaluated by Accuracy (%). The bold value signifies the best result, and the second best accuracy is underlined.

| Method | Venue | Non-I.I.D.: CIFAR10-C | Non-I.I.D.: CIFAR100-C | Non-I.I.D.: ImageNet-C | Non-I.I.D.: Avg. ↑ | Multi-non-I.I.D.: CIFAR10-C | Multi-non-I.I.D.: CIFAR100-C | Multi-non-I.I.D.: ImageNet-C | Multi-non-I.I.D.: Avg. ↑ | Avg-All ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Source | CVPR'16 | 57.39 | 28.59 | 25.77 | 37.25 | 57.39 | 28.59 | 25.84 | 37.27 | 37.26 |
| BN Stats | ICLR'21 | 27.32 | 13.11 | 22.46 | 20.96 | 24.84 | 18.44 | 18.49 | 20.59 | 20.77 |
| TENT | CVPR'21 | 24.40 | 11.69 | 22.89 | 19.66 | 20.13 | 14.92 | 18.13 | 17.72 | 18.69 |
| EATA | ICML'22 | 27.43 | 5.33 | 24.23 | 18.99 | 24.80 | 9.97 | 20.51 | 18.42 | 18.71 |
| NOTE | NIPS'22 | 64.98 | 26.29 | 11.67 | 34.31 | 63.24 | 24.75 | 10.26 | 32.74 | 33.53 |
| CoTTA | CVPR'22 | 20.08 | 8.73 | 9.04 | 12.61 | 19.11 | 13.22 | 5.44 | 12.58 | 12.60 |
| SAR | ICLR'23 | 24.78 | 9.36 | 23.11 | 19.08 | 20.43 | 14.00 | 18.35 | 17.59 | 18.33 |
| RoTTA | CVPR'23 | 57.47 | 37.03 | 26.58 | 40.36 | 40.09 | 22.24 | 22.26 | 28.19 | 34.27 |
| ViDA | ICLR'24 | 27.50 | 12.92 | 22.53 | 20.98 | 24.77 | 18.22 | 18.55 | 20.51 | 20.74 |
| Ours | Proposed | 69.16 | 35.50 | 29.14 | 44.60 | 67.83 | 35.47 | 24.74 | 42.68 | 43.64 |

4.2 Experimental Results and Analysis under Different Scenarios

We performed experiments in four separate scenarios: Dynamic, Dynamic-S, Non-I.I.D., and Multi-non-I.I.D. In alignment with the configurations used in prior studies, we selected the most severely corrupted samples (level 5) from each type of corruption.

Tab. 1 reports the accuracy of various TTA methods in the Dynamic and Dynamic-S scenarios. Our approach achieves the highest average accuracy in both. In the Dynamic scenario, its average accuracy (43.61%) is approximately 19 percentage points higher than the weakest baseline (CoTTA, 24.73%) and about 4 points higher than the strongest baseline (BN Stats, 39.33%). This indicates that our method has inherent strengths in managing batch data with multiple distributions. In the Dynamic-S scenario, its average accuracy (47.31%) is about 15 points higher than the weakest baseline (RoTTA, 32.44%) and about 0.8 points higher than the strongest baseline (SAR, 46.51%). This underscores the effectiveness of our method in handling both static and dynamic patterns.
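The margins quoted for the Dynamic scenario follow directly from the "Avg." column of Tab. 1. The short check below recomputes them; the values are copied from the table, while the variable names are illustrative.

```python
# Average accuracy (%) in the Dynamic scenario, copied from Tab. 1.
dynamic_avg = {
    "Source": 37.31, "BN Stats": 39.33, "TENT": 38.00, "EATA": 37.94,
    "NOTE": 37.98, "CoTTA": 24.73, "SAR": 37.60, "RoTTA": 30.95,
    "ViDA": 36.49, "Ours": 43.61,
}

# Everything except the proposed method counts as a baseline here.
baselines = {name: acc for name, acc in dynamic_avg.items() if name != "Ours"}
strongest = max(baselines, key=baselines.get)
weakest = min(baselines, key=baselines.get)

# Gaps in percentage points between the proposed method and the baselines.
gap_to_strongest = round(dynamic_avg["Ours"] - baselines[strongest], 2)
gap_to_weakest = round(dynamic_avg["Ours"] - baselines[weakest], 2)
```

The gaps are differences in percentage points of accuracy, not relative improvements.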

Tab. 2 compares the latency of state-of-the-art methods under Dynamic and Dynamic-S scenarios. Our method is competitive in latency. For CIFAR10-C, its latency is 0.029 seconds, far lower than NOTE (2.142 seconds) and CoTTA (0.543 seconds). Although Source (0.007 seconds) and BN Stats (0.018 seconds) are faster, our method remains within an acceptable range for practical use. For CIFAR100-C, our method again takes 0.029 seconds, much lower than NOTE (2.190 seconds) and CoTTA (0.541 seconds); Source (0.007 seconds) and BN Stats (0.018 seconds) are faster, but our method effectively balances efficiency and accuracy. For ImageNet-C, our method achieves a latency of 0.074 seconds, substantially lower than CoTTA (5.322 seconds) and ViDA (5.236 seconds). Only Source (0.051 seconds) and BN Stats (0.068 seconds) are faster, so our method still outperforms most baselines. These results demonstrate that our method provides a strong balance between low latency and high accuracy, making it suitable for real-time applications where both are crucial. This balance enhances QoE for end-users.

Tab. 3 presents the accuracy comparison of TTA methods under Non-I.I.D. and Multi-non-I.I.D. scenarios. In the Non-I.I.D. scenario, our method achieves the highest average accuracy of 44.60%, approximately 4 percentage points higher than the second-best method (RoTTA, 40.36%). Specifically, our method achieves the highest accuracy on CIFAR10-C (69.16%) and ImageNet-C (29.14%), and the second highest accuracy on CIFAR100-C (35.50%). These results indicate a significant improvement in handling domain instability and low diversity within each batch. In the Multi-non-I.I.D. scenario, our method also outperforms the other methods with an average accuracy of 42.68%, about 5 points higher than the second-best entry (Source, 37.27%). Our method shows the highest accuracy on CIFAR10-C (67.83%) and CIFAR100-C (35.47%), and the second highest accuracy on ImageNet-C (24.74%). Across both scenarios, our method achieves an overall average accuracy of 43.64%, about 6 points higher than the overall second-best entry (Source, 37.26%). These results demonstrate that our method not only effectively handles dynamically changing and mixed-domain data but also excels in Non-I.I.D. scenarios.

Tab. 4: Comparison of different modules' performance on the CIFAR10-C dataset (severity level 5) with a Batch Size of 64, evaluated by Accuracy (%). Each method was tested with a ResNet-50 model under Dynamic, Dynamic-S, Non-I.I.D. and Multi-non-I.I.D. scenarios. The highest accuracy for each scenario is highlighted in bold.

| Method | Dynamic | Dynamic-S | Non-I.I.D. | Multi-non-I.I.D. | Avg. ↑ |
|---|---|---|---|---|---|
| Source | 57.39 | 57.39 | 57.39 | 57.39 | 57.39 |
| DAFT | 57.99 | 63.02 | 54.46 | 53.10 | 57.14 |
| DABN | 62.45 | 69.90 | 41.59 | 33.79 | 51.93 |
| DAFT+DABN | 69.55 | 72.33 | 67.51 | 63.85 | 68.31 |

4.3 Ablation Study

To evaluate the contributions of different modules, we conducted experiments on the CIFAR10-C dataset with severity level 5, using a batch size of 64. The performance was assessed using a ResNet-50 model across four different scenarios: Dynamic, Dynamic-S, Non-I.I.D., and Multi-non-I.I.D. The results, measured by accuracy (%), are summarized in Tab. 4. The combination of the DAFT and DABN modules achieves the highest accuracy in all four scenarios, with an average accuracy of 68.31% compared with 57.39% for Source. Used alone, each module falls below Source in the Non-I.I.D. and Multi-non-I.I.D. scenarios, which indicates that integrating both modules is what provides robustness under diverse conditions.
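The ablation numbers can be checked with a few lines of arithmetic. The accuracies below are copied from Tab. 4; the row keys are descriptive labels chosen for this sketch rather than the module names used in the table.

```python
# Accuracy (%) per scenario, copied from Tab. 4, in the order:
# Dynamic, Dynamic-S, Non-I.I.D., Multi-non-I.I.D.
ablation = {
    "source": (57.39, 57.39, 57.39, 57.39),
    "fine-tuning only": (57.99, 63.02, 54.46, 53.10),
    "DABN only": (62.45, 69.90, 41.59, 33.79),
    "both modules": (69.55, 72.33, 67.51, 63.85),
}


def mean(values):
    """Arithmetic mean of a sequence of accuracies."""
    return sum(values) / len(values)


combined_avg = mean(ablation["both modules"])
gain_over_source = combined_avg - mean(ablation["source"])

# Name of the best configuration in each of the four scenarios.
winners = [max(ablation, key=lambda name: ablation[name][i]) for i in range(4)]
```

The combined configuration wins every scenario, and its average exceeds Source by roughly 10.9 percentage points.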

5 Related Work

5.1 Unsupervised Domain Adaptation

Traditional unsupervised domain adaptation copes with distribution shift by jointly optimizing a model on labeled source data and unlabeled target data, e.g., by designing a domain discriminator to learn domain-invariant features [20, 23, 37]. During training, these approaches often use a discrepancy loss [long2015learning] or adversarial training [17][18] to align the feature distributions of the two domains. More recently, to avoid access to the source data, some authors have proposed source-free unsupervised domain adaptation methods based on generative models [33, 21] or information maximization [21]. However, all of these methods optimize the model offline through multiple rounds of training.

5.2 Test-time Adaptation

Test-time adaptation (TTA) attempts to adapt the pre-trained model without access to the source data [10, 34, 32, 31, 15, 7, 2, 30]. In some papers, TTA is also referred to as Source-Free Unsupervised Domain Adaptation (SFUDA). TENT [26] uses entropy minimization to adjust the parameters of batch normalization layers and thereby increase model confidence during testing. Subsequent studies [27, 16, 4, 24] reduce error accumulation and catastrophic forgetting by fine-tuning the parameters and outputs at every iteration. For non-i.i.d. samples, on which most previous TTA methods fail, NOTE [6] presents Instance-Aware Batch Normalization (IABN) to normalize out-of-distribution samples and Prediction-Balanced Reservoir Sampling (PBRS) to simulate an i.i.d. data stream. RoTTA [35] presents a robust batch normalization scheme to estimate the normalization statistics, uses a memory bank to sample category-balanced data, and develops a time-aware re-weighting strategy with a teacher-student model. TTAB [38] presents a test-time adaptation benchmark for evaluating algorithms.

6 Conclusion

This paper presents a novel one-size-fits-all solution, Diversity Adaptive Test-Time Adaptation (DATTA), which adaptively selects appropriate batch normalization and back-propagation methods based on the scenario. It uses a Diversity Score-based evaluation of each data batch to dynamically adapt the BN method, enabling the model to achieve a more robust representation that adapts to both static and dynamic data patterns. DATTA incorporates Diversity Discrimination (DD), Diversity Adaptive Batch Normalization (DABN), and Diversity Adaptive Fine-Tuning (DAFT), which together help prevent unwanted and potentially harmful back-propagation. Experimental results validate the robustness and effectiveness of DATTA, demonstrating its ability to maintain stable model performance while adapting to changes in data flow patterns.

References

  • [1] Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., et al.: Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046 (2019)
  • [2] Chen, D., Wang, D., Darrell, T., Ebrahimi, S.: Contrastive Test-Time Adaptation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 295–305. IEEE, New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPR52688.2022.00039
  • [3] Choi, S., Jung, S., Yun, H., Kim, J.T., Kim, S., Choo, J.: Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11580–11590 (2021)
  • [4] Gan, Y., Bai, Y., Lou, Y., Ma, X., Zhang, R., Shi, N., Luo, L.: Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation (2023)
  • [5] Gong, T., Jeong, J., Kim, T., Kim, Y., Shin, J., Lee, S.J.: Note: Robust continual test-time adaptation against temporal correlation. Advances in Neural Information Processing Systems 35, 27253–27266 (2022)
  • [6] Gong, T., Jeong, J., Kim, T., Kim, Y., Shin, J., Lee, S.J.: NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation (2023)
  • [7] Goyal, S., Sun, M., Raghunathan, A., Kolter, Z.: Test-Time Adaptation via Conjugate Pseudo-labels
  • [8] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
  • [9] Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)
  • [10] Iwasawa, Y., Matsuo, Y.: Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization
  • [11] Kairouz, P., McMahan, H.B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A.N., et al.: Advances and open problems in federated learning. In: Advances in Neural Information Processing Systems. pp. 11769–11780 (2019)
  • [12] LEARNING, T.S.I.M.: Dataset shift in machine learning
  • [13] Liu, J., Yang, S., Jia, P., Zhang, R., Lu, M., Guo, Y., Xue, W., Zhang, S.: Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv preprint arXiv:2306.04344 (2023)
  • [14] Nado, Z., Padhy, S., Sculley, D., D’Amour, A., Lakshminarayanan, B., Snoek, J.: Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift (2021)
  • [15] Nguyen, A.T., Nguyen-Tang, T., Lim, S.N., Torr, P.H.: TIPI: Test Time Adaptation with Transformation Invariance. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24162–24171. IEEE, Vancouver, BC, Canada (2023). https://doi.org/10.1109/CVPR52729.2023.02314
  • [16] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., Tan, M.: Efficient Test-Time Model Adaptation without Forgetting
  • [17] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., Tan, M.: Efficient test-time model adaptation without forgetting. In: The International Conference on Machine Learning (2022)
  • [18] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., Tan, M.: Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400 (2023)
  • [19] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., Tan, M.: Towards Stable Test-Time Adaptation in Dynamic Wild World (2023)
  • [20] Pei, Z., Cao, Z., Long, M., Wang, J.: Multi-adversarial domain adaptation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
  • [21] Qiu, Z., Zhang, Y., Lin, H., Niu, S., Liu, Y., Du, Q., Tan, M.: Source-free domain adaptation via avatar prototype generation and adaptation. arXiv preprint arXiv:2106.15326 (2021)
  • [22] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: International conference on machine learning. pp. 5389–5400. PMLR (2019)
  • [23] Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3723–3732 (2018)
  • [24] Song, J., Lee, J., Kweon, I.S., Choi, S.: EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11920–11929. IEEE, Vancouver, BC, Canada (2023). https://doi.org/10.1109/CVPR52729.2023.01147
  • [25] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020)
  • [26] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully Test-time Adaptation by Entropy Minimization (2021)
  • [27] Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual test-time domain adaptation. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2022)
  • [28] Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual Test-Time Domain Adaptation (2022)
  • [29] Wang, W., Zhong, Z., Wang, W., Chen, X., Ling, C., Wang, B., Sebe, N.: Dynamically instance-guided adaptation: A backward-free approach for test-time domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24090–24099 (2023)
  • [30] Wu, C., Pan, Y., Li, Y., Wang, J.Z.: Learning to Adapt to Online Streams with Distribution Shifts (2023)
  • [31] Wu, Q., Yue, X., Sangiovanni-Vincentelli, A.: Domain-agnostic Test-time Adaptation by Prototypical Training with Auxiliary Data
  • [32] Yang, H., Chen, C., Jiang, M., Liu, Q., Cao, J., Heng, P.A., Dou, Q.: DLTTA: Dynamic Learning Rate for Test-Time Adaptation on Cross-Domain Medical Images. IEEE Transactions on Medical Imaging 41(12), 3575–3586 (2022). https://doi.org/10.1109/TMI.2022.3191535
  • [33] Yang, S., Wang, Y., Herranz, L., Jui, S., van de Weijer, J.: Casting a bait for offline and online source-free domain adaptation. Computer Vision and Image Understanding 234, 103747 (2023)
  • [34] You, F., Li, J., Zhao, Z.: Test-time batch statistics calibration for covariate shift (2021)
  • [35] Yuan, L., Xie, B., Li, S.: Robust Test-Time Adaptation in Dynamic Scenarios (2023)
  • [36] Yuan, L., Xie, B., Li, S.: Robust test-time adaptation in dynamic scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15922–15932 (2023)
  • [37] Zhang, Y., Hooi, B., Hong, L., Feng, J.: Test-agnostic long-tailed recognition by test-time aggregating diverse experts with self-supervision. arXiv preprint arXiv:2107.09249 (2021)
  • [38] Zhao, H., Liu, Y., Alahi, A., Lin, T.: On Pitfalls of Test-Time Adaptation (2023)