¹ Shenzhen Technology University, Shenzhen, China
² Tsinghua University, Shenzhen, China
Email: youngyorkye@gmail.com, 20210080214@stumail.sztu.edu.cn

DATTA: Towards Diversity Adaptive Test-Time Adaptation in Dynamic Wild World

Chuyang Ye*¹, Dongyan Wei*¹, Zhendong Liu¹, Yuanyi Pang¹, Yixi Lin¹, Jiarong Liao¹, Qinting Jiang², Xianghua Fu¹, Qing Li¹, Jingyan Jiang✉¹

Abstract

Test-time adaptation (TTA) effectively addresses distribution shifts between training and testing data by adjusting models on test samples, which is crucial for improving model inference in real-world applications. However, traditional TTA methods typically follow a fixed pattern when handling dynamic data patterns (low-diversity or high-diversity patterns), which often leads to performance degradation and consequently a decline in Quality of Experience (QoE). The primary issues we observed are: 1) Different scenarios require different normalization methods (e.g., Instance Normalization (IN) is optimal in mixed domains but not in static domains). 2) Model fine-tuning can potentially harm the model and waste time. Hence, it is crucial to design strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance. Based on these observations, this paper proposes a new general method, named Diversity Adaptive Test-Time Adaptation (DATTA), aimed at improving QoE. DATTA dynamically selects the best batch normalization methods and fine-tuning strategies by leveraging the Diversity Score to differentiate between high- and low-diversity batches. It features three key components: Diversity Discrimination (DD) to assess batch diversity, Diversity Adaptive Batch Normalization (DABN) to tailor normalization methods based on DD insights, and Diversity Adaptive Fine-Tuning (DAFT) to selectively fine-tune the model. Experimental results show that our method achieves up to a 21% increase in accuracy compared to state-of-the-art methodologies, indicating that it maintains good model performance while remaining robust. Our code will be released soon.

Keywords: Quality of Experience · Test-time Adaptation · Test-time Normalization · Domain Generalization · Domain Adaptation

* Equal Contribution
✉ Corresponding Authors

1 Introduction

Despite the considerable progress made with deep neural networks (DNNs), models trained on a source domain often experience a significant drop in performance when tested in a different environment (e.g., target domains) [12, 8, 3, 22]. Such changes in data distribution—caused by factors like different camera sensors, weather conditions, or geographic regions—lead to a decline in inference service performance, resulting in poor Quality of Experience (QoE) for users. This performance degradation can even lead to critical failures, especially in high-stakes applications such as autonomous driving and mobile healthcare [1, 11]. To address this issue, Test-Time Adaptation (TTA) seeks to adapt models online without the source datasets and ground truth labels of test data streams [26].

Existing TTA methods typically involve two steps: 1) (Re-)correcting Batch Normalization Statistics: Various batch normalization techniques are used to adjust batch normalization statistics. Examples include Source Batch Normalization (SBN) [9], Test-Time Batch Normalization (TBN) [26], and methods based on Instance Normalization (IN) [6, 29]. 2) Fine-tuning Model Parameters: This can be done through partial updating optimization (adjusting affine parameters of models using self-supervision losses, such as entropy loss [26, 6, 35]) or fully backward optimization (adjusting all parameters of models [27]).

However, previous TTA studies have mainly focused on static data streams, where the test data stream changes slightly and test samples within a batch are drawn from one data distribution, referred to as the low-diversity pattern [26, 28, 6, 35]. In real-world applications, by contrast, the test data stream is often dynamic: within a batch, the test samples can come from one or multiple different data sources, referred to as the high-diversity pattern. This pattern poses significant challenges for maintaining the QoE of intelligent services, as traditional TTA methods may not be robust enough to handle such scenarios effectively.

Traditional TTA methods with fixed strategies struggle to address dynamic data streams characterized by high-diversity patterns. As analyzed in Sec. 2.2, our measurements reveal several key insights:

• Specific Batch Normalization techniques are inadequate for dynamic data patterns. When test samples are of low diversity, the use of TBN to correct SBN statistics can enhance performance. However, when test samples have high diversity, IN proves more effective in handling diverse data distributions.

• Back-propagation can be a double-edged sword in dynamic data patterns. For test samples with high diversity, the back-propagation process can significantly decrease accuracy. Conversely, this process can improve accuracy when data are sampled with low diversity.

Figure 1: DATTA overview. DATTA consists of three modules: DD takes advantage of an Instance-Normalization-guided projection to capture the data features. Based on the discrimination results, DABN and DAFT conduct adaptive BN re-correction and model fine-tuning, respectively.

Therefore, it is crucial to design strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance. Motivated by these observations, we propose a one-size-fits-all approach called Diversity Adaptive Test-Time Adaptation (DATTA) (see Fig. 1). The main idea of DATTA is to distinguish between high- and low-diversity batches by calculating the diversity score of each data batch in dynamic scenarios. By adaptively adjusting the batch normalization method and fine-tuning model parameters according to the characteristics of the data batches, DATTA enhances model robustness. Our DATTA includes three key components: Diversity Discrimination (DD), Diversity Adaptive Batch Normalization (DABN), and Diversity Adaptive Fine-Tuning (DAFT).

In the DD component, we compute a few statistics for the Diversity Score in test batch data to identify high and low diversity batches. In DABN, we introduce a dynamically aggregated batch normalization that considers SBN, TBN, and IN based on the result of DD, enabling the model to obtain a more robust representation. In DAFT, the model dynamically selects data batches for fine-tuning based on the diversity score to prevent error accumulation and crashes. Our contributions can be summarized as follows:

• Effectiveness. We propose a one-size-fits-all approach that uses a diversity score based on angle variance to differentiate between various scenarios. Our DABN empowers the model to achieve a more robust representation, enabling it to adapt effectively in both low- and high-diversity patterns. Moreover, our method circumvents unnecessary or even harmful model fine-tuning, paving the way for further enhancements. Experiments on benchmark datasets demonstrate robust performance compared to state-of-the-art studies, with up to a 21% increase in accuracy.

• Efficiency. We introduce a lightweight distribution discriminator module that can be executed within a single forward propagation. Our method can transition to a backward-free method in high-diversity data patterns, thereby considerably reducing computational expenses.

• Promising Results. We conduct experiments on mainstream shift datasets and show that our method demonstrates robustness while maintaining good performance of the model. It can effectively respond to data stream patterns, and the selective model fine-tuning approach is more lightweight. Empirical evaluations on benchmark datasets indicate a substantial improvement in performance, achieving up to a 21% increase in accuracy compared to state-of-the-art methodologies.

2 Background and Motivation

2.1 Revisiting TTA

Test-time Adaptation. Let $\mathcal{D}_{\mathcal{S}}=\left\{\mathcal{X}^{\mathcal{S}},\mathcal{Y}\right\}$ denote the source domain data and $\mathcal{D}_{\mathcal{T}}=\left\{\mathcal{X}^{\mathcal{T}},\mathcal{Y}\right\}$ denote the target domain data. Each data instance and corresponding label pair $\left(\mathbf{x}_{i},y_{i}\right)\in\mathcal{X}^{\mathcal{S}}\times\mathcal{Y}$ in the source domain follows a distribution $P_{\mathcal{S}}(\mathbf{x},y)$. Similarly, each target test sample and its label at test time $t$, $\left(\mathbf{x}_{t},y_{t}\right)\in\mathcal{X}^{\mathcal{T}}\times\mathcal{Y}$, follow a distribution $P_{\mathcal{T}}(\mathbf{x},y)$, with $y_{t}$ unknown to the learner. The standard covariate shift assumption in domain adaptation is $P_{\mathcal{S}}(\mathbf{x})\neq P_{\mathcal{T}}(\mathbf{x})$ and $P_{\mathcal{S}}(y\mid\mathbf{x})=P_{\mathcal{T}}(y\mid\mathbf{x})$. Unlike traditional domain adaptation, which uses pre-collected $\mathcal{D}_{\mathcal{S}}$ and $\mathcal{X}^{\mathcal{T}}$, TTA continuously adapts a model $f_{\theta}(\cdot)$ pre-trained on $\mathcal{D}_{\mathcal{S}}$ using only the test sample obtained at time $t$.

TTA on dynamic streams. Previous TTA methods typically assume that at each time $t$, each target sample $\left(\mathbf{x}_{t},y_{t}\right)\in\mathcal{X}^{\mathcal{T}}\times\mathcal{Y}$ follows a time-invariant distribution $P_{\mathcal{T}}(\mathbf{x},y)$, denoted as the low-diversity pattern. However, in many real-world scenarios, the data obtained at test time are dynamic and come from multiple sources. Specifically, the data may be drawn from one or multiple distributions $\{P_{\mathcal{T}}^{i}\}_{i=1}^{M}$, denoted as the high-diversity pattern.
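To make the two patterns concrete, the following minimal sketch draws a low-diversity batch (one domain) and a high-diversity batch (several domains) from a pool of target domains. The function name `sample_batch`, the dictionary layout, and the uniform sampling over domains are illustrative assumptions, not the benchmark's actual data loader.

```python
import random


def sample_batch(domains, num_domains, batch_size, rng):
    """Draw a batch whose samples come from `num_domains` target domains.

    `domains` maps a domain name to a list of samples (hypothetical layout).
    num_domains == 1 gives a low-diversity batch; num_domains > 1 gives a
    high-diversity batch in the sense of Sec. 2.1.
    """
    chosen = rng.sample(sorted(domains), num_domains)
    batch = []
    for _ in range(batch_size):
        name = rng.choice(chosen)  # assumption: domains are picked uniformly
        batch.append((name, rng.choice(domains[name])))
    return batch


rng = random.Random(0)
domains = {"fog": [1, 2, 3], "snow": [4, 5, 6], "blur": [7, 8, 9]}
low = sample_batch(domains, 1, 8, rng)    # every sample from one domain
high = sample_batch(domains, 3, 64, rng)  # samples mixed across domains
```

Varying `num_domains` from 1 upward is one simple way to reproduce the increasing-diversity setting discussed in Sec. 2.2.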

2.2 Motivation Observations

In this section, we explore the performance of current TTA methods in dynamic scenarios. Our findings indicate that traditional static Batch Normalization (BN) designs and static fine-tuning methods are inadequate for adapting to these dynamic environments. Additionally, we investigate how increasing diversity in data distributions in dynamic scenarios exacerbates the impact on performance.

Observation 1: No one-size-fits-all BN method for different data diversity patterns. We evaluate conventional BN statistics adaptation in TTA methods, including SBN, TBN, BN stats [34] (denoted $\alpha$-BN: a combination of TBN and SBN where a larger $\alpha$ indicates greater participation of SBN), and Instance Aware Batch Normalization (IABN) [6], under different data diversity patterns without a backward process.

As shown in Fig. 2(a), in the high-diversity pattern, all BN methods experience significant performance drops, and methods that excel in low-diversity settings cannot maintain their effectiveness. For instance, $\alpha$-BN with $\alpha=0.6$, which performs well in the low-diversity pattern, sees a decrease in accuracy of over 10%, highlighting the challenge of maintaining performance across diverse data distributions. Meanwhile, IABN stands out in high-diversity scenarios, demonstrating superior performance with an accuracy of approximately 66.63%. This suggests that when test samples come from multiple domains, both IN and SBN are needed to correct the BN statistics effectively.

Observation 2: Fine-tuning in high-diversity patterns could potentially harm the model. We compare the accuracy of model fine-tuning for CoTTA [27], TENT [25], NOTE [5], SAR [18], RoTTA [36], and ViDA [13] under low-diversity and high-diversity patterns. As shown in Fig. 2, when test samples originate from multiple distributions, fine-tuning with NOTE can reduce accuracy by over $3\%$, and fine-tuning with TENT reduces it by nearly $2\%$. A potential reason is the accumulation of errors caused by erroneous pseudo-labels, leading to model collapse.

Challenge: How to measure distribution diversity and mitigate its impact on performance? Distribution diversity is defined as the number of different domains from which test data within a single batch are drawn at a given test time. The more distributions present, the greater the diversity. Fig. 2(c) illustrates that increasing the number of domains generally leads to a decrease in accuracy across all methods. For example, both SAR and TENT exhibit significant performance declines as the number of domains increases: SAR drops from 75% to 45%, and TENT falls from 70% to 50%.

To address this challenge, it is crucial to develop strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance.

3 Proposed Methods

Based on the above analysis, the key challenge lies in distinguishing different diversity patterns. To address this, we introduce a diversity discrimination module, detailed in Sec. 3.1.2, which effectively indicates the degree of diversity, as illustrated in Fig. 2(d). With a diversity score provided by this module, we can intuitively adjust the BN statistics in various ways and design adaptive fine-tuning mechanisms accordingly, detailed in Sec. 3.2 and Sec. 3.3.

3.1 Diversity Discrimination (DD)

3.1.1 Diversity Score.

We evaluate batch data distribution diversity by measuring the angle dispersion between the feature map and the TBN mean statistics, using the SBN mean as the reference point. This approach captures how much batch data deviates from the source domain distribution, with greater angle dispersion indicating higher diversity.

Let $f$ denote the feature map generated by the model’s first convolutional layer; each activation value in $f$ represents the response of a specific filter to a local region of the input image. We denote the mean of the feature values in test-time batch normalization by $\mu_{\text{test}}$ and the mean of the feature values during the training of the source model by $\mu_{\text{source}}$.

To quantify this, we introduce the following definitions:

Definition 1.

Discrepancy Angle: The data discrepancy angle $\theta$ quantifies the difference between the feature vector $v_{f}$ and the test distribution vector $v_{t}$. It is defined as:

$$\theta=\cos^{-1}\left(\frac{v_{f}\cdot v_{t}}{\|v_{f}\|\|v_{t}\|}\right).$$

Here, the feature vector $v_{f}$ represents the difference between the source domain mean and the feature map $f$: $v_{f}=\mu_{\text{source}}-f$. Similarly, the test distribution vector $v_{t}$ is defined as the difference between the source domain mean and the test-time batch mean: $v_{t}=\mu_{\text{source}}-\mu_{\text{test}}$.

The diversity score $S$ is defined as the variance of the angles $\theta$ within each batch. It is calculated as follows:

$$\mathit{S}=\frac{1}{N}\sum_{i=1}^{N}(\theta_{i}-\overline{\theta})^{2}, \tag{1}$$

where $\overline{\theta}$ is the mean of all calculated angles $\theta_{i}$ within the batch, and $N$ is the number of samples in the batch.

This method allows us to effectively measure the diversity of data distribution in batch processing, providing a robust metric for analysis without significant computational costs.
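The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes each sample's first-layer feature map has already been reduced to a flat vector (for example, per-channel means), and the names `discrepancy_angle` and `diversity_score` are hypothetical.

```python
import math


def discrepancy_angle(feature, mu_source, mu_test):
    """Angle between v_f = mu_source - f and v_t = mu_source - mu_test."""
    v_f = [s - x for s, x in zip(mu_source, feature)]
    v_t = [s - t for s, t in zip(mu_source, mu_test)]
    dot = sum(a * b for a, b in zip(v_f, v_t))
    norm = math.sqrt(sum(a * a for a in v_f)) * math.sqrt(sum(b * b for b in v_t))
    if norm == 0.0:
        return 0.0  # assumption: a degenerate (zero) vector contributes no angle
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos)


def diversity_score(features, mu_source):
    """Variance of the per-sample discrepancy angles within one batch (Eq. 1)."""
    n = len(features)
    dim = len(mu_source)
    # Test-time batch mean, computed per feature dimension.
    mu_test = [sum(f[d] for f in features) / n for d in range(dim)]
    angles = [discrepancy_angle(f, mu_source, mu_test) for f in features]
    mean_angle = sum(angles) / n
    return sum((a - mean_angle) ** 2 for a in angles) / n


mu_source = [0.0, 0.0]
low = diversity_score([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]], mu_source)
high = diversity_score([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]], mu_source)
```

In this toy example the tightly clustered batch yields a much smaller score than the scattered one, which is the behaviour the diversity score is meant to capture.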

3.1.2 Adaptive Discrimination with Diversity Score.

The adaptive discrimination with diversity scores is designed to dynamically distinguish between high-diversity and low-diversity batches in a data stream using the diversity score. This mechanism includes a module called the Diversity Cache, which collects diversity scores during test time. At each time step $t$ , the Diversity Cache stores the diversity score $\mathit{S}_{t}$ of the current test data samples and calculates the diversity threshold $Q_{t}$ dynamically.

The diversity threshold $Q_{t}$ at test time $t$ is calculated as follows:

$$Q_{t}=P_{\lambda}(\{\mathit{S}_{1},\mathit{S}_{2},\ldots,\mathit{S}_{t}\}), \tag{2}$$

where $P_{\lambda}$ denotes the $\lambda$-th percentile function.

In practice, during the initial stages of test-time adaptation, the diversity cache begins by collecting diversity scores from the data stream over a period denoted as $T_{\text{init}}$ . This cold start phase provides a preliminary assessment of the data distribution for the current service. The diversity scores gathered during this period are utilized to compute an initial diversity threshold, which is instrumental in distinguishing between high-diversity and low-diversity batches within the data stream. After $T_{\text{init}}$ , the diversity cache continues to collect diversity scores at each step and dynamically updates the diversity threshold using these scores. This continuous update allows the system to flexibly adjust the identification of high-diversity and low-diversity batches based on real-time data.
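The cold start and the running threshold can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `DiversityCache`, the linear-interpolation percentile, the choice to return no decision during the cold start, and treating $S_{t}\geq Q_{t}$ as high diversity are all illustrative choices that the text does not fix.

```python
import math


class DiversityCache:
    """Stores diversity scores and derives the percentile threshold Q_t (Eq. 2)."""

    def __init__(self, percentile=50.0, t_init=10):
        self.percentile = percentile  # lambda in Eq. 2 (value is an assumption)
        self.t_init = t_init          # length of the cold-start phase
        self.scores = []

    def threshold(self):
        """lambda-th percentile of all stored scores (linear interpolation)."""
        ordered = sorted(self.scores)
        rank = (len(ordered) - 1) * self.percentile / 100.0
        lo, hi = math.floor(rank), math.ceil(rank)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

    def update(self, score):
        """Store S_t and return (Q_t, is_high_diversity).

        During the cold start, (None, None) is returned because no
        threshold is available yet (an assumption of this sketch).
        """
        self.scores.append(score)
        if len(self.scores) < self.t_init:
            return None, None
        q = self.threshold()
        return q, score >= q


cache = DiversityCache(percentile=50.0, t_init=3)
first = cache.update(0.1)
second = cache.update(0.3)
q3, high3 = cache.update(0.2)   # threshold becomes available here
q4, high4 = cache.update(0.05)  # threshold is refreshed with the new score
```

Because the threshold is recomputed from the full history at every step, a long-running stream may prefer a bounded window; the text does not specify one, so none is used here.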

3.2 Diversity Adaptive Batch Normalization (DABN)

As outlined in Sec. 2.2, high diversity scores indicate significant variability, making it difficult for methods suited to low-diversity data to normalize feature maps using test-time statistics. Conversely, low diversity scores suggest a more concentrated data distribution, where strong corrections using instance normalization statistics can hinder normalization, and such over-correction can fail to remove uninformative variations.

To address these issues, we propose DABN, which effectively manages varying diversity scores. DABN reduces excessive corrections of BN statistics in low diversity score batches while maintaining robust performance in high diversity score batches. This method also mitigates the issue of internal covariate shift, where modified BN layer outputs diverge from the model’s outputs trained on the source domain. DABN incorporates BN statistics ${\mu}_{source}$ and ${\sigma}^{2}_{source}$ from extensive source domain training into the prediction process. By applying different correction strategies based on the diversity score of the data batch, DABN minimizes internal covariate shifts, thereby improving prediction accuracy.

Drawing from insights in IABN [6], we assume that the sample mean and sample variance follow a sampling distribution with a sample size of $L$, represented by a normal distribution. The variances of the sample mean $s_{\mu}$ and sample variance $s_{\sigma}$ are given by:

$$s_{\mu}=\frac{\sigma_{source}^{2}}{L},\quad s_{\sigma}=\frac{2\sigma_{source}^{4}}{L-1}. \tag{3}$$

In high-diversity score batches, DABN adjusts the instance normalization statistics $\mu_{instance}$ and $\sigma^{2}_{instance}$ to align with the source domain batch normalization statistics $\mu_{source}$ and $\sigma^{2}_{source}$. For low-diversity score batches, DABN primarily relies on the current batch’s batch normalization statistics $\mu_{test}$ and $\sigma^{2}_{test}$ to adapt to the current data distribution while mitigating internal covariate shift. Specifically, we use the following statistics:

$$\mu_{DABN}=\mu_{source}+\alpha\cdot\psi(\mu_{instance};\mu_{test};\mu_{source};\kappa s_{\mu}), \tag{4}$$

$$\sigma_{DABN}^{2}=\sigma_{source}^{2}+\alpha\cdot\psi(\sigma_{instance}^{2};\sigma_{test}^{2};\sigma_{source}^{2};\kappa s_{\sigma}), \tag{5}$$

where the function $\psi$ is used to adjust the alignment between the instance and source statistics based on the diversity score $\mathit{S}_{t}$ .

$$\psi(x;y;z;\kappa)=\begin{cases}x-z-\kappa,&\text{if }x-z>\kappa\text{ and }\mathit{S}_{t}\geq Q_{t},\\ x-z+\kappa,&\text{if }x-z<-\kappa\text{ and }\mathit{S}_{t}\geq Q_{t},\\ 0,&\text{if }|x-z|\leq\kappa\text{ and }\mathit{S}_{t}\geq Q_{t},\\ y-z,&\text{if }\mathit{S}_{t}<Q_{t}.\end{cases} \tag{6}$$

Here, $\alpha$ is a hyperparameter of DABN determining the adjustment level of current batch information, and $\kappa$ determines the confidence level of source domain statistics.

In summary, DABN is described as follows:

$$\textbf{DABN}:=\gamma\cdot\frac{\mathbf{f}-\mu_{\text{DABN}}}{\sqrt{\sigma_{\text{DABN}}^{2}+\epsilon}}+\beta. \tag{7}$$
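A scalar, single-channel sketch of Eqs. (3) to (7) is given below. It is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the high-diversity branch is read as the soft-shrinkage used by IABN (zero correction when the instance statistic lies within $\kappa$ of the source statistic, with the lower branch at $-\kappa$), and the defaults $\alpha=0.2$ and $\kappa=4$ are taken from the experimental setup.

```python
import math


def psi(x, y, z, kappa, high_diversity):
    """Correction term of Eq. 6, added to the source statistic z."""
    if not high_diversity:
        return y - z       # low diversity: follow the test-batch statistic
    diff = x - z           # high diversity: shrink the instance/source gap
    if diff > kappa:
        return diff - kappa
    if diff < -kappa:      # assumption: symmetric soft-shrinkage as in IABN
        return diff + kappa
    return 0.0


def dabn_statistics(mu_inst, var_inst, mu_test, var_test, mu_src, var_src,
                    sample_size, high_diversity, alpha=0.2, kappa=4.0):
    """Return (mu_DABN, var_DABN) for one channel (Eqs. 3 to 5)."""
    s_mu = var_src / sample_size                        # Eq. 3, sample mean
    s_sigma = 2.0 * var_src ** 2 / (sample_size - 1)    # Eq. 3, sample variance
    mu = mu_src + alpha * psi(mu_inst, mu_test, mu_src,
                              kappa * s_mu, high_diversity)
    var = var_src + alpha * psi(var_inst, var_test, var_src,
                                kappa * s_sigma, high_diversity)
    return mu, var


def dabn(f, mu, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation with the DABN statistics (Eq. 7)."""
    return gamma * (f - mu) / math.sqrt(var + eps) + beta


# Low-diversity batch: statistics move towards the test-batch values.
mu_low, var_low = dabn_statistics(5.0, 3.0, 1.0, 2.0, 0.0, 1.0, 16, False)
# High-diversity batch: statistics move towards the shrunk instance values.
mu_high, var_high = dabn_statistics(5.0, 3.0, 1.0, 2.0, 0.0, 1.0, 16, True)
```

With $\alpha<1$ the source statistics always dominate, which matches the stated goal of limiting internal covariate shift.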

3.3 Diversity Adaptive Fine-Tuning (DAFT)

After updating the BN layer’s statistical values, the model’s affine parameters must be adjusted accordingly. However, not all updates are effective or reliable. Our experiments and analysis indicate that parameter updates are ineffective and potentially detrimental when data comes from high-diversity score batches. Therefore, updates should be applied only when the batch data has a low diversity score to avoid wasteful and harmful adjustments. Following this principle, the model updates parameters exclusively for low-diversity score batches.

The loss function is defined as follows:

$$\mathcal{L}=\mathbb{I}_{\left\{S_{t}<Q_{t}\right\}}\operatorname{Ent}_{\boldsymbol{\theta}}(\mathbf{x}), \tag{8}$$

where $\operatorname{Ent}_{\boldsymbol{\theta}}(\mathbf{x})$ is the entropy loss of the model's prediction, $\mathbf{x}$ is the model input, and $\mathbb{I}_{\left\{S_{t}<Q_{t}\right\}}$ is the indicator function that equals 1 if the diversity score $S_{t}$ is below the threshold $Q_{t}$, and 0 otherwise. This ensures that parameter updates are only performed when the batch data has a low diversity score, thereby avoiding wasteful and potentially harmful updates when the diversity score is high.
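The gating can be sketched as below, following the stated principle that only low-diversity batches trigger an update. This is a minimal sketch with hypothetical names: it assumes the loss is the mean prediction entropy over the batch and signals a skipped update by returning `None`, so that the caller can omit the backward pass entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy of the softmax prediction for one sample."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def daft_loss(batch_logits, score, threshold):
    """Mean entropy for a low-diversity batch; None means skip the update.

    Assumption: a batch with score >= threshold counts as high diversity.
    """
    if score >= threshold:
        return None
    total = sum(prediction_entropy(logits) for logits in batch_logits)
    return total / len(batch_logits)


skipped = daft_loss([[0.0, 0.0]], score=0.9, threshold=0.5)
loss = daft_loss([[0.0, 0.0], [0.0, 0.0]], score=0.1, threshold=0.5)
```

Returning `None` rather than a zero loss makes the backward-free behaviour on high-diversity batches explicit.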

4 Experiments

4.1 Experimental Setup

We implemented the proposed method DATTA and the baselines on the TTAB framework [38]. Detailed deployment information, including hyperparameter settings for each baseline, the datasets used, and the software and hardware environment, is provided below.

Environment. The experiments mentioned in this article were carried out utilizing an NVIDIA GeForce RTX 4090 GPU. The experimental code was developed using PyTorch 1.10.1 and Python 3.9.7.

Hyperparameter Configurations. The hyperparameters are divided into two categories: those shared by all baselines and those specific to each method. 1) The shared hyperparameters for model adaptation are as follows: the optimizer used is SGD, the learning rate (LR) is $0.0001$, and the batch size for all test-time adaptations is set to $64$. After the test data is input into the model, all data is first forwarded once to obtain the inference result. 2) The hyperparameters specific to each method are set according to the following references: the hyperparameters for TBN follow the settings in [26]; the hyperparameters for IABN are based on the settings in [6]; and the hyperparameters for $\alpha$-BN also follow the settings in [26]. For DABN, $\alpha$ is a hyperparameter that determines the adjustment level based on the current batch information, and it is set to 0.2 in our experiments. Additionally, $\kappa$ determines the confidence level of the source domain statistics and is set to 4.

Baselines. We compare our method with various cutting-edge TTA methods. Source assesses the model trained on source data directly on target data without adaptation. BN Stats [14] combines TBN and SBN statistics for updated BN layer statistics. TENT [26] minimizes prediction entropy to boost model confidence, estimates normalization statistics, and updates channel-wise affine transformations online. EATA [17] filters high-entropy samples and uses a Fisher regularizer to stabilize updates and prevent catastrophic forgetting. CoTTA [27] uses weight and augmentation averaging to reduce errors and randomly resets neurons to pre-trained weights to retain knowledge. NOTE [6] corrects normalization for out-of-distribution samples and simulates an i.i.d. data stream from a non-i.i.d. stream. SAR [19] selectively minimizes entropy by excluding noisy samples and optimizing entropy and surface sharpness for stability. RoTTA [35] simulates an i.i.d. data stream by constructing a sampling pool and adapting BN layer statistics. ViDA [13] decomposes features into high-rank and low-rank components for knowledge sharing. We assume the source data is inaccessible during TTA. The model is continuously updated online, without modifying BN during training. Following [38, 6], we use a test batch size of 64 and perform a single adaptation epoch, with method-specific hyperparameters as reported in their papers or official implementations [38].

Datasets. We use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets [8] from the TTA benchmark (TTAB) [38] to evaluate model robustness against corruptions. These datasets include 15 corruption types (e.g., Gaussian noise, shot noise, impulse noise, defocus blur) with 5 severity levels each. CIFAR-10-C and CIFAR-100-C are small-scale datasets with 10 and 100 classes respectively, containing 10,000 images per corruption type. ImageNet-C is a large-scale dataset with 1,000 classes and 50,000 images per corruption type.

Scenarios. In our experiments, we use four scenarios: Dynamic, Dynamic-S, Non-I.I.D., and Multi-non-I.I.D. In the Dynamic scenario, each batch of input samples is composed of data from several different distributions, leading to high diversity, while the samples remain independent and identically distributed (i.i.d.). In the Dynamic-S scenario, the input samples within each batch are i.i.d. but come either from multiple domains (resulting in high diversity) or from a single domain (resulting in low diversity). In the Non-I.I.D. scenario, we mix data from 15 domains into a single stream, so that the domain of the test samples is unstable and changes in real time, while each batch itself remains low-diversity. The Multi-non-I.I.D. scenario, similar to Dynamic, uses batches composed of data from multiple distributions; however, like Non-I.I.D., it mixes data from 15 different domains, resulting in an unstable and dynamically changing domain for the test samples.

Tab. 1: Comparison of state-of-the-art methods on CIFAR10-C, CIFAR100-C, and ImageNet-C at severity level 5 with a Batch Size of 64 under Dynamic and Dynamic-S scenarios, evaluated by Accuracy (%). The bold value signifies the best result, and the second best accuracy is underlined.

		Dynamic				Dynamic-S
Method	Venue	CIFAR10-C	CIFAR100-C	ImageNet-C	Avg. $\uparrow$	CIFAR10-C	CIFAR100-C	ImageNet-C	Avg. $\uparrow$	Avg-All $\uparrow$
Source	CVPR’16	57.39	28.59	25.96	37.31	57.38	28.58	25.77	37.24	37.28
BN Stats	ICLR’21	62.41	33.23	22.37	39.33	69.86	41.62	25.42	45.63	42.48
TENT	CVPR’21	63.05	32.48	18.47	38.00	71.58	41.13	23.71	45.47	41.73
EATA	ICML’22	59.97	34.52	19.35	37.94	67.97	41.95	23.05	44.33	41.13
NOTE	NIPS’22	61.48	31.91	20.56	37.98	66.58	24.77	21.32	37.55	37.77
CoTTA	CVPR’22	48.50	20.26	5.43	24.73	50.31	22.01	25.29	32.54	28.63
SAR	ICLR’23	62.60	31.80	18.42	37.60	71.03	40.78	27.73	46.51	42.06
RoTTA	CVPR’23	48.96	21.88	22.03	30.95	49.21	24.22	23.90	32.44	31.70
ViDA	ICLR’24	59.94	30.75	18.78	36.49	67.95	39.47	10.20	39.21	37.85
Ours	Proposed	69.55	36.24	25.05	43.61	72.33	40.83	28.78	47.31	45.46

Tab. 2: Comparison of latency (s) for processing CIFAR-10-C, CIFAR-100-C, and ImageNet-C using a single RTX4090 GPU on ResNet-50.

Method	Venue	CIFAR10-C	CIFAR100-C	ImageNet-C
Source	CVPR’16	0.007	0.007	0.051
BN Stats	ICLR’21	0.018	0.018	0.068
TENT	CVPR’21	0.068	0.070	0.169
EATA	ICML’22	0.060	0.070	0.170
NOTE	NIPS’22	2.142	2.190	1.896
CoTTA	CVPR’22	0.543	0.541	5.322
SAR	ICLR’23	0.094	0.094	0.295
RoTTA	CVPR’23	0.297	0.297	0.603
ViDA	ICLR’24	0.532	0.530	5.236
Ours	Proposed	0.029	0.029	0.074

Tab. 3: Comparison of state-of-the-art methods on CIFAR10-C, CIFAR100-C, and ImageNet-C at severity level 5 with a Batch Size of 64 under Non-I.I.D. and Multi-non-I.I.D. scenarios, evaluated by Accuracy (%). The bold value signifies the best result, and the second best accuracy is underlined.

		Non-I.I.D.				Multi-non-I.I.D.
Method	Venue	CIFAR10-C	CIFAR100-C	ImageNet-C	Avg. $\uparrow$	CIFAR10-C	CIFAR100-C	ImageNet-C	Avg. $\uparrow$	Avg-All $\uparrow$
Source	CVPR’16	57.39	28.59	25.77	37.25	57.39	28.59	25.84	37.27	37.26
BN Stats	ICLR’21	27.32	13.11	22.46	20.96	24.84	18.44	18.49	20.59	20.77
TENT	CVPR’21	24.40	11.69	22.89	19.66	20.13	14.92	18.13	17.72	18.69
EATA	ICML’22	27.43	5.33	24.23	18.99	24.80	9.97	20.51	18.42	18.71
NOTE	NIPS’22	64.98	26.29	11.67	34.31	63.24	24.75	10.26	32.74	33.53
CoTTA	CVPR’22	20.08	8.73	9.04	12.61	19.11	13.22	5.44	12.58	12.60
SAR	ICLR’23	24.78	9.36	23.11	19.08	20.43	14.00	18.35	17.59	18.33
RoTTA	CVPR’23	57.47	37.03	26.58	40.36	40.09	22.24	22.26	28.19	34.27
ViDA	ICLR’24	27.50	12.92	22.53	20.98	24.77	18.22	18.55	20.51	20.74
Ours	Proposed	69.16	35.50	29.14	44.60	67.83	35.47	24.74	42.68	43.64

4.2 Experimental Results and Analysis under Different Scenarios

We performed experiments in four separate scenarios: Dynamic, Dynamic-S, Non-I.I.D., and Multi-non-I.I.D. In alignment with the configurations used in prior studies, we selected the most severely corrupted samples (level 5) from each type of corruption.

Tab. 1 displays the performance of various TTA methods in the Dynamic and Dynamic-S scenarios. Our approach achieves the best average accuracy in both scenarios. In the Dynamic scenario, our average accuracy (43.61%) is approximately 19 percentage points higher than the lowest baseline (CoTTA, 24.73%) and about 4 points higher than the strongest baseline (BN Stats, 39.33%). This indicates that our method has inherent strengths in managing batch data with multiple distributions. In the Dynamic-S scenario, our average accuracy (47.31%) remains the highest, ahead of SAR (46.51%). Averaged over both scenarios (Avg-All), our accuracy (45.46%) is around 17 points higher than the lowest baseline (CoTTA, 28.63%) and approximately 3 points higher than the strongest baseline (BN Stats, 42.48%). This underscores the effectiveness of our method in handling both static and dynamic patterns.

Tab. 2 compares the latency of state-of-the-art methods under Dynamic and Dynamic-S scenarios. Our method shows competitive efficiency. For CIFAR10-C, our method’s latency is 0.029 seconds, significantly lower than NOTE (2.142 seconds) and CoTTA (0.543 seconds). Although Source (0.007 seconds) and BN Stats (0.018 seconds) are faster, our method remains within an acceptable range for practical use. For CIFAR100-C, our method again has a latency of 0.029 seconds, much lower than NOTE (2.190 seconds) and CoTTA (0.541 seconds), while Source (0.007 seconds) and BN Stats (0.018 seconds) show lower latencies. For ImageNet-C, our method achieves a latency of 0.074 seconds, substantially lower than CoTTA (5.322 seconds) and ViDA (5.236 seconds). Only Source (0.051 seconds) and BN Stats (0.068 seconds) are faster, so our method has lower latency than all remaining baselines. These results demonstrate that our method provides a strong balance between low latency and high performance, making it suitable for real-time applications where both efficiency and accuracy are crucial. This balance enhances QoE for end-users.

Tab. 3 presents the accuracy comparison of TTA methods under Non-I.I.D. and Multi-non-I.I.D. scenarios. In the Non-I.I.D. scenario, our method achieves the highest average accuracy of 44.60%, which is approximately 4 percentage points higher than the second-best method (RoTTA) with an average accuracy of 40.36%. Specifically, our method achieves the highest accuracy on CIFAR10-C (69.16%) and ImageNet-C (29.14%), and the second highest accuracy on CIFAR100-C (35.50%). These results indicate a significant improvement in handling domain instability and low diversity within each batch. In the Multi-non-I.I.D. scenario, our method also outperforms the other methods with an average accuracy of 42.68%, which is about 5 points higher than the second highest result (Source) with an average accuracy of 37.27%. Our method shows the highest accuracy on CIFAR10-C (67.83%) and CIFAR100-C (35.47%), and the second highest accuracy on ImageNet-C (24.74%). Across both scenarios, our method achieves an overall average accuracy of 43.64%, which is about 6 points higher than the overall second-best method (Source) with an average accuracy of 37.26%. These results demonstrate that our method not only effectively handles dynamically changing and mixed-domain data but also excels in Non-I.I.D. scenarios.

Tab. 4: Comparison of different modules’ performance on the CIFAR10-C dataset (severity level 5) with a batch size of 64, evaluated by accuracy (%). Each configuration was tested with a ResNet-50 model under the Dynamic, Dynamic-S, Non-I.I.D. and Multi-non-I.I.D. scenarios. The highest accuracy for each scenario is highlighted in bold.

Method	Dynamic	Dynamic-S	Non-I.I.D.	Multi-non-I.I.D.	Avg. $\uparrow$
Source	57.39	57.39	57.39	57.39	57.39
DAFT	57.99	63.02	54.46	53.10	57.14
DABN	62.45	69.90	41.59	33.79	51.93
DAFT+DABN	69.55	72.33	67.51	63.85	68.31

4.3 Ablation Study

To evaluate the contributions of the individual modules, we conducted experiments on the CIFAR10-C dataset with severity level 5, using a batch size of 64. Performance was assessed with a ResNet-50 model across four scenarios: Dynamic, Dynamic-S, Non-I.I.D., and Multi-non-I.I.D. The results, measured by accuracy (%), are summarized in Tab. 4. Combining the DAFT and DABN modules achieves the highest accuracy in all scenarios, with an average accuracy of 68.31%, which is 10.92 percentage points above Source (57.39%). By contrast, each module on its own falls below Source in the Non-I.I.D. and Multi-non-I.I.D. scenarios. This indicates that integrating both modules is what makes the method robust under diverse conditions.
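The ablation figures can be checked with a few lines of arithmetic. The sketch below recomputes the row averages of Tab. 4 and the per-scenario change relative to Source; the row labels (`fine-tuning only`, `DABN only`, `both modules`) are descriptive placeholders chosen for this example.

```python
scenarios = ["Dynamic", "Dynamic-S", "Non-I.I.D.", "Multi-non-I.I.D."]

# Accuracy (%) per scenario, copied row by row from Tab. 4.
accuracy = {
    "source": [57.39, 57.39, 57.39, 57.39],
    "fine-tuning only": [57.99, 63.02, 54.46, 53.10],
    "DABN only": [62.45, 69.90, 41.59, 33.79],
    "both modules": [69.55, 72.33, 67.51, 63.85],
}

# Row averages, matching the "Avg." column of the table.
average = {name: sum(row) / len(row) for name, row in accuracy.items()}

# Change relative to Source in percentage points, per scenario.
delta_vs_source = {
    name: {s: a - b for s, a, b in zip(scenarios, row, accuracy["source"])}
    for name, row in accuracy.items()
    if name != "source"
}

for name, avg in average.items():
    print(f"{name:17s} avg = {avg:.2f}")
for name, deltas in delta_vs_source.items():
    formatted = ", ".join(f"{s}: {d:+.2f}" for s, d in deltas.items())
    print(f"{name:17s} {formatted}")
```

Only the combined configuration has a positive change in every scenario, which is the pattern the ablation relies on.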

5 Related Work

5.1 Unsupervised Domain Adaptation

Traditional unsupervised domain adaptation copes with distribution shift by jointly optimizing a model on labeled source data and unlabeled target data, e.g., by designing a domain discriminator to learn domain-invariant features [20, 23, 37]. During training, these approaches often use a discrepancy loss [long2015learning] or adversarial training [17, 18] to align the feature distributions of the two domains. More recently, to avoid access to the source data, some authors have proposed source-free unsupervised domain adaptation methods based on generative models [33, 21] or information maximization [21]. However, all of these methods optimize the model offline through multiple rounds of training.

5.2 Test-time Adaptation

Test-time adaptation (TTA) attempts to adapt the pre-trained model without access to the source data [10, 34, 32, 31, 15, 7, 2, 30]. In some papers, TTA is also referred to as Source-Free Unsupervised Domain Adaptation (SFUDA). TENT [26] uses entropy minimization to adjust the parameters of the batch normalization layers, increasing the model's confidence during testing. Subsequent studies [27, 16, 4, 24] reduce error accumulation and catastrophic forgetting by fine-tuning the parameters and outputs at every iteration. For non-i.i.d. samples, on which most previous TTA methods fail, NOTE [6] presents Instance-Aware Batch Normalization (IABN) to normalize out-of-distribution samples and Prediction-Balanced Reservoir Sampling (PBRS) to simulate an i.i.d. data stream. RoTTA [35] presents a robust batch normalization scheme to estimate the normalization statistics, uses a memory bank to sample category-balanced data, and develops a time-aware re-weighting strategy with a teacher-student model. TTAB [38] presents a test-time adaptation benchmark for evaluating algorithms.
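As a minimal illustration of the entropy-minimization objective mentioned above, the sketch below computes the mean Shannon entropy of softmax predictions over a batch of logits. It only evaluates the loss; an entropy-based method such as TENT would additionally minimize it by gradient descent with respect to the normalization-layer parameters, which is not shown here. The function names and the example logits are assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)


def batch_entropy_loss(batch_logits):
    """Mean prediction entropy over a batch: the quantity to be minimized."""
    return sum(entropy(softmax(row)) for row in batch_logits) / len(batch_logits)


# Hypothetical logits for a 4-class problem.
uncertain = [[0.0, 0.0, 0.0, 0.0]]   # uniform prediction, maximal entropy
confident = [[10.0, 0.0, 0.0, 0.0]]  # peaked prediction, low entropy

print(batch_entropy_loss(uncertain))
print(batch_entropy_loss(confident))
```

Lower values correspond to more confident predictions, which is why minimizing this loss pushes the model toward confident outputs on test batches.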

6 Conclusion

This paper presents a novel one-size-fits-all solution, Diversity Adaptive Test-Time Adaptation (DATTA), which adaptively selects appropriate batch normalization and back-propagation strategies based on the scenario. It uses a Diversity Score-based evaluation of each data batch to dynamically adapt the BN method, enabling the model to achieve a more robust representation that adapts to both static and dynamic data patterns. DATTA comprises Diversity Discrimination (DD), Diversity Adaptive Batch Normalization (DABN), and Diversity Adaptive Fine-Tuning (DAFT), the last of which helps to prevent unnecessary and potentially harmful back-propagation. Experimental results validate the robustness and effectiveness of DATTA, demonstrating its ability to maintain stable model performance while adapting to changes in data flow patterns.

References

[1] Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., et al.: Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046 (2019)
[2] Chen, D., Wang, D., Darrell, T., Ebrahimi, S.: Contrastive Test-Time Adaptation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 295–305. IEEE, New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPR52688.2022.00039
[3] Choi, S., Jung, S., Yun, H., Kim, J.T., Kim, S., Choo, J.: Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11580–11590 (2021)
[4] Gan, Y., Bai, Y., Lou, Y., Ma, X., Zhang, R., Shi, N., Luo, L.: Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation (2023)
[5] Gong, T., Jeong, J., Kim, T., Kim, Y., Shin, J., Lee, S.J.: Note: Robust continual test-time adaptation against temporal correlation. Advances in Neural Information Processing Systems 35, 27253–27266 (2022)
[6] Gong, T., Jeong, J., Kim, T., Kim, Y., Shin, J., Lee, S.J.: NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation (2023)
[7] Goyal, S., Sun, M., Raghunathan, A., Kolter, Z.: Test-Time Adaptation via Conjugate Pseudo-labels
[8] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
[9] Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)
[10] Iwasawa, Y., Matsuo, Y.: Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization
[11] Kairouz, P., McMahan, H.B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A.N., et al.: Advances and open problems in federated learning. In: Advances in Neural Information Processing Systems. pp. 11769–11780 (2019)
[12] LEARNING, T.S.I.M.: Dataset shift in machine learning
[13] Liu, J., Yang, S., Jia, P., Zhang, R., Lu, M., Guo, Y., Xue, W., Zhang, S.: Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv preprint arXiv:2306.04344 (2023)
[14] Nado, Z., Padhy, S., Sculley, D., D’Amour, A., Lakshminarayanan, B., Snoek, J.: Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift (2021)
[15] Nguyen, A.T., Nguyen-Tang, T., Lim, S.N., Torr, P.H.: TIPI: Test Time Adaptation with Transformation Invariance. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24162–24171. IEEE, Vancouver, BC, Canada (2023). https://doi.org/10.1109/CVPR52729.2023.02314
[16] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., Tan, M.: Efficient Test-Time Model Adaptation without Forgetting
[17] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., Tan, M.: Efficient test-time model adaptation without forgetting. In: The International Conference on Machine Learning (2022)
[18] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., Tan, M.: Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400 (2023)
[19] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., Tan, M.: Towards Stable Test-Time Adaptation in Dynamic Wild World (2023)
[20] Pei, Z., Cao, Z., Long, M., Wang, J.: Multi-adversarial domain adaptation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
[21] Qiu, Z., Zhang, Y., Lin, H., Niu, S., Liu, Y., Du, Q., Tan, M.: Source-free domain adaptation via avatar prototype generation and adaptation. arXiv preprint arXiv:2106.15326 (2021)
[22] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: International conference on machine learning. pp. 5389–5400. PMLR (2019)
[23] Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3723–3732 (2018)
[24] Song, J., Lee, J., Kweon, I.S., Choi, S.: EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11920–11929. IEEE, Vancouver, BC, Canada (2023). https://doi.org/10.1109/CVPR52729.2023.01147
[25] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020)
[26] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully Test-time Adaptation by Entropy Minimization (2021)
[27] Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual test-time domain adaptation. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2022)
[28] Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual Test-Time Domain Adaptation (2022)
[29] Wang, W., Zhong, Z., Wang, W., Chen, X., Ling, C., Wang, B., Sebe, N.: Dynamically instance-guided adaptation: A backward-free approach for test-time domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24090–24099 (2023)
[30] Wu, C., Pan, Y., Li, Y., Wang, J.Z.: Learning to Adapt to Online Streams with Distribution Shifts (2023)
[31] Wu, Q., Yue, X., Sangiovanni-Vincentelli, A.: Domain-agnostic Test-time Adaptation by Prototypical Training with Auxiliary Data
[32] Yang, H., Chen, C., Jiang, M., Liu, Q., Cao, J., Heng, P.A., Dou, Q.: DLTTA: Dynamic Learning Rate for Test-Time Adaptation on Cross-Domain Medical Images. IEEE Transactions on Medical Imaging 41(12), 3575–3586 (2022). https://doi.org/10.1109/TMI.2022.3191535
[33] Yang, S., Wang, Y., Herranz, L., Jui, S., van de Weijer, J.: Casting a bait for offline and online source-free domain adaptation. Computer Vision and Image Understanding 234, 103747 (2023)
[34] You, F., Li, J., Zhao, Z.: Test-time batch statistics calibration for covariate shift (2021)
[35] Yuan, L., Xie, B., Li, S.: Robust Test-Time Adaptation in Dynamic Scenarios (2023)
[36] Yuan, L., Xie, B., Li, S.: Robust test-time adaptation in dynamic scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15922–15932 (2023)
[37] Zhang, Y., Hooi, B., Hong, L., Feng, J.: Test-agnostic long-tailed recognition by test-time aggregating diverse experts with self-supervision. arXiv preprint arXiv:2107.09249 2(5), 6 (2021)
[38] Zhao, H., Liu, Y., Alahi, A., Lin, T.: On Pitfalls of Test-Time Adaptation (2023)