Estimating before Debiasing: A Bayesian Approach to Detaching Prior Bias in Federated Semi-Supervised Learning

Guogang Zhu1    Xuefeng Liu1,2    Xinghao Wu1    Shaojie Tang3    Chao Tang1    Jianwei Niu1,2,*    Hao Su1
1 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China
2 Zhongguancun Laboratory, Beijing, China
3 Jindal School of Management, The University of Texas at Dallas, Richardson, TX, USA
* Jianwei Niu is the corresponding author.
{buaa_zgg, liu_xuefeng, wuxinghao}@buaa.edu.cn, shaojie.tang@utdallas.edu, {sy2106322, niujianwei, bhsuhao}@buaa.edu.cn
Abstract

Federated Semi-Supervised Learning (FSSL) leverages both labeled and unlabeled data on clients to collaboratively train a model. In FSSL, heterogeneous data can introduce prediction bias into the model, causing the model’s predictions to skew towards certain classes. Existing FSSL methods primarily tackle this issue by enhancing consistency in model parameters or outputs. However, as the models themselves are biased, merely constraining their consistency is not sufficient to alleviate prediction bias. In this paper, we explore this bias from a Bayesian perspective and demonstrate that it principally originates from label prior bias within the training data. Building upon this insight, we propose a debiasing method for FSSL named FedDB. FedDB utilizes the Average Prediction Probability of Unlabeled Data (APP-U) to approximate the biased prior. During local training, FedDB employs APP-U to refine pseudo-labeling through Bayes’ theorem, thereby significantly reducing the label prior bias. Concurrently, during model aggregation, FedDB uses APP-U from participating clients to formulate unbiased aggregate weights, thereby effectively diminishing bias in the global model. Experimental results show that FedDB can surpass existing FSSL methods. The code is available at https://github.com/GuogangZhu/FedDB.

1 Introduction

Federated Learning (FL) McMahan et al. (2017) is a distributed learning paradigm that can facilitate collaborative model training among multiple clients while preserving data privacy. Presently, most FL methods are confined to supervised learning (SL) settings, wherein it is presumed that each client maintains a fully labeled dataset. Nevertheless, in real-world applications, data labeling is notably laborious and time-consuming. Therefore, a more realistic case involves each client possessing a mix of unlabeled and labeled data. This specific scenario, known as Federated Semi-Supervised Learning (FSSL), has been explored in various studies Jeong et al. (2021); Lin et al. (2021); Diao et al. (2022) and is garnering increasing interest within the FL research community.

In this study, we focus on an FSSL setting where the data on each client are class-imbalanced. Moreover, it is assumed that both intra-client and inter-client data heterogeneity exist. Specifically, intra-client data heterogeneity means that the labeled data and unlabeled data on an individual client follow different distributions. Inter-client data heterogeneity means that the overall data distributions across clients are not independent and identically distributed (Non-IID).

Figure 1: Class-wise test accuracy on a balanced test dataset, along with the labeled data distribution on an individual client. (a) Test accuracy of local model, (b) Test accuracy of global model. The class indexes are ranked based on the labeled data distribution.

In the described scenario, the model’s predictions can skew towards certain classes during training, a phenomenon referred to as prediction bias. Figure 1 presents experimental results obtained in the above scenario, where the overall distributions of labeled and unlabeled data are balanced. It can be observed that, due to class imbalance on the local client, the local model’s predictions gradually skew towards the major classes in the local data. More importantly, this prediction bias is not alleviated by model aggregation, even though the overall distributions are balanced. Instead, it evolves into a different form of bias due to the influence of other clients. This bias can disrupt the pseudo-labeling process, creating a ‘vicious cycle’ between pseudo-labeling and local model training.

Existing FSSL methods attribute the above issue to the divergence across clients caused by heterogeneous data and primarily address it by promoting consistency between model parameters or outputs Zhang et al. (2021); Jiang et al. (2022); Liang et al. (2022). However, as both the local and global models are biased, merely constraining their consistency cannot fundamentally mitigate the model prediction bias.

In this paper, we examine the root cause of the prediction bias in FSSL from a Bayesian perspective. Based on Bayes’ rule, the model prediction is as follows:

\begin{equation}
\bm{p}(\bm{y}|\bm{x}) = \frac{\bm{p}(\bm{x}|\bm{y})\,\bm{p}(\bm{y})}{\bm{p}(\bm{x})}, \tag{1}
\end{equation}

where $\bm{p}(\bm{y}|\bm{x})$ is the model’s prediction, $\bm{p}(\bm{x}|\bm{y})$ is the class-conditional likelihood, and $\bm{p}(\bm{y})$ is the label prior. As shown in Figure 2, the label priors of both the local labeled data (i.e., $\bm{p}_l(\bm{y})$) and the unlabeled data (i.e., $\hat{\bm{p}}_u(\bm{y})$) are biased. Consequently, the model can gradually absorb these biases during training. These biases are eventually injected into the global model through model aggregation, causing its output prior $\bm{p}_s(\bm{y})$ to skew towards certain classes. When conducting inference on a balanced test dataset (i.e., with label prior $\bm{p}_t(\bm{y})$), the model may suffer severe performance degradation, as $\bm{p}_s(\bm{y}) \neq \bm{p}_t(\bm{y})$.
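To make the role of the label prior in Eq. (1) concrete, the following minimal Python sketch applies Bayes’ rule to a two-class toy case. The likelihood values and the two priors are illustrative assumptions, not numbers from the paper; the sketch only shows that a skewed prior can flip the predicted class even when the class-conditional likelihoods are unchanged.

```python
def posterior(likelihoods, prior):
    """Bayes' rule over classes: p(y|x) = p(x|y) p(y) / p(x)."""
    joint = [lik * pr for lik, pr in zip(likelihoods, prior)]
    evidence = sum(joint)  # p(x) = sum_y p(x|y) p(y)
    return [j / evidence for j in joint]


# Hypothetical class-conditional likelihoods p(x|y) for a single input x.
likelihoods = [0.4, 0.6]

balanced = posterior(likelihoods, [0.5, 0.5])  # uniform prior (balanced test data)
skewed = posterior(likelihoods, [0.9, 0.1])    # prior biased towards class 0

print(balanced)
print(skewed)
```

Under the uniform prior the second class receives the larger posterior, whereas under the skewed prior the first class does, which mirrors the mismatch between the training-time prior and the balanced test prior described above.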

Figure 2: Prior bias in class-imbalanced FSSL.

Nevertheless, $\bm{p}_s(\bm{y})$ is commonly challenging to estimate. On the one hand, in local clients, the ambiguity of pseudo-labels for unlabeled data makes the label prior bias during local training intractable. On the other hand, in the server, model aggregation combines influences from participating clients, further complicating the estimation of prior bias.

Taking the class-wise accuracy on a balanced test dataset as the ground truth for prior bias, we find that the Average Prediction Probability of Unlabeled Data (APP-U) serves as a robust metric to approximate this bias. Figure 3 illustrates the Jensen–Shannon (JS) divergence Lin (1991) between the ground truth bias and either the labeled data distribution or APP-U, where the solid line and shaded area represent the mean and range across clients, respectively. Interestingly, it reveals that for both the global and local models, prior bias does not consistently align with the labeled data distribution. Rather, it shows a stronger correlation with APP-U, indicating that APP-U can effectively quantify the prior bias.

Figure 3: JS divergence between the ground truth bias and either the labeled data distribution or APP-U on clients. (a) Results on the local model, (b) Results on the global model.

Building upon the above insights, we introduce a hierarchical debiasing method for FSSL, termed FedDB, to mitigate the prior bias at both the local training and global aggregation stages. During local training, FedDB implements debiased pseudo-labeling (DPL) based on Bayes’ theorem, with APP-U serving as the approximation of the biased prior. This approach promotes a more balanced pseudo-labeling process for unlabeled data, substantially reducing the label prior bias during local training. At the global aggregation stage, FedDB utilizes APP-U from the participating clients to determine the aggregation weights. This process, termed debiased model aggregation (DMA), mitigates bias within the global model. It should be noted that DPL can be integrated at minimal cost into FSSL methods that utilize pseudo-labeling, which indicates its potential for practical applications of FSSL.

The main contributions of this paper are as follows:

  • We analyze the prediction bias in class-imbalanced FSSL from a Bayesian perspective.

  • We propose FedDB, a Bayesian debiasing method for FSSL that uses APP-U as an approximation of prior bias.

  • We conduct extensive experiments on multiple datasets to demonstrate the effectiveness of FedDB.

2 Related Work

2.1 Federated Learning

Data heterogeneity is a substantial challenge in FL, as it can lead to considerable divergence across clients, thereby degrading model performance Zhao et al. (2018); Li et al. (2020a). To address this issue, various strategies have been explored, including reducing the divergence across local models Li et al. (2020b); Acar et al. (2020); Karimireddy et al. (2020), enhancing aggregation schemes Wang et al. (2020); Acar et al. (2020); Reddi et al. (2021), promoting representation consistency across clients Tan et al. (2022); Zhu et al. (2023); Liao et al. (2023), and developing personalized models for individual clients Collins et al. (2021); Liu et al. (2023). However, these methods primarily focus on SL settings, which is often impractical as data labeling is laborious and time-consuming.

2.2 Semi-Supervised Learning

SSL aims to mitigate the reliance on labeled data, which has prompted various mechanisms to leverage the latent information within unlabeled data. Pseudo-labeling Lee and others (2013); Wang et al. (2023), also known as self-training, involves assigning pseudo-labels to unlabeled samples with high confidence, enabling their incorporation into the training process. Consistency regularization Miyato et al. (2018) introduces arbitrary perturbations to unlabeled samples and promotes consistent predictions between different views of the same sample. Additionally, hybrid methods that combine these approaches have been developed, such as MixMatch Berthelot et al. (2019) and FixMatch Sohn et al. (2020). Recently, SSL research has focused on class imbalance, leading to various studies such as class-rebalancing sampling Wei et al. (2021) and pseudo-label sampling Guo and Li (2022). However, simply combining these methods with FL is challenging, as they ignore the collaboration across clients.

2.3 Federated Semi-Supervised Learning

FSSL can be divided into three distinct scenarios Bai et al. (2023): (1) Labels-at-Partial-Clients, where only a few clients have full labels, while the rest possess only unlabeled data Liang et al. (2022); Li et al. (2023); (2) Labels-at-Server, where labeled data are only available at the server, with local clients merely having unlabeled data Zhang et al. (2021); Jeong et al. (2021); Diao et al. (2022); (3) Labels-at-Clients, where each client has mostly unlabeled data and a few labeled samples Jeong et al. (2021); Bai et al. (2023).

This paper focuses on the Labels-at-Clients scenario. Currently, several works have been proposed for this scenario. For instance, SemiFed Lin et al. (2021) assigns pseudo-labels to unlabeled data only when multiple models provide consistent predictions. FedMatch Jeong et al. (2021) enforces prediction consistency across multiple models. However, these methods primarily concentrate on encouraging consistency across clients, overlooking the inherent prior biases within the model — a critical factor leading to performance degradation in FSSL with class imbalance.

3 Preliminary and Background

In this section, we present the notations used in this paper, followed by a detailed discussion of the framework of FSSL.

3.1 Problem Setting and Notation of FSSL

We focus on an FSSL setting for a $K$-class classification task with a total of $M$ clients participating in the training. Each client $m$ maintains a labeled dataset $\mathcal{D}_l^m = \{(\bm{x}^n, \bm{y}^n)\}_{n=1}^{N_l^m}$ and an unlabeled dataset $\mathcal{D}_u^m = \{(\bm{x}^n)\}_{n=1}^{N_u^m}$, where $N_l^m$ and $N_u^m$ are the counts of labeled and unlabeled samples, respectively (typically, $N_u^m \gg N_l^m$), $\bm{x}^n \in \mathcal{X} \subseteq \mathbb{R}^d$ is the input sampled from a $d$-dimensional space, and $\bm{y}^n \in \mathcal{Y} \subseteq \{0,1\}^K$ is the one-hot label. For clarity, we sometimes omit the superscript denoting the client index in the following.

With a slight abuse of notation, we denote $N_l^k$ and $N_u^k$ as the numbers of samples in class $k$ under $\mathcal{D}_l$ and $\mathcal{D}_u$ for an arbitrary client, i.e., $\sum_{k=1}^{K} N_l^k = N_l$ and $\sum_{k=1}^{K} N_u^k = N_u$. In this paper, we assume that both $\mathcal{D}_l$ and $\mathcal{D}_u$ exhibit class imbalance, that is, $\exists i, j \in \{1, 2, \dots, K\}$ for which the ratio $\frac{N_l^i}{N_l^j}$ is significantly greater than $1$. In other words, the label prior distribution $\{p_l^1, \dots, p_l^K\}$ shifts from a uniform distribution $\{\frac{1}{K}\}^K$. This assumption is similarly applicable for $\mathcal{D}_u$.
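As a small illustration of this notation, the sketch below computes the empirical label prior and the largest class-count ratio from per-class sample counts. The function names and the example counts are hypothetical and chosen only for illustration.

```python
def label_prior(class_counts):
    """Empirical label prior {p^1, ..., p^K} from per-class sample counts."""
    total = sum(class_counts)
    return [count / total for count in class_counts]


def imbalance_ratio(class_counts):
    """Largest ratio N^i / N^j over all class pairs (inf if some class is empty)."""
    smallest = min(class_counts)
    return float("inf") if smallest == 0 else max(class_counts) / smallest


# Hypothetical labeled-sample counts for K = 4 classes on one client.
counts = [200, 100, 60, 40]
prior = label_prior(counts)
ratio = imbalance_ratio(counts)

print(prior, ratio)
```

A ratio well above 1, or equivalently a prior far from $\{\frac{1}{K}\}^K$, corresponds to the class-imbalanced case assumed here.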

Furthermore, we consider the setting in which both intra-client and inter-client data heterogeneity exist in the FL system. Intra-client heterogeneity refers to the differing distributions of labeled and unlabeled data within a single client, that is, $\forall m \in \{1, 2, \dots, M\}, \mathcal{D}_u^m \neq \mathcal{D}_l^m$. Inter-client heterogeneity, on the other hand, pertains to the dissimilar mixture distributions of both labeled and unlabeled data across clients, i.e., $\forall i, j \in \{1, 2, \cdots, M\}, i \neq j$, it holds that $\mathcal{D}_l^i + \mathcal{D}_u^i \neq \mathcal{D}_l^j + \mathcal{D}_u^j$.

The final objective of FSSL is to learn a global model $f(\bm{x};\bm{w}): \mathcal{X} \rightarrow \mathcal{Y}$ parameterized by $\bm{w}$ that can generalize well to a balanced test dataset whose label prior distribution is $\{\frac{1}{K}\}^K$. Given the input $\bm{x}^n$, we denote its corresponding output logits as $\bm{z}(\bm{x}^n) := f(\bm{x}^n;\bm{w})$, and the normalized prediction probability after the softmax layer as $\bm{p}(\bm{y}|\bm{x}^n) := \sigma(f(\bm{y}|\bm{x}^n;\bm{w}))$, where $\sigma(\cdot)$ is the softmax function. The detailed framework of FSSL is explained as follows.

3.2 Framework of FSSL

During each global round, the server first selects a random subset of clients $\mathcal{S}$ based on the activation rate $C$ and broadcasts the global model $\bm{w}$ to these clients. Subsequently, these clients perform local training for $E$ epochs using $\bm{w}$ as initial weights, resulting in the updated local model $\bm{w}_m$. Finally, the selected clients upload their local models $\bm{w}_m$ to the server for model aggregation. The training paradigms of labeled and unlabeled data in local clients are as follows.

For labeled data, the standard cross-entropy loss is applied to the weakly augmented version of samples to promote the discriminative objective, as shown below:

\begin{equation}
\mathcal{L}_s = \frac{1}{N_l} \sum_{n=1}^{N_l} \mathrm{H}(\bm{y}^n, \bm{p}(\bm{y}|\alpha(\bm{x}^n))), \tag{2}
\end{equation}

where $\alpha(\cdot)$ is the weak augmentation function, $\bm{p}(\bm{y}|\alpha(\bm{x}^n))$ is the prediction probability for $\alpha(\bm{x}^n)$, and $\mathrm{H}(\bm{p}_1, \bm{p}_2)$ is the cross-entropy between probability distributions $\bm{p}_1$ and $\bm{p}_2$.
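As a minimal sketch of Eq. (2), the following Python code computes the supervised loss from raw logits. It assumes that the weak augmentation $\alpha(\cdot)$ has already been applied upstream and that model outputs are given as plain lists; the helper names (`softmax`, `cross_entropy`, `supervised_loss`) and the `eps` constant are illustrative and are not taken from the FedDB code base.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs, eps=1e-12):
    """H(target, probs) = -sum_k target[k] * log(probs[k])."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def supervised_loss(logits_batch, onehot_batch):
    """Eq. (2): mean cross-entropy over the labeled samples."""
    losses = [
        cross_entropy(y, softmax(z)) for z, y in zip(logits_batch, onehot_batch)
    ]
    return sum(losses) / len(losses)


# One hypothetical labeled sample with uninformative logits: loss is log(2).
example_loss = supervised_loss([[0.0, 0.0]], [[1, 0]])
print(example_loss)
```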

For unlabeled data, the samples are pseudo-labeled using the trained model, after which they are incorporated into the training process. Specifically, for a given unlabeled sample $\bm{x}^n$, the model first generates the probability on its weakly augmented version. Then the pseudo-label is calculated by:

\begin{equation}
\hat{\bm{y}}^n = \arg\max(\bm{p}(\bm{y}|\alpha(\bm{x}^n))), \tag{3}
\end{equation}

where $\arg\max(\cdot)$ is the function that converts a probability distribution into a one-hot label based on its maximum value.

To enhance model generalization, the consistency loss is applied to unlabeled data by minimizing the cross-entropy between the pseudo-label and the prediction on the strongly augmented version of the sample.

During the training, only those unlabeled samples that exhibit high confidence are selected to participate in further training. Consequently, the overall optimization objective for the unlabeled data can be expressed as follows:

\begin{align}
\mathcal{L}_u = \frac{1}{N_u} \sum_{n=1}^{N_u} \; & \mathbb{1}(\max(\bm{p}(\bm{y}|\alpha(\bm{x}^n))) \geq \tau) \cdot \tag{4} \\
& \mathrm{H}(\hat{\bm{y}}^n, \bm{p}(\bm{y}|\mathcal{A}(\bm{x}^n))), \tag{5}
\end{align}

where $\tau$ is the threshold, $\mathbb{1}(\cdot)$ is the indicator function, and $\mathcal{A}(\cdot)$ is the strong augmentation function.
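The pseudo-labeling rule in Eq. (3) and the confidence-masked loss in Eqs. (4)-(5) can be sketched as follows. The sketch assumes that the prediction probabilities on the weakly and strongly augmented views are already available as lists; the function names, the example values, and the default `tau=0.95` are illustrative assumptions rather than settings stated in this section.

```python
import math


def pseudo_label(probs):
    """Eq. (3): one-hot vector at the most probable class."""
    top = max(range(len(probs)), key=lambda k: probs[k])
    return [1.0 if k == top else 0.0 for k in range(len(probs))]


def unsupervised_loss(weak_probs, strong_probs, tau=0.95, eps=1e-12):
    """Eqs. (4)-(5): masked cross-entropy, averaged over all N_u samples."""
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        if max(p_weak) >= tau:  # indicator: keep only confident samples
            y_hat = pseudo_label(p_weak)
            total -= sum(t * math.log(q + eps) for t, q in zip(y_hat, p_strong))
    return total / len(weak_probs)


# Two hypothetical unlabeled samples; only the first passes the threshold.
weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.50, 0.50]]
loss = unsupervised_loss(weak, strong)
print(loss)
```

Note that the sum is divided by the total number of unlabeled samples, not by the number of samples that pass the threshold, which matches the $\frac{1}{N_u}$ factor in the objective.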

The overall optimization objective of local training on clients is expressed as:

\begin{equation}
\mathcal{L} = \mathcal{L}_s + \lambda \mathcal{L}_u, \tag{6}
\end{equation}

where $\lambda$ is used to balance these two loss terms.

After local training, the selected clients send the latest local models to the server for model aggregation, as shown below:

\begin{equation}
\bm{w}^{t+1} = \frac{1}{\left|\mathcal{S}_t\right|} \sum_{m \in \mathcal{S}_t} \bm{\beta}_m \cdot \bm{w}_m^t, \tag{7}
\end{equation}

where $\left|\mathcal{S}_t\right|$ is the number of selected clients in round $t$, $\bm{w}_m^t$ is the local model on client $m$ in round $t$, $\bm{\beta}_m$ is the aggregate weight for $\bm{w}_m^t$, and $\bm{w}^{t+1}$ is the global model in round $t+1$.
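A minimal sketch of the aggregation rule in Eq. (7) is given below. It makes two simplifying assumptions that are not stated in the text: each local model is flattened into a plain list of parameters, and each aggregate weight is a single scalar per client. With all weights equal to 1, the rule reduces to a plain average of the local models.

```python
def aggregate(local_models, betas):
    """Eq. (7): weighted sum of local models divided by the number of clients."""
    num_clients = len(local_models)
    dim = len(local_models[0])
    return [
        sum(beta * model[i] for beta, model in zip(betas, local_models)) / num_clients
        for i in range(dim)
    ]


# Two hypothetical clients with two-parameter models and unit weights.
global_model = aggregate([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
print(global_model)
```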

4 FedDB: Detaching Prior Bias in FSSL

This section details the framework of FedDB and its two key techniques: debiased pseudo-labeling (DPL) and debiased model aggregation (DMA).

4.1 Framework Overview of FedDB

Figure 4 illustrates the framework of FedDB. During the training, each global round consists of the following steps:

  1. The server selects a subset of clients for training and broadcasts the global model to these clients;

  2. The clients perform inference on unlabeled data and calculate APP-U to estimate the prior bias;

  3. The clients perform DPL on unlabeled data using APP-U;

  4. The clients train the model utilizing both labeled data and pseudo-labeled data;

  5. The clients upload local models and APP-U to the server. The server performs DMA using APP-U from clients;

  6. Steps 1-5 are repeated until the global model converges.

Figure 4: Framework overview of FedDB.

4.2 Prior Bias Estimation

In this paper, we consider an FSSL setting where both the labeled data and unlabeled data are class imbalanced. In such a case, the model’s predictions can skew towards certain classes, owing to the biased label prior in the training data. This skew contradicts the training objective of FSSL, which is to achieve uniform performance across all classes.

To investigate the impact of class imbalance on model training in FSSL, we conduct preliminary experiments using the CIFAR10 dataset. We establish a scenario with 10 clients, each participating in model training in every round. The numbers of labeled and unlabeled samples are set to 4000 and 46000, respectively. The class imbalance is created by the Dirichlet distribution, as described in Section 5.

As shown in Figure 1, both the local and global models exhibit a biased prediction towards certain classes. However, estimating the above bias in FSSL is challenging due to the data heterogeneity and imprecision in pseudo-labeling. Through extensive experiments, we discover that the prior bias can be effectively approximated by the Average Prediction Probability of Unlabeled Data (APP-U). Specifically, for client $m$, the APP-U, denoted by $\overline{\bm{p}}_m$, can be calculated by:

\begin{equation}
\overline{\bm{p}}_m = \frac{\sum_{n=1}^{N_u^m} \bm{p}(\bm{y}|\alpha(\bm{x}_u^n))}{N_u^m}, \tag{8}
\end{equation}

where $N_u^m$ denotes the total number of unlabeled samples on client $m$, and $\bm{p}(\bm{y}|\alpha(\bm{x}_u^n))$ is the prediction probability of the weak augmentation of sample $\bm{x}_u^n$.
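Eq. (8) is a per-class mean of the prediction probabilities over a client’s unlabeled samples. The sketch below assumes these probabilities have already been computed on the weakly augmented samples and are passed in as a list of probability vectors; the function name `app_u` and the example values are hypothetical.

```python
def app_u(prob_batch):
    """Eq. (8): class-wise average of prediction probabilities on unlabeled data."""
    num_samples = len(prob_batch)
    num_classes = len(prob_batch[0])
    return [
        sum(probs[k] for probs in prob_batch) / num_samples
        for k in range(num_classes)
    ]


# Hypothetical predictions for two unlabeled samples and K = 2 classes.
estimate = app_u([[0.8, 0.2], [0.6, 0.4]])
print(estimate)
```

Because each input vector sums to 1, the resulting average also sums to 1 and can be read as an estimate of the label prior absorbed by the model.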

We adopt JS divergence as a metric to quantify the disparity between two distributions. A larger JS divergence indicates a greater disparity between the distributions. Taking the class-wise accuracy on a balanced test dataset as the ground truth bias, we calculate the JS divergence between it and either APP-U or the labeled data distribution. As shown in Figure 3, for both local and global models, the JS divergence between APP-U and the ground truth is significantly smaller than that between the labeled data distribution and the ground truth. This demonstrates the effectiveness of APP-U as a metric for quantifying prior bias in FSSL.
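The comparison above can be sketched with a small JS divergence routine. This is an illustrative implementation using the natural logarithm; normalizing each input vector (for example, class-wise accuracies) so that it sums to 1 is an assumption made here, since the text does not spell out the normalization.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # assumption: inputs are normalized to sum to 1
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0 because b = m.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With the natural logarithm the value lies between 0 (identical distributions) and $\ln 2$ (disjoint supports), so smaller values indicate that the candidate estimate tracks the reference distribution more closely.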

4.3 Debiased Pseudo-Labeling

In this subsection, we detail the procedure of DPL. Given an FL model parameterized by $\bm{w}$, we first obtain the prediction probability $\bm{p}_s(\bm{y}|\bm{x})$ by applying a softmax function to the unnormalized logits:

$$\bm{p}_s(y|\bm{x}) = \frac{e^{\bm{z}(\bm{x})[y]}}{\sum_{k=1}^{K} e^{\bm{z}(\bm{x})[k]}}, \tag{9}$$

where $\bm{z}(\bm{x})[y]$ is the $y$-th unnormalized logit.

By applying Bayes' theorem to $\bm{p}_s(y|\bm{x})$, we obtain:

$$\bm{p}_s(y|\bm{x}) = \frac{\bm{p}_s(y)\,\bm{p}_s(\bm{x}|y)}{\sum_{k=1}^{K} \bm{p}_s(k)\,\bm{p}_s(\bm{x}|k)}. \tag{10}$$

Due to the class imbalance in our FSSL settings, the prior distribution $\bm{p}_s(k)$ output by the model is biased towards certain majority classes. This leads to a biased prediction probability $\bm{p}_s(y|\bm{x})$, causing the model to be overconfident in these majority classes. The objective of DPL is to seek a conditional probability $\bm{p}_t(y|\bm{x})$ that is robust across all classes, given the estimate of the model's biased prior $\overline{\bm{p}}$, as defined in Eq. (8).

Following previous studies Tian et al. (2020); Kairouz et al. (2021); Hong et al. (2021), we assume that the class-conditional likelihoods are the same in both the biased and debiased predictions, i.e., $\bm{p}_t(\bm{x}|y) = \bm{p}_s(\bm{x}|y)$. By rearranging Eq. (9) and Eq. (10), we have:

$$\begin{aligned} \ln(\bm{p}_t(y)\,\bm{p}_t(\bm{x}|y)) ={} & \bm{z}(\bm{x})[y] + \ln(\bm{p}_t(y)) - \ln(\bm{p}_s(y)) \\ & + \ln\Big(\sum_{k=1}^{K} \bm{p}_s(k)\,\bm{p}_s(\bm{x}|k)\Big) \\ & - \ln\Big(\sum_{k=1}^{K} e^{\bm{z}(\bm{x})[k]}\Big). \end{aligned} \tag{11}$$

Recalling that:

$$\bm{z}(\bm{x})[y] = \ln \bm{p}_s(y|\bm{x}) + \ln \sum_{k=1}^{K} \bm{p}_s(k)\,\bm{p}_s(\bm{x}|k). \tag{12}$$

We derive the following debiased posterior probability:

$$\begin{aligned} \bm{p}_t(y|\bm{x}) &= \frac{\bm{p}_t(y)\,\bm{p}_t(\bm{x}|y)}{\sum_{k=1}^{K} \bm{p}_t(k)\,\bm{p}_t(\bm{x}|k)} \\ &= \frac{\bm{p}_s(y|\bm{x})\,\bm{p}_t(y)/\bm{p}_s(y)}{\sum_{k=1}^{K} \bm{p}_s(k|\bm{x})\,\bm{p}_t(k)/\bm{p}_s(k)}, \end{aligned} \tag{13}$$

where $\bm{p}_t(k)$ is a uniform distribution that is robust for all classes. Using the estimated bias $\overline{\bm{p}}$ as an approximation of the prior bias $\bm{p}_s$, we obtain the debiased prediction probability of unlabeled data as follows:

$$\hat{\bm{p}} = \frac{\bm{p}(\bm{y}|\bm{x})/\overline{\bm{p}}}{\sum_{k=1}^{K} \bm{p}(k|\bm{x})/\overline{\bm{p}}_k}. \tag{14}$$

Intuitively, Eq. (14) acts as a regularizer that smooths the prediction probabilities of the majority classes and sharpens those of the minority classes, which alleviates the prior bias introduced by heterogeneous data. The detailed procedure of DPL is shown in Algorithm 1.
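The step between the two lines of Eq. (13), and from there to Eq. (14), can be written out explicitly. The following is a sketch under the stated assumption $\bm{p}_t(\bm{x}|y) = \bm{p}_s(\bm{x}|y)$, with the shorthand $\bm{p}_s(\bm{x}) = \sum_{k=1}^{K} \bm{p}_s(k)\,\bm{p}_s(\bm{x}|k)$ introduced here for the marginal:

```latex
\begin{aligned}
\bm{p}_s(\bm{x}|y) &= \frac{\bm{p}_s(y|\bm{x})\,\bm{p}_s(\bm{x})}{\bm{p}_s(y)}
  && \text{(Eq. (10) rearranged)} \\
\bm{p}_t(y)\,\bm{p}_t(\bm{x}|y) &= \bm{p}_t(y)\,\bm{p}_s(\bm{x}|y)
  = \bm{p}_s(\bm{x})\,\frac{\bm{p}_s(y|\bm{x})\,\bm{p}_t(y)}{\bm{p}_s(y)}
  && \text{(shared likelihood)} \\
\bm{p}_t(y|\bm{x}) &= \frac{\bm{p}_s(\bm{x})\,\bm{p}_s(y|\bm{x})\,\bm{p}_t(y)/\bm{p}_s(y)}
  {\bm{p}_s(\bm{x})\sum_{k=1}^{K}\bm{p}_s(k|\bm{x})\,\bm{p}_t(k)/\bm{p}_s(k)}
  = \frac{\bm{p}_s(y|\bm{x})\,\bm{p}_t(y)/\bm{p}_s(y)}
  {\sum_{k=1}^{K}\bm{p}_s(k|\bm{x})\,\bm{p}_t(k)/\bm{p}_s(k)}
  && \text{($\bm{p}_s(\bm{x})$ cancels)} \\
\bm{p}_t(y|\bm{x}) &= \frac{\bm{p}_s(y|\bm{x})/\bm{p}_s(y)}
  {\sum_{k=1}^{K}\bm{p}_s(k|\bm{x})/\bm{p}_s(k)}
  && \text{(uniform $\bm{p}_t(k) = 1/K$ cancels)}
\end{aligned}
```

Replacing the unknown $\bm{p}_s(k)$ in the last line with its estimate $\overline{\bm{p}}_k$ gives Eq. (14).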

Algorithm 1 DPL: Debiased Pseudo-labeling
Input: Confidence threshold $\tau$
Output: Debiased pseudo-labels $\hat{\bm{Y}}$, APP-U $\overline{\bm{p}}$
$\overline{\bm{p}} = \frac{\sum_{n=1}^{N_u} \bm{p}(\bm{y}|\alpha(\bm{x}_u^n))}{N_u}$, $\hat{\bm{Y}} := \{\}$
for $n = 1, 2, \ldots, N_u$ do
    $\hat{\bm{p}}^n := \frac{\bm{p}(\bm{y}|\alpha(\bm{x}_u^n))/\overline{\bm{p}}}{\sum_{k=1}^{K} \bm{p}(k|\alpha(\bm{x}_u^n))/\overline{\bm{p}}_k}$
    if $\max(\hat{\bm{p}}^n) \geq \tau$ then
        $\hat{\bm{Y}} := \hat{\bm{Y}} \oplus \arg\max(\hat{\bm{p}}^n)$
    else
        $\hat{\bm{Y}} := \hat{\bm{Y}} \oplus \{0\}^K$
Return $\hat{\bm{Y}}$, $\overline{\bm{p}}$
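Algorithm 1 can be sketched in vectorized form as follows. This is an illustrative reading, not the authors' code: the function name `debiased_pseudo_labels` is hypothetical, and storing accepted labels as one-hot rows (with all-zero rows for rejected samples) is an assumption made so that the output matches the $\{0\}^K$ placeholder in the algorithm.

```python
import numpy as np

def debiased_pseudo_labels(probs, tau=0.95):
    """Debias softmax outputs with APP-U and threshold them (sketch of Algorithm 1).

    probs: array of shape (N_u, K), softmax outputs on weakly augmented
    unlabeled samples (strictly positive, so the division below is safe).
    Returns (pseudo_labels, app_u, debiased), where pseudo_labels has one-hot
    rows for confident samples and all-zero rows otherwise.
    """
    probs = np.asarray(probs, dtype=float)
    app_u = probs.mean(axis=0)                           # Eq. (8)
    ratio = probs / app_u                                # divide by the estimated prior
    debiased = ratio / ratio.sum(axis=1, keepdims=True)  # Eq. (14)
    confident = debiased.max(axis=1) >= tau
    pseudo_labels = np.zeros_like(debiased)
    rows = np.flatnonzero(confident)
    pseudo_labels[rows, debiased[rows].argmax(axis=1)] = 1.0
    return pseudo_labels, app_u, debiased
```

Note that dividing by APP-U can change the predicted class of a borderline sample: a prediction that leans slightly towards an over-represented class may flip to a minority class after debiasing.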

4.4 Debiased Model Aggregation

The objective of DMA is to compute aggregation weights that enable the model to perform uniformly across all classes. During each local updating round, the activated clients send their accumulated APP-U $\overline{\bm{p}}_m$ and their latest models $\bm{w}_m$ to the server. The aggregated APP-U is then obtained as follows:

$$\overline{\bm{p}}_{aggr} = \sum_{m \in \mathcal{S}_t} \bm{\beta}_m \overline{\bm{p}}_m, \tag{15}$$

where $\bm{\beta}_m$ denotes the aggregation weight for client $m$. To achieve a more balanced model, we expect $\overline{\bm{p}}_{aggr}$ to be more uniform, leading to the following optimization objective:

$$\begin{aligned} \min_{\bm{\beta}} \mathcal{L}_{aggr} &= \sqrt{\sum_{m=1}^{M} (\overline{\bm{p}}_{aggr} - \bm{p}_t)^2} \\ \text{s.t.} \ & \sum_{m \in \mathcal{S}_t} \bm{\beta}_m = 1, \end{aligned} \tag{16}$$

where $\bm{p}_t = \{\frac{1}{K}\}^K$ is the uniform distribution over $K$ classes, identical to the class distribution of the test dataset. In FedDB, we use gradient descent to solve the above optimization problem.

After obtaining the aggregation weights $\bm{\beta}$, we aggregate the client models and update the global model as follows:

$$\bm{w}^{t+1} = \sum_{m \in \mathcal{S}_t} \bm{\beta}_m \cdot \bm{w}_m^t, \tag{17}$$

where $\bm{w}_m^t$ is the local model of client $m$ at the last round, and $\bm{w}^{t+1}$ is the global model. $\bm{w}^{t+1}$ is then broadcast to the activated clients for further updates. The processes of DMA and FedDB are presented in Algorithms 2 and 3, respectively.

Algorithm 2 DMA: Debiased Model Aggregation
Input: Local models $\{\bm{w}_m\}_{m=1}^{M}$, local APP-U $\{\overline{\bm{p}}_m\}_{m=1}^{M}$, updating epochs $E_{aggr}$, learning rate $\eta_{aggr}$
Output: Global weight $\bm{w}$
Initialize $\bm{\beta}$ as $\{\frac{1}{M}\}^M$
for $e = 1, 2, \ldots, E_{aggr}$ do
    $\overline{\bm{p}}_{aggr} \leftarrow \sum_{m=1}^{M} \bm{\beta}_m \overline{\bm{p}}_m$
    $\mathcal{L}_{aggr} = \sqrt{\sum_{m=1}^{M} (\overline{\bm{p}}_{aggr} - \bm{p}_t)^2}$
    $\bm{\beta} \leftarrow \bm{\beta} - \eta_{aggr} \nabla \mathcal{L}_{aggr}$
    $\bm{\beta} = \sigma(\bm{\beta})$
$\bm{w} \leftarrow \sum_{m=1}^{M} \bm{\beta}_m \bm{w}_m$
Return $\bm{w}$
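One possible reading of Algorithm 2 is sketched below. Several choices here are assumptions and not confirmed by the text: the loss is interpreted as the Euclidean distance between the $K$-dimensional vectors $\overline{\bm{p}}_{aggr}$ and $\bm{p}_t$, the normalization $\sigma$ is taken to be a softmax, and the weights are parameterized as $\bm{\beta} = \text{softmax}(\bm{\theta})$ with gradient descent applied to the hypothetical logits $\bm{\theta}$, which keeps $\bm{\beta}$ on the simplex at every step. Models are represented as flat parameter vectors for brevity.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def aggregation_loss(beta, app_u):
    """Euclidean distance between the aggregated APP-U and the uniform target."""
    target = np.full(app_u.shape[1], 1.0 / app_u.shape[1])
    return float(np.linalg.norm(beta @ app_u - target))

def dma_weights(app_u, epochs=100, lr=1.0):
    """Gradient descent on softmax logits to make the aggregated APP-U uniform.

    app_u: array of shape (M, K); row m is the APP-U of client m.
    Returns aggregation weights beta of shape (M,) that sum to 1.
    """
    app_u = np.asarray(app_u, dtype=float)
    num_clients, num_classes = app_u.shape
    target = np.full(num_classes, 1.0 / num_classes)
    theta = np.zeros(num_clients)  # softmax(0) is the uniform start {1/M}^M
    for _ in range(epochs):
        beta = softmax(theta)
        diff = beta @ app_u - target
        norm = np.linalg.norm(diff)
        if norm < 1e-12:  # already uniform; the norm has no gradient at zero
            break
        grad_beta = app_u @ diff / norm
        # Chain rule through softmax: J = diag(beta) - beta beta^T
        grad_theta = beta * (grad_beta - beta @ grad_beta)
        theta -= lr * grad_theta
    return softmax(theta)

def aggregate(models, beta):
    """Weighted sum of flat client parameter vectors, as in Eq. (17)."""
    return beta @ np.asarray(models, dtype=float)
```

Clients whose APP-U is skewed towards classes that are already over-represented in the aggregate receive smaller weights, while clients covering under-represented classes are weighted up.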
Algorithm 3 FedDB: Detaching Prior Bias in FSSL
Input: Client number $M$, client activation rate $C$, global rounds $T$, update epochs $E$ and $E_{aggr}$, learning rates $\eta$ and $\eta_{aggr}$, threshold $\tau$, unlabeled loss weight $\lambda$, momentum accumulation coefficient $\gamma$
Output: Global model $\bm{w}^T$
Server executes:
    Initialize $\bm{w}^0$
    for $t = 1, 2, \ldots, T$ do
        $\mathcal{S}_t \leftarrow$ randomly select $M \cdot C$ clients
        for each client $m \in \mathcal{S}_t$ in parallel do
            $\bm{w}_m^t$, $\overline{\bm{p}}_m^t \leftarrow$ ClientUpdate($\bm{w}^{t-1}$)
        $\bm{w}^t \leftarrow$ DMA($\{\bm{w}_m^t\}_{m \in \mathcal{S}_t}$, $\{\overline{\bm{p}}_m^t\}_{m \in \mathcal{S}_t}$, $E_{aggr}$, $\eta_{aggr}$)
    Return $\bm{w}^T$
ClientUpdate($\bm{w}^t$):
    $\hat{\bm{Y}}, \overline{\bm{p}} \leftarrow$ DPL($\tau$)
    for $e = 1, 2, \ldots, E$ do
        $\overline{\bm{p}}^e = \frac{\sum_{n=1}^{N_u} \bm{p}(\bm{y}|\alpha(\bm{x}_u^n))}{N_u}$
        $\mathcal{L}_s = \frac{1}{N_l} \sum_{n=1}^{N_l} \mathrm{H}(\bm{y}^n, p(\bm{y}|\alpha(\bm{x}_l^n)))$
        $\mathcal{L}_u = \frac{1}{N_u} \sum_{n=1}^{N_u} \mathbb{1}(\max(\hat{\bm{Y}}^n) \geq \tau) \cdot \mathrm{H}(\hat{\bm{Y}}^n, p(\bm{y}|\mathcal{A}(\bm{x}_u^n)))$
        $\mathcal{L} \leftarrow \mathcal{L}_s + \lambda \mathcal{L}_u$
        $\bm{w}^e \leftarrow \bm{w}^{e-1} - \eta \nabla \mathcal{L}$;  $\overline{\bm{p}} \leftarrow \gamma \overline{\bm{p}} + (1 - \gamma) \overline{\bm{p}}^e$
    Return $\bm{w}^E$, $\overline{\bm{p}}$
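Two pieces of the client update in Algorithm 3 are easy to isolate: the momentum accumulation of APP-U and the masked unlabeled loss. The sketch below is illustrative only. The function names are hypothetical, pseudo-labels are assumed to be one-hot rows with all-zero rows for rejected samples (so that the zero rows play the role of the indicator mask), and no value of $\gamma$ is implied beyond the one passed in by the caller.

```python
import numpy as np

def update_app_u(app_u, epoch_app_u, gamma):
    """Momentum accumulation: p_bar <- gamma * p_bar + (1 - gamma) * p_bar^e."""
    return gamma * np.asarray(app_u, dtype=float) + (1.0 - gamma) * np.asarray(epoch_app_u, dtype=float)

def unlabeled_loss(pseudo_labels, strong_probs, eps=1e-12):
    """Cross-entropy between pseudo-labels and predictions on strong augmentations.

    pseudo_labels: (N_u, K) one-hot rows, or all-zero rows for rejected samples.
    strong_probs:  (N_u, K) softmax outputs on strongly augmented samples.
    Rejected samples contribute zero, and the sum is divided by N_u as in L_u.
    """
    pseudo_labels = np.asarray(pseudo_labels, dtype=float)
    strong_probs = np.asarray(strong_probs, dtype=float)
    per_sample = -(pseudo_labels * np.log(strong_probs + eps)).sum(axis=1)
    return float(per_sample.mean())
```

Since the accumulated APP-U is a convex combination of two probability vectors, it remains a valid distribution after every update.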

5 Experiments

This section details the experimental results in various settings to demonstrate the effectiveness of FedDB.

5.1 Experimental Setup

Datasets.

We evaluate FedDB on three benchmark datasets: CIFAR10, SVHN, and CIFAR100. Initially, a balanced labeled dataset is separated from the original training dataset, with the remaining data designated as the unlabeled dataset. When distributing the training data to clients, we sample from a Dirichlet distribution $\bm{q} \sim \text{Dir}(\delta \bm{p})$, where $\bm{p}$ is the class-wise prior distribution and $\delta$ is a parameter that controls the heterogeneity among clients. A higher value of $\delta$ corresponds to lower data heterogeneity. To enrich the unlabeled dataset, we add the samples from the labeled dataset to the unlabeled dataset after discarding their labels. We conduct experiments in the IID setting and in Non-IID settings with $\delta \in \{0.1, 0.3\}$. In the IID setting, the total number of labeled samples is set to 4000, 1000, and 10000 for CIFAR10, SVHN, and CIFAR100, respectively. In the Non-IID settings, the total number of labeled samples is set to 4000 for CIFAR10 and SVHN, and 10000 for CIFAR100. The test dataset from the original dataset is used for model evaluation.
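The Dirichlet sampling step can be illustrated as follows. This is a sketch under assumptions: the function name `dirichlet_partition`, the fixed number of samples per client, and the multinomial draw of per-class counts are hypothetical choices made for the example, and the partitioning code used in the experiments may differ.

```python
import numpy as np

def dirichlet_partition(num_clients, class_prior, delta, samples_per_client, seed=0):
    """Draw a class distribution q ~ Dir(delta * p) per client, then class counts.

    Returns (q, counts): q has shape (num_clients, K) with rows summing to 1,
    and counts[m, k] is the number of class-k samples assigned to client m.
    """
    rng = np.random.default_rng(seed)
    class_prior = np.asarray(class_prior, dtype=float)
    q = rng.dirichlet(delta * class_prior, size=num_clients)
    counts = np.stack([rng.multinomial(samples_per_client, q_m) for q_m in q])
    return q, counts
```

Smaller values of $\delta$ concentrate each client's distribution on fewer classes, which is the source of the label prior bias studied in this paper.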

Benchmark Methods.

We compare FedDB against the following benchmark methods:

  • FedAvg McMahan et al. (2017): The FedAvg method is applied in a constrained scenario where each client utilizes only the small labeled dataset for training.

  • FixMatch Sohn et al. (2020): This method is a basic adaptation of FixMatch within the FedAvg framework.

  • FedMatch Jeong et al. (2021): FedMatch introduces the inter-client consistency loss to maximize the agreement between local models.

  • FedRGD Zhang et al. (2021): It mitigates the model bias by reducing gradient divergence among clients.

  • SemiFL Diao et al. (2022): SemiFL alternates training between the server and the clients. Here, we adopt only its client-side training, because no training samples are available on the server in our scenario.

  • Methods combining DPL. We also conduct experiments that integrate DPL with benchmark methods. These hybrid methods are denoted as Method-FedDPL.

Implementation Details.

We primarily follow the experimental settings adopted in prior FSSL work Jeong et al. (2021). A total of 100 clients participate in training, with 10 active clients ($C = 0.1$) in each global round. The number of local training epochs is set to $E = 5$, and the number of epochs for updating the model aggregation weights is set to $E_{aggr} = 100$. All experiments run for 800 global rounds. We employ Wide ResNet28x2 in our experiments. The SGD optimizer is adopted for model training, with learning rates $\eta = 0.03$ for local updating and $\eta_{aggr} = 1.0$ for aggregation, and a momentum of 0.9. Due to the limited number of samples on clients, we feed all training data to the model simultaneously during local training. The confidence threshold for pseudo-labeling is set to $\tau = 0.95$. The data augmentation operations are consistent with those described in FixMatch Sohn et al. (2020). All experiments are repeated 4 times, and we report the mean and standard deviation of the best accuracy during training.
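For reference, the hyperparameters stated above can be collected in a single configuration object. The class and field names below are hypothetical; only the values come from the text, and $\lambda$ and $\gamma$ are omitted because their values are not given in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FedDBConfig:
    """Hyperparameters from the implementation details (field names are illustrative)."""
    num_clients: int = 100          # total clients M
    client_fraction: float = 0.1    # activation rate C
    local_epochs: int = 5           # E
    aggr_epochs: int = 100          # E_aggr
    global_rounds: int = 800        # T
    lr: float = 0.03                # eta, local updating
    aggr_lr: float = 1.0            # eta_aggr, aggregation weights
    momentum: float = 0.9           # SGD momentum
    tau: float = 0.95               # pseudo-label confidence threshold
    num_runs: int = 4               # repetitions per experiment
    model: str = "Wide ResNet28x2"

    @property
    def clients_per_round(self) -> int:
        return round(self.num_clients * self.client_fraction)
```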

5.2 Results on Benchmark Datasets

The experimental results are presented in Tables 1-3, where values outside the parentheses represent the mean and values inside the parentheses represent the standard deviation over multiple runs. With the same number of labeled samples, the accuracy of most methods decreases as $\delta$ decreases, demonstrating that data heterogeneity is a key factor harming model performance. FedAvg, despite its simplicity, serves as a reliable benchmark method, particularly as the dataset difficulty increases (e.g., CIFAR100, where it attains the highest accuracy in Tables 1 and 2). This issue is also noted by Diao et al. (2022). It shows that improperly incorporating unlabeled data can negatively impact model training. Compared with other FSSL methods, FedDB generally improves test accuracy, demonstrating its effectiveness in the FSSL scenario. The same conclusion can also be drawn from Figure 5.

| Method | CIFAR10 | SVHN | CIFAR100 |
| --- | --- | --- | --- |
| FedAvg | 58.42(0.61) | 25.10(0.76) | 32.00(0.80) |
| FixMatch | 65.80(2.72) | 87.44(1.35) | 24.72(0.73) |
| FedMatch | 39.63(1.66) | 25.09(5.40) | 9.44(0.66) |
| FedRGD | 63.27(1.47) | 81.04(2.43) | 14.45(0.42) |
| SemiFL | 57.24(7.96) | 85.58(10.03) | 22.61(3.07) |
| FixMatch-FedDPL | 66.97(2.84) | 88.00(0.67) | 26.44(1.73) |
| FedMatch-FedDPL | 43.06(3.16) | 25.90(3.12) | 9.47(0.79) |
| FedRGD-FedDPL | 64.75(1.20) | 81.24(5.36) | 17.17(0.98) |
| SemiFL-FedDPL | 68.46(3.61) | 86.77(1.79) | 27.67(0.89) |
| FedDB | 67.32(2.31) | 86.75(0.90) | 26.71(0.87) |

Table 1: Experimental results in the IID setting.
Dataset CIFAR10 SVHN CIFAR100
FedAvg 47.72(1.95) 69.44(6.21) 31.34(0.36)
FixMatch 50.99(2.49) 86.61(0.19) 25.47(0.46)
FedMatch 38.64(2.49) 26.04(4.85) 8.77(0.57)
FedRGD 51.45(2.39) 86.89(3.21) 14.83(0.34)
SemiFL 50.07(1.05) 76.11(6.3) 26.40(0.81)
FixMatch-FedDPL 53.92(3.41) 85.87(0.51) 28.47(0.13)
FedMatch-FedDPL 39.17(2.10) 27.02(3.13) 8.87(0.11)
FedRGD-FedDPL 51.57(1.67) 87.00(1.31) 19.94(0.75)
SemiFL-FedDPL 55.42(2.57) 87.61(0.91) 28.29(0.73)
FedDB 55.00(1.17) 85.99(0.49) 29.28(0.51)
Table 2: Experimental results in the Non-IID setting with $\delta=0.3$.
Figure 5: Convergence curve on CIFAR100. (a) IID, (b) Non-IID with $\delta=0.3$.

5.3 Effectiveness of DPL

As illustrated in Table 4, employing DPL results in substantial gains for FedDB. Furthermore, DPL can be regarded as a convenient plug-in that can be easily integrated into existing FSSL methods that use pseudo-labeling. As shown in Tables 1 - 3, introducing DPL to existing FSSL methods effectively enhances their performance. Figure 6 displays the accuracy of pseudo-labels during training. It indicates that DPL effectively enhances the accuracy of these pseudo-labels, which in turn benefits FSSL training. Figure 7 presents the ratio of pseudo-labeled samples in the unlabeled data. Introducing DPL does not consistently improve this ratio, because the model in FSSL is challenging to train, which makes it difficult for samples to be pseudo-labeled.
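The idea behind DPL can be illustrated with a generic prior-correction sketch. This is not the exact FedDB rule, which is defined in the method section. It assumes that APP-U is the mean softmax output over the unlabeled samples at hand, that the target prior is uniform, and that the confidence threshold is applied to the corrected probabilities. The function name and the `eps` parameter are hypothetical.

```python
import numpy as np


def debiased_pseudo_labels(probs, tau=0.95, eps=1e-12):
    """Assign pseudo-labels after dividing out an estimated label prior.

    probs: (N, K) array of softmax outputs on N unlabeled samples.
    Returns (labels, mask, app_u); mask marks the samples whose
    corrected confidence reaches the threshold tau.
    """
    probs = np.asarray(probs, dtype=float)
    # APP-U: average prediction probability, used as the biased prior estimate.
    app_u = probs.mean(axis=0)
    # Bayes-style correction toward a uniform prior (assumption), then renormalize.
    adjusted = probs / (app_u + eps)
    adjusted /= adjusted.sum(axis=1, keepdims=True)
    labels = adjusted.argmax(axis=1)
    mask = adjusted.max(axis=1) >= tau
    return labels, mask, app_u
```

Because the correction down-weights classes that the model over-predicts, a sample that is only mildly in favor of a dominant class can be relabeled to a rarer class, which is how such a rule can change both the accuracy and the ratio of pseudo-labels.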

Figure 6: Accuracy of pseudo labels on CIFAR100. (a) IID, (b) Non-IID with $\delta=0.3$.
Dataset CIFAR10 SVHN CIFAR100
FedAvg 33.53(1.9) 32.21(1.52) 28.78(0.53)
FixMatch 35.14(1.53) 74.31(2.07) 25.90(1.06)
FedMatch 31.12(2.69) 12.66(3.34) 7.50(0.99)
FedRGD 35.33(3.73) 38.20(5.64) 18.04(1.59)
SemiFL 33.72(1.87) 72.76(6.19) 25.82(0.44)
FixMatch-FedDPL 37.13(3.22) 76.29(1.00) 27.76(0.85)
FedMatch-FedDPL 32.26(2.75) 16.94(1.28) 7.66(0.43)
FedRGD-FedDPL 35.59(3.49) 38.76(2.67) 18.98(0.58)
SemiFL-FedDPL 37.84(2.33) 74.54(7.51) 27.62(1.00)
FedDB 37.95(2.21) 76.20(1.31) 27.99(1.28)
Table 3: Experimental results in the Non-IID setting with $\delta=0.1$.
Figure 7: Ratio of unlabeled samples that are finally assigned with pseudo-labels on CIFAR100. (a) IID, (b) Non-IID with $\delta=0.3$.

5.4 Effectiveness of DMA

As shown in Table 4, DMA contributes positively to FedDB in most scenarios. However, its impact differs across datasets. More specifically, DMA consistently improves results on the CIFAR10 and CIFAR100 datasets. Conversely, on the SVHN dataset, DMA can lead to a performance decline in certain scenarios. Upon detailed analysis, we ascribe this issue to the imbalanced class distribution of the SVHN dataset, which conflicts with the objective of FSSL of seeking a balanced model.
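One way to picture how APP-U can drive aggregation weights is the sketch below. It is a stand-in rather than the DMA objective itself: it assumes the weights live on the probability simplex (via a softmax parameterization) and are fitted by plain gradient descent so that the weighted average of the clients' APP-U vectors is close to uniform, using squared distance as the loss. The defaults `steps=100` and `lr=1.0` simply reuse the values of $E_{aggr}$ and $\eta_{aggr}$ from the setup; the function name is hypothetical.

```python
import numpy as np


def balanced_aggregation_weights(app_u, steps=100, lr=1.0):
    """Fit simplex weights so the weighted mean of APP-U is near uniform.

    app_u: (M, K) array, one APP-U vector per participating client.
    Loss (assumption): squared distance between the mixture and uniform.
    """
    app_u = np.asarray(app_u, dtype=float)
    num_clients, num_classes = app_u.shape
    uniform = np.full(num_classes, 1.0 / num_classes)
    logits = np.zeros(num_clients)  # start from equal weights
    for _ in range(steps):
        w = np.exp(logits - logits.max())
        w /= w.sum()
        mix = w @ app_u                          # weighted APP-U, shape (K,)
        grad_w = 2.0 * app_u @ (mix - uniform)   # gradient w.r.t. the weights
        grad_logits = w * (grad_w - w @ grad_w)  # chain rule through softmax
        logits -= lr * grad_logits
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

Under this sketch, clients whose APP-U compensates for the dominant bias of the others receive larger weights. When the true class distribution is itself imbalanced, as noted for SVHN, pulling the mixture toward uniform may work against test accuracy.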

IID DPL DMA CIFAR10 SVHN CIFAR100
- - 65.80(2.72) 87.44(1.35) 24.72(0.73)
✓ - 66.97(2.84) 88.00(0.67) 26.44(1.73)
✓ ✓ 67.32(2.31) 86.75(0.90) 26.71(0.87)
$\delta=0.3$ DPL DMA CIFAR10 SVHN CIFAR100
- - 50.99(2.49) 86.61(0.19) 25.47(0.46)
✓ - 53.92(3.41) 85.87(0.51) 28.47(0.13)
✓ ✓ 55.00(1.17) 85.99(0.49) 29.28(0.51)
$\delta=0.1$ DPL DMA CIFAR10 SVHN CIFAR100
- - 35.14(1.53) 74.31(2.07) 25.90(1.06)
✓ - 37.13(3.22) 76.29(1.00) 27.76(0.85)
✓ ✓ 37.95(2.21) 76.20(1.31) 27.99(1.28)
Table 4: Ablation studies on CIFAR10, SVHN, and CIFAR100.

6 Conclusion

In this paper, we propose FedDB to detach prior bias in FSSL with class imbalance. At the local training level, FedDB debiases pseudo-labeling using APP-U based on Bayes' theorem, encouraging more balanced training data during local training. At the global aggregation level, FedDB leverages APP-U across different clients to derive aggregation weights that debias the global model. Extensive experiments demonstrate the effectiveness of FedDB.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62372028 and 62372027.

References

  • Acar et al. [2020] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2020.
  • Bai et al. [2023] Sikai Bai, Shuaicheng Li, Weiming Zhuang, Kunlin Yang, Jun Hou, Shuai Yi, Shuai Zhang, Junyu Gao, Jie Zhang, and Song Guo. Combating data imbalances in federated semi-supervised learning with dual regulators. arXiv preprint arXiv:2307.05358, 2023.
  • Berthelot et al. [2019] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32, 2019.
  • Collins et al. [2021] Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In International conference on machine learning, pages 2089–2099. PMLR, 2021.
  • Diao et al. [2022] Enmao Diao, Jie Ding, and Vahid Tarokh. Semifl: Semi-supervised federated learning for unlabeled clients with alternate training. Advances in Neural Information Processing Systems, 35:17871–17884, 2022.
  • Guo and Li [2022] Lan-Zhe Guo and Yu-Feng Li. Class-imbalanced semi-supervised learning with adaptive thresholding. In International Conference on Machine Learning, pages 8082–8094. PMLR, 2022.
  • Hong et al. [2021] Youngkyu Hong, Seungju Han, Kwanghee Choi, Seokjun Seo, Beomsu Kim, and Buru Chang. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6626–6636, 2021.
  • Jeong et al. [2021] Wonyong Jeong, Jaehong Yoon, Eunho Yang, and Sung Ju Hwang. Federated semi-supervised learning with inter-client consistency & disjoint learning. In International Conference on Learning Representations, 2021.
  • Jiang et al. [2022] Meirui Jiang, Hongzheng Yang, Xiaoxiao Li, Quande Liu, Pheng-Ann Heng, and Qi Dou. Dynamic bank learning for semi-supervised federated image diagnosis with class imbalance. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 196–206. Springer, 2022.
  • Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Karimireddy et al. [2020] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132–5143. PMLR, 2020.
  • Lee and others [2013] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896, 2013.
  • Li et al. [2020a] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50–60, 2020.
  • Li et al. [2020b] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
  • Li et al. [2023] Ming Li, Qingli Li, and Yan Wang. Class balanced adaptive pseudo labeling for federated semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16292–16301, 2023.
  • Liang et al. [2022] Xiaoxiao Liang, Yiqun Lin, Huazhu Fu, Lei Zhu, and Xiaomeng Li. Rscfed: random sampling consensus federated semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10154–10163, 2022.
  • Liao et al. [2023] Xinting Liao, Weiming Liu, Chaochao Chen, Pengyang Zhou, Huabin Zhu, Yanchao Tan, Jun Wang, and Yue Qi. Hyperfed: hyperbolic prototypes exploration with consistent aggregation for non-iid data in federated learning. arXiv preprint arXiv:2307.14384, 2023.
  • Lin et al. [2021] Haowen Lin, Jian Lou, Li Xiong, and Cyrus Shahabi. Semifed: Semi-supervised federated learning with consistency and pseudo-labeling. arXiv preprint arXiv:2108.09412, 2021.
  • Lin [1991] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991.
  • Liu et al. [2023] Jiahao Liu, Jiang Wu, Jinyu Chen, Miao Hu, Yipeng Zhou, and Di Wu. Feddwa: personalized federated learning with dynamic weight adjustment. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 3993–4001, 2023.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Miyato et al. [2018] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
  • Reddi et al. [2021] Sashank J Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.
  • Sohn et al. [2020] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
  • Tan et al. [2022] Yue Tan, Guodong Long, Lu Liu, Tianyi Zhou, Qinghua Lu, Jing Jiang, and Chengqi Zhang. Fedproto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8432–8440, 2022.
  • Tian et al. [2020] Junjiao Tian, Yen-Cheng Liu, Nathaniel Glaser, Yen-Chang Hsu, and Zsolt Kira. Posterior re-calibration for imbalanced datasets. Advances in Neural Information Processing Systems, 33:8101–8113, 2020.
  • Wang et al. [2020] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
  • Wang et al. [2023] Yidong Wang, Hao Chen, Qiang Heng, Wenxin Hou, Yue Fan, Zhen Wu, Jindong Wang, Marios Savvides, Takahiro Shinozaki, Bhiksha Raj, Bernt Schiele, and Xing Xie. Freematch: Self-adaptive thresholding for semi-supervised learning. 2023.
  • Wei et al. [2021] Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan Yuille, and Fan Yang. Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10857–10866, 2021.
  • Zhang et al. [2021] Zhengming Zhang, Yaoqing Yang, Zhewei Yao, Yujun Yan, Joseph E Gonzalez, Kannan Ramchandran, and Michael W Mahoney. Improving semi-supervised federated learning by reducing the gradient diversity of models. In 2021 IEEE International Conference on Big Data (Big Data), pages 1214–1225. IEEE, 2021.
  • Zhao et al. [2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • Zhu et al. [2023] Guogang Zhu, Xuefeng Liu, Shaojie Tang, and Jianwei Niu. Aligning before aggregating: Enabling communication efficient cross-domain federated learning via consistent feature extraction. IEEE Transactions on Mobile Computing, 2023.