Deep learning has transformed fields such as computer vision and natural language processing, and it is beginning to transform neuroimaging as well. As deep learning models grow, distributed and collaborative training becomes essential, especially when sensitive data is spread across distant sites. Collaborative MRI data analysis offers profound insights, allowing researchers to utilize data beyond a study's original scope. Because MRI scans are often preserved, vast amounts of data accumulate across decentralized research sites. Training models on more data while preserving data privacy is thus crucial. Aggregating data from different sources on a central server for training, however, can expose this sensitive information and raise ethical concerns. Federated Learning (FL), an emerging paradigm in machine learning, aims to leverage this distributed data while maintaining privacy. It achieves this by enabling devices or organizations to train models locally and share only aggregated training updates instead of raw data.
In FL, a central server coordinates training, and client sites communicate only model parameters, keeping local data private. In the decentralized setting, the server typically does not exist and clients train a model collaboratively among themselves. However, challenges arise due to the statistical heterogeneity of the data, limited communication bandwidth, and computational costs. Various methods have been proposed to address the high communication and computational costs of federated learning. Inspired by the lottery ticket hypothesis (Frankle and Carbin, 2019), which discovered that there exist sub-networks (subsets of parameters within the larger complete neural network) that can be trained in isolation to nearly full accuracy, many methods train and update only a sub-network at the client sites (Dai et al., 2022; Huang H. et al., 2022). However, finding these sub-networks with the traditional method (Frankle and Carbin, 2019) is extremely computationally intensive, and FL methods that rely on it (Huang H. et al., 2022) share the same issue. Initiating the federated training process from a random sub-network and updating the network in later work (Dai et al., 2022) brought the benefits of both computational and communication efficiency, but at the cost of performance due to starting the FL training process with random sub-networks. In this work we aim to solve this issue of starting from random sub-networks for the sparse FL process, targeted toward neuroimaging data.
We introduce Sparse Federated Learning for NeuroImaging, or NeuroSFL, a communication-efficient federated learning method that identifies salient sub-networks at each client site and trains sparse local models, greatly reducing the required communication bandwidth. A notable difference of our method in contrast to competing methods such as DisPFL (Dai et al., 2022) is that NeuroSFL enjoys the benefits of sparse models at local sites, such as faster inference (Dey et al., 2019), on top of the communication efficiency of sparse communication methods (Vahidian et al., 2021; Dai et al., 2022; Isik et al., 2022).
1.1 Contributions
NeuroSFL is a sparse federated learning method that discovers a common sub-network from the available data distributed across local sites and trains sparse local models leveraging the distributed data. Our key contributions are as follows:
1. We introduce NeuroSFL, a communication-efficient federated learning approach geared toward training on distributed neuroimaging data at different client sites.
2. Our method identifies a global common sub-network at initialization and keeps this sub-network static throughout the federated learning process. Consequently, it needs to share the sub-network mask only once before training begins, and never again, significantly reducing the communication overhead during training.
3. NeuroSFL does not need to share dense model parameters or masks during the training phase, as it starts from a common initialization and only transmits sparse parameters each communication round, depending on the chosen sparsity level.
4. We validate our method on a neuroimaging task and demonstrate its efficacy compared to competing methods.
5. Finally, unlike most competing methods, we also deploy and evaluate NeuroSFL in COINSTAC (Plis et al., 2016), a real-world federated learning framework for training neuroimaging models, and report the resulting wall-clock time speedup.
2 Background and related works
In this section, we provide the necessary background for this work by introducing the federated learning problem in Section 2.1 and the federated optimization problem in Section 2.2. We then discuss the related works in Section 2.3.
2.1 Federated learning
Federated Learning (FL) (McMahan et al., 2017) represents a novel approach in machine learning, facilitating model training across numerous decentralized devices or servers that hold local data samples without needing to exchange them. This contrasts sharply with traditional distributed learning methods, which centralize data and distribute computations. FL prioritizes privacy preservation, efficient communication, and resilience in diverse, heterogeneous environments. It diverges from conventional distributed learning paradigms due to its distinct characteristics, some of which we detail below:
1. Non-IID data: the training data across different clients are not identically distributed, which means that the data at each local site may not accurately represent the overall population distribution.
2. Unbalanced data: the amount of data varies significantly across clients, leading to imbalances in data representation.
3. Massive distribution: often, the number of clients exceeds the average number of samples per client, illustrating the scale of distribution.
4. Limited communication: communication is infrequent, either among clients in a decentralized setting or between clients and the server in a centralized setting, due to slow and expensive connections.
5. Heterogeneous devices: clients in FL may have diverse computational capabilities, ranging from powerful servers to resource-constrained mobile devices.
6. Privacy preservation: FL is designed to ensure that raw data never leaves the clients' devices, preserving user privacy. Instead of sharing data, only model updates are shared. More sophisticated techniques have since been proposed both to break the privacy guarantees of vanilla FL (Geiping et al., 2020) and to strengthen them (Zhang et al., 2023).
7. Local training: each client performs local training on its own data and only shares updates (e.g., weights or gradients) with the central server, which then aggregates these updates to improve the global model.
8. Client availability: clients may be intermittently available due to power constraints, connectivity issues, or user activities, requiring the system to be robust to varying participation.
9. Scalability: FL frameworks are designed to handle a large number of clients, scaling from hundreds to potentially millions of devices.
One of the main focuses of this work is to reduce the communication costs between the server and clients in a centralized setting, or among clients in a decentralized setting, when dealing with non-IID and unbalanced data. This is achieved by identifying a sub-network based on the data distributions at each local site and transmitting only the parameters of this sub-network in each communication round. In each round, a fixed-size subset of K̃ clients is sampled from all K clients, and federated training continues on the selected sub-networks of those clients. The general federated optimization problem is detailed next.
2.2 Federated optimization problem
In the general federated learning (FL) setting, a central server tries to find a global statistical model by periodically communicating with a set of clients. The federated averaging algorithm proposed by Konečnỳ et al. (2016), McMahan et al. (2017), and Bonawitz et al. (2019) is applicable to any finite-sum objective of the form
$$\min_{\theta \in \mathbb{R}^d} f(\theta), \quad \text{where} \quad f(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta). \tag{1}$$

In a typical machine learning problem, the objective function $f_i(\theta) = \ell(x_i, y_i; \theta)$ is encountered, where the $i$th term in the sum is the loss of the network prediction on a sample $(x_i, y_i)$ made by a model with parameters $\theta$. We assume that the data is partitioned over a total of $K$ clients, with $\mathcal{P}_k$ denoting the set of indices of the samples on client $k$, and $n_k = |\mathcal{P}_k|$. The total number of samples is $n = \sum_{k=1}^{K} n_k$. Thus, the objective in Equation 1 can be re-written as in Equation 2:
$$f(\theta) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta), \quad \text{where} \quad F_k(\theta) = \frac{1}{n_k}\sum_{i \in \mathcal{P}_k} f_i(\theta). \tag{2}$$

In the typical distributed optimization setting, the IID assumption is made: if the partition $\mathcal{P}_k$ were created by distributing the training data over the set of clients uniformly at random, then we would have $\mathbb{E}_{\mathcal{P}_k}[F_k(\theta)] = f(\theta)$, where the expectation is over the set of examples assigned to a fixed client $k$. In this work, we consider the non-IID setting where this does not hold and $F_k$ could be an arbitrarily bad approximation to $f$.
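To make the weighting in Equation 2 concrete, the following is a minimal sketch of the server-side aggregation step used by federated averaging, written in PyTorch-style Python. The function name `fedavg_aggregate` and the state-dict representation of client models are our own illustrative choices, not part of any specific library.

```python
import copy

def fedavg_aggregate(client_states, client_sizes):
    """Compute the weighted average sum_k (n_k / n) * theta_k, mirroring Equation 2.

    client_states: list of model state_dicts, one per participating client k
    client_sizes:  list of n_k, the number of local samples at client k
    """
    n = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for name in global_state:
        # weight each client's parameters by its share of the total data
        global_state[name] = sum(
            (n_k / n) * state[name] for state, n_k in zip(client_states, client_sizes)
        )
    return global_state
```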
When designing an FL training paradigm, a set of core considerations must be made to maintain data privacy and to address statistical (objective) heterogeneity arising from differences in client data and resource constraints at the client sites. A range of works tries to address the issue of heterogeneous non-IID data (McMahan et al., 2016; Kulkarni et al., 2020); however, some research also suggests that a deterioration in accuracy in the non-IID FL setting is almost inevitable (Zhao et al., 2018).
2.3 Related works
In this section, we discuss the relevant literature in relation to NeuroSFL. First, in Section 2.3.1, we describe the role of federated learning in neuroimaging and discuss the relevant literature. Second, in Section 2.3.2, we introduce key works on model pruning and sparsity in deep learning, whose findings we leverage to formulate NeuroSFL. Third, in Section 2.3.3, we describe applications of model pruning and sparsity for efficient FL. Finally, in Section 2.3.4, we briefly discuss privacy in the FL setting.
2.3.1 Federated learning in neuroimaging
Over the past decade, the field of neuroimaging has strongly embraced data sharing, open-source software, and collaboration across multiple sites. This shift is largely driven by the need to offset the high costs and time demands associated with neuroimaging data collection (Landis et al., 2016; Rootes-Murdy et al., 2022). By pooling data from different sources, researchers can explore findings that extend beyond the initial scope of individual studies (Poldrack et al., 2013). The practice of sharing data enhances the robustness of research through larger sample sizes and the replication of results, offering significant benefits for neuroimaging studies. Even though data pooling and sharing are embraced, there are significant challenges related to data privacy, security, and governance that limit the extent to which data can be shared. This is where FL becomes crucial, as it enables collaborative model training across multiple institutions without the need to directly share sensitive data. Moreover, in FL collaborative training, sample size also plays a crucial role: increasing the sample size not only makes predictions more reliable but also ensures the reliability and validity of research findings, thereby discouraging data manipulation and fabrication (Tenopir et al., 2011; Ming et al., 2017). Furthermore, aggregating data can lead to a more diverse sample by combining otherwise similar datasets, thus reflecting a broader range of social health determinants for more comprehensive results (Laird, 2021). Additionally, reusing data can significantly reduce research costs (Milham et al., 2018).
FL is increasingly recognized as a transformative approach in healthcare and neuroimaging. In the realm of biomedical imaging, FL has been applied to a variety of tasks. These include whole-brain segmentation from T1-weighted MRI scans (Roy et al., 2019), segmentation of brain tumors (Li et al., 2019; Sheller et al., 2019), multi-site fMRI classification, and the identification of disease biomarkers (Li X. et al., 2020). COINSTAC (Plis et al., 2016) offers a privacy-focused distributed data processing framework specifically designed for brain imaging, showcasing FL's role in enhancing privacy and efficiency in healthcare data analysis. Additionally, FL has been utilized to discover brain structural relationships across various diseases and clinical cohorts through federated dimensionality reduction from shape features (Silva et al., 2019).
2.3.2 Role of model pruning in reducing computational demands
The primary objective of model pruning is to identify sub-networks within larger architectures by selectively removing connections. This technique holds considerable appeal for various reasons, particularly for real-time applications on resource-constrained edge devices, which are prevalent in federated learning (FL) and collaborative learning scenarios. Pruning large networks can significantly alleviate the computational demands of inference (Elsen et al., 2020), especially on hardware tailored to exploit sparsity (Cerebras, 2019; Pool et al., 2021). More recently, the lottery ticket hypothesis (Frankle and Carbin, 2019) suggested the existence of sub-networks within densely connected networks that, when trained independently from scratch, can attain accuracy comparable to fully trained dense networks, revitalizing the field of sparse deep learning (Chen et al., 2020; Renda et al., 2020). This resurgence of interest has also extended into sparse reinforcement learning (RL) (Arnob et al., 2021; Sokar et al., 2021). Pruning techniques in deep learning can broadly be categorized into three groups: methods that induce sparsity before training, at initialization (Lee et al., 2018; Tanaka et al., 2020; Wang et al., 2020; Ohib et al., 2023), during training (Zhu and Gupta, 2018; Ma et al., 2019; Yang et al., 2019; Ohib et al., 2022), and post-training (Han et al., 2015; Frankle et al., 2021). In this work, we leverage findings from methods that induce sparsity at initialization, specifically parameter saliency metrics, to formulate NeuroSFL.
2.3.3 Efficiency in federated learning
For pruning in the FL setting, using a lottery-ticket-like approach would result in immense communication inefficiency. Such methods (Frankle and Carbin, 2019; Bibikar et al., 2022) usually require costly pruning and retraining cycles, often training and pruning multiple times to achieve the desired accuracy vs. sparsity trade-off. Relatively few works have leveraged pruning in the FL paradigm (Li A. et al., 2020, 2021; Jiang et al., 2022). In particular, with LotteryFL (Li A. et al., 2020) and PruneFL (Jiang et al., 2022), clients need to send the full model to the server regularly, resulting in higher bandwidth usage. Moreover, in Li A. et al. (2020), each client trains a personalized mask to maximize performance only on its local data. A few recent works (Li A. et al., 2020; Bibikar et al., 2022; Huang T. et al., 2022; Qiu et al., 2022) have also attempted to leverage sparse training within the FL setting. In particular, Li A. et al. (2020) implemented a randomly initialized sparse mask; FedDST (Bibikar et al., 2022) built on the idea of RigL (Evci et al., 2020), a prune and re-grow technique, and mostly focused on magnitude pruning on the server side, resulting in similar constraints; and Ohib et al. (2023) used sparse gradients to train efficiently in a federated learning setting. In this work, we aim to alleviate these limitations, which we discuss in the following section.
2.3.4 Privacy in federated learning
Even without sharing raw data, FL can still be vulnerable to privacy attacks such as gradient inversion attacks (Geiping et al., 2020), which can sometimes compromise privacy. Traditional FL algorithms, like federated stochastic gradient descent, are particularly susceptible to these attacks, although methods like Federated Averaging (FedAvg) (McMahan et al., 2017) mitigate this vulnerability to some extent (Geiping et al., 2020; Dimitrov et al., 2022).
Recent research has explored various privacy-preserving techniques in FL. Differential privacy has been proposed to add noise to the model updates to provide strong privacy guarantees (Abadi et al., 2016). Secure aggregation methods ensure that aggregated updates are protected against eavesdropping and manipulation during transmission (Bonawitz et al., 2017). Furthermore, advancements in cryptographic techniques, such as homomorphic encryption and secure multiparty computation, offer promising solutions for preserving privacy in FL settings (Mohassel and Zhang, 2017; Juvekar et al., 2018).
These approaches aim to enhance the robustness of Federated Learning against privacy threats while enabling collaborative model training across distributed data sources. In this work, we primarily focus on improving communication efficiency in FL systems. Although we do not explicitly address privacy, our method can be used in conjunction with other privacy-preservation techniques.
3 Method description
In this section we present our proposed method. We first describe the process of discovering a sub-network $f(\theta \odot m)$ within the full network $f(\theta)$, where $\theta \in \mathbb{R}^d$ are the model parameters and $m \in \{0, 1\}^d$ is a mask with $\|m\|_0 < d$. To discover a performant sub-network, an importance scoring metric is required, which we describe in Section 3.1.1. Finally, we delineate our proposed method in Section 3.2.
3.1 Sub-network discovery
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ at a site $k$, the training of a neural network $f$ parameterized by $\theta \in \mathbb{R}^d$ can be written as minimizing the following empirical risk, as in Equation 3:
$$\underset{\theta}{\arg\min} \; \frac{1}{n}\sum_{i} \mathcal{L}\big(f(\theta; x_i), y_i\big) \quad \text{s.t.} \quad \theta \in \mathcal{H} \tag{3}$$

where $\theta \in \mathbb{R}^d$, and $\mathcal{L}$ and $\mathcal{H}$ are the loss function and the constraint set, respectively.
In general, in unconstrained (standard) training the set of possible hypotheses is $\mathcal{H} = \mathbb{R}^d$, where $d$ is the model dimension. The objective is to minimize the empirical risk $\mathcal{L}$ given a training set $\{(x_i, y_i)\}_{i=1}^{n} \sim \mathcal{D}$ at the local client site $k$. Given access to the gradients of the empirical risk on a batch-wise basis, an optimization algorithm such as Stochastic Gradient Descent (SGD) is typically employed to achieve this objective. This process generates a series of parameter estimates $\{\theta_i\}_{i=0}^{T}$, where $\theta_0$ represents the initial parameters and $\theta_T$ the final optimal parameters. A sub-network within this network is defined as a sparse version of the network with a mask $m \in \{0, 1\}^{|\theta|}$ that results in a masked network $f(\theta \odot m; x_i)$. When aiming for a target sparsity level of $k$ non-zero parameters, where $k < d$, the parameter pruning challenge entails ensuring that the final optimal parameters $\theta_T$ have at most $k$ non-zero elements, as expressed by the constraint $\|\theta_T\|_0 \leq k$. In many works, this sparsity constraint applies only to the final parameters and not to any intermediate parameter estimates. However, in this work we maintain the sparsity constraint throughout the entire training phase, that is, throughout the entire evolution of $\theta$ from $\theta_0$ to $\theta_T$.
The goal of discovering sub-networks at initialization introduces additional constraints to the previously described framework by requiring that all parameter iterates fall within a predetermined subspace of $\mathcal{H}$. Specifically, the constraints seek to identify an initial set of parameters $\theta_0$ that has no more than $k_1$ non-zero elements ($\|\theta_0\|_0 \leq k_1$), and ensure that all intermediate parameter sets $\theta_i$ belong to a subspace $\bar{\mathcal{H}} \subset \mathcal{H}$ for all $i \in \{1, \ldots, T\}$, where $\bar{\mathcal{H}}$ is the subspace of $\mathbb{R}^d$ spanned by the natural basis vectors $e_j$, $j \in \text{supp}(\theta_0)$. Here, $\text{supp}(\theta_0)$ represents the support of $\theta_0$, i.e., the set of indices corresponding to its non-zero entries. This approach not only specifies a sub-network at initialization with at most $k_1$ parameters but also maintains its structure consistently throughout training.
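As a rough illustration of maintaining the sparsity constraint throughout training (not only at $\theta_T$), the PyTorch sketch below masks both the gradients and the parameters at every optimization step so that pruned entries stay exactly zero along the whole trajectory. The per-parameter mask list, training loop structure, and function name are our own simplifications, not the exact training loop used in our experiments.

```python
import torch

def train_with_fixed_mask(model, masks, loader, loss_fn, lr=0.01, epochs=1):
    """Train only the weights selected by `masks`; all other entries remain
    zero for every iterate theta_0, ..., theta_T."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            with torch.no_grad():
                # project the gradient onto the support of the mask
                for p, m in zip(model.parameters(), masks):
                    if p.grad is not None:
                        p.grad.mul_(m)
            opt.step()
            with torch.no_grad():
                # keep pruned weights exactly at zero after the update
                for p, m in zip(model.parameters(), masks):
                    p.mul_(m)
    return model
```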
3.1.1 Connection importance criterion
Lee et al. (2018) introduced a technique for estimating the importance of a connection in a deep learning network, inspired by the saliency criterion originally proposed by Mozer and Smolensky (1988). They contributed an important insight, demonstrating that this criterion is remarkably effective in predicting the significance of each connection in a neural network at the initialization phase. The core concept revolves around retaining those parameters that, when altered, would have the most substantial effect on the loss function. This is operationalized by considering a binary vector $c \in \{0, 1\}^m$ (with $m$ the number of parameters) and utilizing the Hadamard product $\odot$. Consequently, SNIP calculates the sensitivity of connections as follows:
$$s(\theta; \mathcal{D}) := \left.\frac{\partial \mathcal{L}(\theta \odot c)}{\partial c}\right|_{c=\mathbf{1}} = \frac{\partial \mathcal{L}(\theta)}{\partial \theta} \odot \theta \tag{4}$$

After determining $s(\theta)$, the parameters associated with the highest $k$ magnitudes of $|s(\theta; \mathcal{D})_i|$ are retained, where $i$ corresponds to the indices of the selected parameters. Essentially, SNIP calculates the importance score of each parameter as its product with the incoming gradient. It prioritizes weights that, regardless of their direction, are distant from the origin and yield large gradient values. It is noteworthy that the objective of SNIP can be reformulated, as noted by De Jorge et al. (2020) and Frankle et al. (2021):
$$\max_{c} \; S(\theta, c) := \sum_{i \in \text{supp}(c)} \left|\theta_i \, \nabla \mathcal{L}(\theta)_i\right| \quad \text{s.t.} \quad c \in \{0, 1\}^m, \; \|c\|_0 = q, \tag{5}$$

where $S$ denotes the saliency scores. It is straightforward to note that the optimal solution to the above problem is obtained by selecting the indices corresponding to the top-$q$ values of $s_i = |\theta_i \nabla \mathcal{L}(\theta)_i|$.
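A minimal PyTorch sketch of this saliency computation is shown below, assuming that a single mini-batch from the local data suffices for the estimate (as in Lee et al., 2018); the helper name `snip_saliency` and the flattened output format are our own conventions.

```python
import torch

def snip_saliency(model, loss_fn, batch):
    """Return a flat vector of per-parameter scores s_i = |theta_i * dL/dtheta_i|."""
    x, y = batch
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([
        (p * p.grad).abs().detach().flatten() if p.grad is not None
        else torch.zeros(p.numel())
        for p in model.parameters()
    ])
```

Keeping the indices of the top-$q$ entries of this vector recovers the optimal solution of Equation 5.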
3.1.2 Iterative connection importance criterion
In this section, we test the effectiveness of iterative-SNIP (De Jorge et al., 2020), an iterative application of the saliency criterion in Equation 4, which we briefly describe next. Let $q$ be the number of parameters to be preserved after pruning. We assume a pruning schedule (analogous to a learning rate schedule: linear, exponential, etc.) that divides $q$ into a sequence of natural numbers $\{q_t\}_{t=1}^{T}$ such that $q_t > q_{t+1}$ and $q_T = q$. Now, given the binary masking variable $c_t$ corresponding to $q_t$, pruning from $q_t$ to $q_{t+1}$ can be formulated using the connection sensitivity (Equation 4), similar to De Jorge et al. (2020), as:
$$c_{t+1} = \underset{c}{\arg\max} \; S(\bar{\theta}, c) \quad \text{s.t.} \quad c \in \{0, 1\}^m, \; \|c\|_0 = q_{t+1}, \; c \odot c_t = c, \tag{6}$$

where $\bar{\theta} = \theta \odot c_t$. The constraint $c \odot c_t = c$ ensures that no previously pruned parameter is re-activated. Assuming that the pruning schedule ensures a smooth transition from one topology to another ($\|c_t\|_0 \approx \|c_{t+1}\|_0$), such that the gradient approximation $\left.\frac{\partial \mathcal{L}(\bar{\theta})}{\partial \bar{\theta}}\right|_{c_t} \approx \left.\frac{\partial \mathcal{L}(\bar{\theta})}{\partial \bar{\theta}}\right|_{c_{t+1}}$ is valid, Equation 6 can be approximated by solving Equation 5 at $\bar{\theta}$. When the schedule parameter is set to $T = 1$, the original SNIP saliency method is recovered. This essentially employs a gradient approximation between the initial dense network $c_0 = \mathbf{1}$ and the resulting mask $c$. We conduct experiments with iterative-SNIP in the federated neuroimaging setting and present our findings in Section 5.2.
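For illustration, a rough sketch of this iterative procedure is given below. It recomputes the saliency on the progressively masked network under an exponential schedule; the particular schedule, the in-place masking of the model, and the helper names are our own assumptions rather than a reference implementation of De Jorge et al. (2020).

```python
import torch

def _flat_saliency(model, loss_fn, batch):
    """Flat vector of |theta_i * grad_i| over all parameters of the model."""
    x, y = batch
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([
        (p * p.grad).abs().detach().flatten() if p.grad is not None
        else torch.zeros(p.numel())
        for p in model.parameters()
    ])

def _apply_flat_mask(model, keep):
    """Zero out, in place, every parameter entry outside the kept support."""
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(keep[i:i + p.numel()].view_as(p).float())
            i += p.numel()

def iterative_snip_mask(model, loss_fn, batch, keep_fraction, steps=5):
    """Shrink the kept set q_1 > ... > q_T = keep_fraction * d, never
    re-activating a previously pruned connection (c ⊙ c_t = c)."""
    d = sum(p.numel() for p in model.parameters())
    keep = torch.ones(d, dtype=torch.bool)                 # c_0 = 1 (dense network)
    for t in range(1, steps + 1):
        _apply_flat_mask(model, keep)                      # theta_bar = theta ⊙ c_t
        scores = _flat_saliency(model, loss_fn, batch)
        scores[~keep] = -1.0                               # pruned entries can never be re-selected
        q_t = int(d * keep_fraction ** (t / steps))        # exponential pruning schedule
        new_keep = torch.zeros_like(keep)
        new_keep[torch.topk(scores, q_t).indices] = True
        keep = new_keep
    return keep
```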
3.2 Proposed method
We propose a novel method for efficient distributed sub-network discovery on distributed neuroimaging data, together with a communication-efficient procedure for training the resulting sparse models, which we call Sparse Federated Learning for NeuroImaging, or NeuroSFL. The goal is to tackle communication inefficiency in decentralized federated learning with non-IID distributions of neuroimaging data. The proposed method initiates with a common initialization $\theta_0$ at all the local client models. Next, importance scores $s_j$ are calculated for each model parameter in the network based on information from the imaging data available across all the client sites. At this stage, each client has a unique set of importance scores for the parameters of its local network $f$, based on the local data available at that site, similar to Lee et al. (2018) and De Jorge et al. (2020). As shown in Equation 7, all the clients transmit these scores to each other, and a mask $m$ is created corresponding to the top-$q\%$ of the aggregated saliency scores:
$$m = T_q\!\left(\sum_{k=0}^{K-1} s_k\right) \tag{7}$$

where $T_q$ is the top-$q$ operator that retains the top $q$ percent of the $s_k$ values by magnitude and sets the rest to zero. This mask is then used to train the model $f_k(\theta \odot m; x)$ at site $k$ on its local data $(x, y) \sim \mathcal{D}_k$.
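A minimal sketch of this mask construction is shown below, assuming each site has already computed a flat saliency vector $s_k$ (for example with a routine like the SNIP sketch in Section 3.1.1); the function name and flattening convention are our own.

```python
import torch

def global_topq_mask(client_saliencies, q_percent):
    """m = T_q(sum_k s_k): keep the top q% of the aggregated saliency scores.

    client_saliencies: list of flat saliency tensors, one per client site
    q_percent:         percentage of parameters to keep (e.g., 10 for 10%)
    """
    aggregated = torch.stack(client_saliencies).sum(dim=0)
    num_keep = max(1, int(aggregated.numel() * q_percent / 100.0))
    mask = torch.zeros_like(aggregated)
    mask[torch.topk(aggregated, num_keep).indices] = 1.0
    return mask
```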
For federated training among a total of $K$ clients, the clients are trained locally, and at the end of local training they share their trained parameters, which are then averaged; we call this a communication round. At the start of this local training, each site $k$ starts with the same initial model weights $\theta_0$, denoted $\theta_{k,0}$ at training step $t = 0$, which are then masked with the generated saliency mask $m$ to produce the common masked initialization $\theta_{k,0}^{m} = \theta_{k,0} \odot m$.
Next, these models at each site $k$ are trained on their local data $(x, y) \sim \mathcal{D}_k$.
The masked models $f(\theta_{k,0} \odot m)$ across all the sites are trained for a total of $T$ communication rounds to arrive at the final weights $\theta_{k,T}$ at each local site. In each communication round $t$, only a random subset $F' \subseteq F$ of $K'$ clients, where $F$ is the set of all clients and $K' \leq K$, is trained on local data. These $K'$ clients are sampled uniformly at random without replacement within a given round, but with replacement across rounds. We sample a subset of clients uniformly instead of including all clients in a single communication round because previous work has shown that this is computationally more efficient and that including more clients in a single round leads to diminishing returns (McMahan et al., 2016). This approach is also standard practice in the federated learning (FL) literature (Yang et al., 2018; Reddi et al., 2020; Sun et al., 2020; Dai et al., 2022). Since each client has an equal probability of being chosen to participate in a given communication round, over the course of enough communication rounds all clients will eventually participate. In this work, we train our FL pipeline for a total of $T = 500$ communication rounds, similar to Dai et al. (2022).
At the end of local training on the random subset $F'$, the updated weights of the selected clients are aggregated to obtain the new parameters $\hat{\theta}_{k,t}^{m}$, which serve as the starting weights for the next communication round. When sharing the updated weights, only the weights corresponding to the 1's in the binary mask $m$ are shared among the clients and with the server, as only these weights are being trained while the rest are zeroed out. This is what yields the gains in communication efficiency. To share the model weights efficiently, the clients transmit only their sparse masked weights $\theta_{F'}^{m} = \theta_{F'} \odot m$ among the selected clients in $F'$, using compressed sparse row (CSR) encoding. The training process is delineated in Algorithm 1.
Algorithm 1. NeuroSFL.
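As a rough illustration of the sparse weight exchange described above, the snippet below packs a masked flat weight vector into SciPy's CSR format before transmission and restores it at the receiving site; the actual serialization and transport in COINSTAC are outside the scope of this sketch, and the helper names are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pack_sparse_weights(theta, mask):
    """Encode theta ⊙ m as a 1 x d CSR matrix so that only the non-zero
    values and their column indices need to be transmitted."""
    masked = np.asarray(theta) * np.asarray(mask)
    return csr_matrix(masked.reshape(1, -1))

def unpack_sparse_weights(packed, shape):
    """Recover the dense (mostly zero) weight vector on the receiving side."""
    return packed.toarray().reshape(shape)

# Example with a 10-dimensional model kept at 30% density
theta = np.random.randn(10)
mask = np.zeros(10)
mask[[1, 4, 7]] = 1.0
payload = pack_sparse_weights(theta, mask)       # only 3 values + their indices are sent
restored = unpack_sparse_weights(payload, theta.shape)
```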
4 Experiments
4.1 Dataset and non-IID partition
We evaluated NeuroSFL on the ABCD dataset. The ABCD study is the largest long-term study of brain development and child health in the US. It recruited over 10,000 children aged 9–10 years from 21 sites and followed them for 10 years with annual behavioral and cognitive assessments and biannual MRI scans (Garavan et al., 2018). Along with multi-session brain MRI scans of structure and function, the ABCD study also includes key demographic information including gender, racial information, socio-economic background, cognitive development, and mental and physical health assessments of the subjects. The ABCD open-source dataset is available from the National Institute of Mental Health Data Archive (NDA) (https://nda.nih.gov/). In this study, we used data from the ABCD baseline, which contains 11,875 participants aged 9–10 years.
T1-weighted MRI images were preprocessed using the Statistical Parametric Mapping 12 (SPM12) software toolbox for registration, normalization, and tissue segmentation. The gray matter density maps were then smoothed with a 6 mm³ Gaussian kernel, creating images with dimensions of (121, 145, 121) voxels in Montreal Neurological Institute (MNI) space, with each voxel measuring 1.5 × 1.5 × 1.5 mm³.
We simulated heterogeneous data distributions across federated clients by adopting two distinct data partitioning strategies. We outline these strategies for generating non-IID data partitions, with a comprehensive discussion, in Section 4.1.1.
4.1.1 Generating non-IID data partitions with the Dirichlet distribution
In contrast to centralized data-center training, where data batches are often independent and identically distributed (IID), federated learning typically deals with non-IID data distributions across different clients. Hence, to evaluate novel federated learning methods, it is crucial not to make the IID assumption, and instead to generate non-IID data among clients for evaluation so as to better reflect real-world settings (Hsu et al., 2019). In this section, we discuss the process of generating non-identical data distributions at the client sites using the Dirichlet distribution, specifically in the context of federated learning.
4.1.1.1 Generating non-IID data from the Dirichlet distribution
In this study, we assume that each client independently chooses training samples. These samples are classified into $N$ distinct classes, with the distribution of class labels governed by a probability vector $q$ that is non-negative and whose components sum to 1, that is, $q_i \geq 0$ for $i \in [1, N]$ and $\|q\|_1 = 1$. To generate a group of non-identical clients, $q \sim \text{Dir}(\alpha p)$ is drawn from the Dirichlet distribution, where $p$ characterizes a prior distribution over the $N$ classes and $\alpha$, known as the concentration parameter, controls the degree of identicality among the clients.
In this section, for exposition, we generate a range of client data partitions from the Dirichlet distribution with various values of the concentration parameter $\alpha$. In Figure 1, we generate a group of 10 balanced clients, each holding an equal number of total samples. Similar to Hsu et al. (2019), the prior distribution $p$ is assumed to be uniform across all classes. For each client, given a concentration parameter $\alpha$, we sample a $q \sim \text{Dir}(\alpha p)$ and allocate the corresponding fraction of samples from each class to that client. Figure 1 illustrates the effect of the concentration parameter $\alpha$ on the class distributions of different clients drawn from the Dirichlet distribution, for the CIFAR-10 dataset. When $\alpha \to \infty$, identical class distributions are assigned to every client. With decreasing $\alpha$, more non-identicalness is introduced into the class distributions among the client population. At the other extreme, with $\alpha \to 0$, each client holds samples from only one class. To create a more realistic FL scenario, we used $\alpha = 0.3$ for all of our experiments.
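A minimal NumPy sketch of this partitioning procedure is shown below. Following a common implementation of the scheme of Hsu et al. (2019), it draws, for each class, a Dirichlet allocation of that class's samples over clients; the function name, seeding, and rounding of split points are our own simplifications.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients so that each class is split across
    clients according to proportions drawn from Dir(alpha * 1)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, chunk in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(chunk.tolist())
    return client_indices

# e.g., 10 clients at alpha = 0.3, the value used in our experiments
partitions = dirichlet_partition(np.random.randint(0, 10, size=5000),
                                 num_clients=10, alpha=0.3)
```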
Figure 1. Generating non-identical client data partitions using the Dirichlet distribution for the CIFAR-10 dataset.