Heterogeneous Datasets for Federated Survival Analysis Simulation
This repo contains three algorithms for constructing realistic federated datasets for survival analysis. Each algorithm starts from an existing non-federated dataset and assigns each sample to a specific client in the federation. The algorithms are:
uniform_split: assigns each sample to a random client with uniform probability;
quantity_skewed_split: assigns each sample to a random client according to the Dirichlet distribution [3, 4];
label_skewed_split: assigns each sample to a time bin, then assigns a set of samples from each bin to the clients according to the Dirichlet distribution [3, 4].
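The quantity-skewed strategy can be illustrated with a short numpy sketch. This is a simplified stand-in for the repository's quantity_skewed_split, not its actual implementation; the function name dirichlet_quantity_split is ours:

```python
import numpy as np

def dirichlet_quantity_split(X, y, num_clients=4, alpha=0.5, seed=0):
    """Assign each sample to a client with probabilities drawn once from a
    Dirichlet(alpha, ..., alpha) prior; smaller alpha yields more skewed
    client sizes. Illustrative sketch, not the repository implementation."""
    rng = np.random.default_rng(seed)
    client_probs = rng.dirichlet(alpha * np.ones(num_clients))
    assignments = rng.choice(num_clients, size=len(X), p=client_probs)
    return [(X[assignments == c], y[assignments == c]) for c in range(num_clients)]

# toy data: 100 samples with 5 features each
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = np.arange(100)
clients = dirichlet_quantity_split(X, y, num_clients=4, alpha=0.5)
print([len(X_c) for X_c, _ in clients])  # client sizes are skewed but sum to 100
```

Lower alpha concentrates samples on a few clients; as alpha grows, the split approaches the uniform one.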
For more information, please take a look at our paper at https://arxiv.org/abs/2301.12166 [1].
Content
federated_survival_datasets.zip: the content of the repository at https://github.com/archettialberto/federated_survival_datasets
Heterogheneous_Datasets_for_Federated_Survival_Analysis_Simulation.pdf: the conference paper describing the work.
Installation
Federated Survival Datasets is built on top of numpy and scikit-learn. To install these libraries, run pip install -r requirements.txt. To import survival datasets into your project, we strongly recommend SurvSet (https://github.com/ErikinBC/SurvSet) [2], a comprehensive collection of more than 70 survival datasets.
Usage
import numpy as np
import pandas as pd
from federated_survival_datasets import label_skewed_split
# import a survival dataset and extract the input array X and the output array y
df = pd.read_csv("metabric.csv")
X = df[[f"x{i}" for i in range(9)]].to_numpy()
y = np.array([(e, t) for e, t in zip(df["event"], df["time"])], dtype=[("event", bool), ("time", float)])
# run the splitting algorithm
client_data = label_skewed_split(num_clients=8, X=X, y=y)
# check the number of samples assigned to each client
for i, (X_c, y_c) in enumerate(client_data):
    print(f"Client {i} - X: {X_c.shape}, y: {y_c.shape}")
We provide an example notebook in the zipped folder to illustrate the proposed algorithms. It requires scikit-survival, seaborn, and pandas.
References
[1] Archetti, A., Lomurno, E., Lattari, F., Martin, A., & Matteucci, M. (2023). Heterogeneous Datasets for Federated Survival Analysis Simulation. arXiv preprint arXiv:2301.12166.
[2] Drysdale, E. (2022). SurvSet: An open-source time-to-event dataset repository. arXiv preprint arXiv:2203.03094.
[3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
[4] Li, Q., Diao, Y., Chen, Q., & He, B. (2022, May). Federated learning on non-iid data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE) (pp. 965-978). IEEE
A Comparative Evaluation of FedAvg and Per-FedAvg Algorithms for Dirichlet Distributed Heterogeneous Data
In this paper, we investigate Federated Learning (FL), a machine learning paradigm that allows decentralized model training on devices without sharing raw data, thereby preserving data privacy. In particular, we compare two strategies within this paradigm: Federated Averaging (FedAvg) and Personalized Federated Averaging (Per-FedAvg), focusing on their performance with Non-Identically and Independently Distributed (Non-IID) data. Our analysis shows that the level of data heterogeneity, modeled using a Dirichlet distribution, significantly affects the performance of both strategies, with Per-FedAvg showing superior robustness under high heterogeneity. Our results provide insights into the development of more effective and efficient machine learning strategies in a decentralized setting.
Comment: 6 pages, 5 figures, conference
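For context, the FedAvg server update compared in this abstract is a data-size-weighted average of client model parameters (McMahan et al.). The sketch below, with hypothetical names and plain numpy arrays standing in for model weights, shows that single aggregation step:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg server step: average client parameter vectors, each weighted
    by its local dataset size. Illustrative sketch only."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# three clients with 1-D parameter vectors and unequal local dataset sizes
weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]
global_w = fedavg_aggregate(weights, sizes)
print(global_w)  # [4. 5.]
```

Per-FedAvg keeps the same aggregation but optimizes a meta-learning objective locally, so each client can personalize the global model with a few gradient steps.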
Personalized Federated Learning on Long-Tailed Data via Adversarial Feature Augmentation
Personalized Federated Learning (PFL) aims to learn personalized models for
each client based on the knowledge across all clients in a privacy-preserving
manner. Existing PFL methods generally assume that the underlying global data
across all clients are uniformly distributed without considering the long-tail
distribution. The joint problem of data heterogeneity and long-tail
distribution in the FL environment is more challenging and severely affects the
performance of personalized models. In this paper, we propose a PFL method
called Federated Learning with Adversarial Feature Augmentation (FedAFA) to
address this joint problem in PFL. FedAFA optimizes the personalized model for
each client by producing a balanced feature set to enhance the local minority
classes. The local minority class features are generated by transferring the
knowledge from the local majority class features extracted by the global model
in an adversarial example learning manner. The experimental results on
benchmarks under different settings of data heterogeneity and long-tail
distribution demonstrate that FedAFA significantly improves the personalized
performance of each client compared with the state-of-the-art PFL algorithm.
The code is available at https://github.com/pxqian/FedAFA.
Comment: Accepted by ICASSP 202
Multimodal Federated Learning via Contrastive Representation Ensemble
With the increasing amount of multimedia data on modern mobile systems and
IoT infrastructures, harnessing these rich multimodal data without breaching
user privacy becomes a critical issue. Federated learning (FL) serves as a
privacy-conscious alternative to centralized machine learning. However,
existing FL methods extended to multimodal data all rely on model aggregation
on single modality level, which restrains the server and clients to have
identical model architecture for each modality. This limits the global model in
terms of both model complexity and data capacity, not to mention task
diversity. In this work, we propose Contrastive Representation Ensemble and
Aggregation for Multimodal FL (CreamFL), a multimodal federated learning
framework that enables training larger server models from clients with
heterogeneous model architectures and data modalities, while only communicating
knowledge on public dataset. To achieve better multimodal representation
fusion, we design a global-local cross-modal ensemble strategy to aggregate
client representations. To mitigate local model drift caused by two
unprecedented heterogeneous factors stemming from multimodal discrepancy
(modality gap and task gap), we further propose two inter-modal and intra-modal
contrasts to regularize local training, which complements information of the
absent modality for uni-modal clients and regularizes local clients to head
towards global consensus. Thorough evaluations and ablation studies on
image-text retrieval and visual question answering tasks showcase the
superiority of CreamFL over state-of-the-art FL methods and its practical
value.
Comment: ICLR 2023. Code is available at https://github.com/FLAIR-THU/CreamF
Federated Training of Dual Encoding Models on Small Non-IID Client Datasets
Dual encoding models that encode a pair of inputs are widely used for
representation learning. Many approaches train dual encoding models by
maximizing agreement between pairs of encodings on centralized training data.
However, in many scenarios, datasets are inherently decentralized across many
clients (user devices or organizations) due to privacy concerns, motivating
federated learning. In this work, we focus on federated training of dual
encoding models on decentralized data composed of many small, non-IID (not
independent and identically distributed) client datasets. We show that
existing approaches that work well in centralized settings perform poorly when
naively adapted to this setting using federated averaging. We observe that we
can simulate large-batch loss computation on individual clients for loss
functions that are based on encoding statistics. Based on this insight, we
propose a novel federated training approach, Distributed Cross Correlation
Optimization (DCCO), which trains dual encoding models using encoding
statistics aggregated across clients, without sharing individual data samples.
Our experimental results on two datasets demonstrate that the proposed DCCO
approach outperforms federated variants of existing approaches by a large
margin.
Comment: ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML
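The key insight here — that per-client encoding statistics can be aggregated to reproduce a large-batch computation without sharing samples — can be checked with a small numpy sketch. This is our own illustration of the principle, not the DCCO implementation: the sum of per-client cross-correlation statistics equals the statistic of the pooled batch.

```python
import numpy as np

def client_stats(za, zb):
    # per-client sufficient statistics: sum of outer products and sample count
    return za.T @ zb, len(za)

def server_cross_correlation(stats):
    # combine client statistics into a "large-batch" cross-correlation matrix
    total = sum(n for _, n in stats)
    return sum(s for s, _ in stats) / total

# four clients, each with 5 local samples of paired 3-D encodings
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(5, 3)), rng.normal(size=(5, 3))) for _ in range(4)]
stats = [client_stats(za, zb) for za, zb in clients]
C = server_cross_correlation(stats)

# identical to computing the matrix on the pooled batch of 20 samples
za_all = np.vstack([za for za, _ in clients])
zb_all = np.vstack([zb for _, zb in clients])
print(np.allclose(C, za_all.T @ zb_all / 20))  # True
```

Because the pooled statistic decomposes exactly into a sum over clients, the server recovers the large-batch quantity while each client transmits only aggregate matrices.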