3,423 research outputs found

    Heterogeneous Datasets for Federated Survival Analysis Simulation

    This repo contains three algorithms for constructing realistic federated datasets for survival analysis. Each algorithm starts from an existing non-federated dataset and assigns each sample to a specific client in the federation. The algorithms are:
    - uniform_split: assigns each sample to a random client with uniform probability;
    - quantity_skewed_split: assigns each sample to a random client according to the Dirichlet distribution [3, 4];
    - label_skewed_split: assigns each sample to a time bin, then assigns a set of samples from each bin to the clients according to the Dirichlet distribution [3, 4].
    For more information, please take a look at our paper at https://arxiv.org/abs/2301.12166 [1]. (An illustrative sketch of a Dirichlet-based split appears after the references below.)

    Content
    - federated_survival_datasets.zip: the content of the repository at https://github.com/archettialberto/federated_survival_datasets
    - Heterogheneous_Datasets_for_Federated_Survival_Analysis_Simulation.pdf: the conference paper describing the work.

    Installation
    Federated Survival Datasets is built on top of numpy and scikit-learn. To install those libraries, you can run pip install -r requirements.txt. To import survival datasets into your project, we strongly recommend SurvSet (https://github.com/ErikinBC/SurvSet) [2], a comprehensive collection of more than 70 survival datasets.

    Usage
        import numpy as np
        import pandas as pd
        from federated_survival_datasets import label_skewed_split

        # import a survival dataset and extract the input array X and the output array y
        df = pd.read_csv("metabric.csv")
        X = df[[f"x{i}" for i in range(9)]].to_numpy()
        y = np.array([(e, t) for e, t in zip(df["event"], df["time"])], dtype=[("event", bool), ("time", float)])

        # run the splitting algorithm
        client_data = label_skewed_split(num_clients=8, X=X, y=y)

        # check the number of samples assigned to each client
        for i, (X_c, y_c) in enumerate(client_data):
            print(f"Client {i} - X: {X_c.shape}, y: {y_c.shape}")

    We provide an example notebook in the zipped folder to illustrate the proposed algorithms. It requires scikit-survival, seaborn, and pandas.

    References
    [1] Archetti, A., Lomurno, E., Lattari, F., Martin, A., & Matteucci, M. (2023). Heterogeneous Datasets for Federated Survival Analysis Simulation. arXiv preprint arXiv:2301.12166.
    [2] Drysdale, E. (2022). SurvSet: An open-source time-to-event dataset repository. arXiv preprint arXiv:2203.03094.
    [3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
    [4] Li, Q., Diao, Y., Chen, Q., & He, B. (2022, May). Federated learning on non-IID data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE) (pp. 965-978). IEEE.
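
    To make the Dirichlet-based partitioning concrete, here is a minimal sketch of a quantity-skewed split in the spirit of quantity_skewed_split. It is not the repository's actual implementation; the function name, the alpha parameter, and the return format are assumptions for illustration only.

        import numpy as np

        def quantity_skewed_split_sketch(X, y, num_clients=8, alpha=0.5, seed=0):
            """Assign samples to clients with Dirichlet-distributed client sizes (illustrative only)."""
            rng = np.random.default_rng(seed)
            n = len(X)
            # Draw one Dirichlet vector: the fraction of the dataset each client receives.
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            # Shuffle sample indices and cut them according to the proportions.
            idx = rng.permutation(n)
            cuts = (np.cumsum(proportions) * n).astype(int)[:-1]
            parts = np.split(idx, cuts)
            return [(X[p], y[p]) for p in parts]

    A smaller alpha yields more skewed client sizes; a larger alpha approaches a uniform split.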

    A Comparative Evaluation of FedAvg and Per-FedAvg Algorithms for Dirichlet Distributed Heterogeneous Data

    In this paper, we investigate Federated Learning (FL), a machine learning paradigm that allows decentralized model training on devices without sharing raw data, thereby preserving data privacy. In particular, we compare two strategies within this paradigm: Federated Averaging (FedAvg) and Personalized Federated Averaging (Per-FedAvg), focusing on their performance with Non-Identically and Independently Distributed (Non-IID) data. Our analysis shows that the level of data heterogeneity, modeled using a Dirichlet distribution, significantly affects the performance of both strategies, with Per-FedAvg showing superior robustness under high heterogeneity. Our results provide insights into the development of more effective and efficient machine learning strategies in a decentralized setting. Comment: 6 pages, 5 figures, conference
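
    As background for the comparison above, a minimal sketch of the standard FedAvg aggregation step (size-weighted averaging of client parameters); the parameter layout and names are assumptions, not taken from the paper.

        import numpy as np

        def fedavg_aggregate(client_weights, client_sizes):
            """Weighted average of client model parameters, weighted by local dataset size.
            client_weights: list of per-client parameter lists (one numpy array per layer)."""
            total = sum(client_sizes)
            aggregated = []
            for layer_params in zip(*client_weights):
                layer = sum(w * (n / total) for w, n in zip(layer_params, client_sizes))
                aggregated.append(layer)
            return aggregated

    Per-FedAvg, in contrast, optimizes an initialization that each client can quickly personalize with a few local gradient steps (a MAML-style objective), which is why it tends to hold up better under strong heterogeneity.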

    Personalized Federated Learning on Long-Tailed Data via Adversarial Feature Augmentation

    Personalized Federated Learning (PFL) aims to learn personalized models for each client based on the knowledge across all clients in a privacy-preserving manner. Existing PFL methods generally assume that the underlying global data across all clients are uniformly distributed, without considering the long-tail distribution. The joint problem of data heterogeneity and long-tail distribution in the FL environment is more challenging and severely affects the performance of personalized models. In this paper, we propose a PFL method called Federated Learning with Adversarial Feature Augmentation (FedAFA) to address this joint problem. FedAFA optimizes the personalized model for each client by producing a balanced feature set that enhances the local minority classes. The local minority-class features are generated by transferring knowledge from the local majority-class features extracted by the global model, in the manner of adversarial example learning. Experimental results on benchmarks under different settings of data heterogeneity and long-tail distribution demonstrate that FedAFA significantly improves the personalized performance of each client compared with state-of-the-art PFL algorithms. The code is available at https://github.com/pxqian/FedAFA. Comment: Accepted by ICASSP 2023
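
    The key idea in the abstract is to synthesize minority-class features by adversarially perturbing majority-class features. A rough, illustrative sketch of that idea follows; the loss choice, step size, and function names are assumptions, not FedAFA's actual procedure (see the paper and the linked repository for that).

        import torch
        import torch.nn.functional as F

        def augment_minority_features(features, classifier, target_class, step_size=0.05, steps=5):
            """Perturb majority-class features so a frozen classifier head assigns them
            to a chosen minority class, in the style of adversarial example generation."""
            feats = features.clone().detach().requires_grad_(True)
            target = torch.full((feats.size(0),), target_class, dtype=torch.long, device=feats.device)
            for _ in range(steps):
                logits = classifier(feats)
                loss = F.cross_entropy(logits, target)  # pull features toward the minority class
                grad = torch.autograd.grad(loss, feats)[0]
                feats = (feats - step_size * grad.sign()).detach().requires_grad_(True)
            return feats.detach()

    The synthetic minority-class features could then be mixed with real local features to rebalance the classifier during personalized training.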

    Multimodal Federated Learning via Contrastive Representation Ensemble

    With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy becomes a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning. However, existing FL methods extended to multimodal data all rely on model aggregation at the single-modality level, which constrains the server and clients to identical model architectures for each modality. This limits the global model in terms of both model complexity and data capacity, not to mention task diversity. In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while only communicating knowledge on a public dataset. To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy to aggregate client representations. To mitigate local model drift caused by two heterogeneity factors stemming from multimodal discrepancy (modality gap and task gap), we further propose inter-modal and intra-modal contrasts to regularize local training, which complement the missing-modality information for uni-modal clients and regularize local clients toward the global consensus. Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks showcase the superiority of CreamFL over state-of-the-art FL methods and its practical value. Comment: ICLR 2023. Code is available at https://github.com/FLAIR-THU/CreamFL
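
    A minimal sketch of the communication pattern described above: clients send representations computed on a shared public dataset, and the server aggregates them per sample before distilling them into a larger server model. The mean-based aggregation, the InfoNCE-style distillation loss, and the names are assumptions for illustration, not CreamFL's actual ensemble strategy.

        import torch
        import torch.nn.functional as F

        def aggregate_public_representations(client_reps):
            """client_reps: list of [num_public_samples, dim] tensors, one per client.
            Returns a per-sample ensemble representation (here: a simple mean)."""
            stacked = torch.stack(client_reps, dim=0)
            return F.normalize(stacked.mean(dim=0), dim=-1)

        def distillation_loss(server_reps, ensemble_reps, temperature=0.07):
            """Pull the server model's public-data representations toward the ensemble
            with a contrastive (InfoNCE-style) objective."""
            logits = server_reps @ ensemble_reps.t() / temperature
            targets = torch.arange(server_reps.size(0), device=server_reps.device)
            return F.cross_entropy(logits, targets)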

    Federated Training of Dual Encoding Models on Small Non-IID Client Datasets

    Dual encoding models that encode a pair of inputs are widely used for representation learning. Many approaches train dual encoding models by maximizing agreement between pairs of encodings on centralized training data. However, in many scenarios, datasets are inherently decentralized across many clients (user devices or organizations) due to privacy concerns, motivating federated learning. In this work, we focus on federated training of dual encoding models on decentralized data composed of many small, non-IID (non-independent and identically distributed) client datasets. We show that existing approaches that work well in centralized settings perform poorly when naively adapted to this setting using federated averaging. We observe that, for loss functions based on encoding statistics, large-batch loss computation can be simulated from statistics collected on individual clients. Based on this insight, we propose a novel federated training approach, Distributed Cross Correlation Optimization (DCCO), which trains dual encoding models using encoding statistics aggregated across clients, without sharing individual data samples. Our experimental results on two datasets demonstrate that the proposed DCCO approach outperforms federated variants of existing approaches by a large margin. Comment: ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML
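
    The key observation above is that some dual-encoder losses depend only on encoding statistics, which can be summed across clients without sharing samples. A minimal sketch of that idea with a cross-correlation (Barlow-Twins-style) objective follows; the statistics interface, the loss weighting, and the assumption of pre-standardized encodings are illustrative choices, not the paper's exact DCCO formulation.

        import torch

        def local_statistics(z_a, z_b):
            """Per-client sufficient statistics for a cross-correlation loss.
            z_a, z_b: [n_local, dim] encodings of the two inputs of each pair,
            assumed already standardized per dimension (e.g., by a normalization layer)."""
            return {"cross": z_a.t() @ z_b, "n": torch.tensor(float(z_a.size(0)))}

        def cross_correlation_loss(stats_list, lambd=5e-3):
            """Server side: sum statistics across clients, then compute the loss as if
            all pairs had been encoded in one large batch."""
            n = sum(s["n"] for s in stats_list)
            c = sum(s["cross"] for s in stats_list) / n  # aggregated cross-correlation matrix
            on_diag = (torch.diagonal(c) - 1).pow(2).sum()
            off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()
            return on_diag + lambd * off_diag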