FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices using a Computing Power Aware Scheduler
Cross-silo federated learning offers a promising solution to collaboratively
train robust and generalized AI models without compromising the privacy of
local datasets, e.g., in healthcare, finance, and scientific projects
that lack a centralized data facility. Nonetheless, because of the disparity of
computing resources among different clients (i.e., device heterogeneity),
synchronous federated learning algorithms suffer from degraded efficiency when
waiting for straggler clients. Similarly, asynchronous federated learning
algorithms experience degradation in the convergence rate and final model
accuracy on non-identically and independently distributed (non-IID)
heterogeneous datasets due to stale local models and client drift. To address
these limitations in cross-silo federated learning with heterogeneous clients
and data, we propose FedCompass, an innovative semi-asynchronous federated
learning algorithm with a computing power aware scheduler on the server side,
which adaptively assigns varying amounts of training tasks to different clients
using the knowledge of the computing power of individual clients. FedCompass
ensures that multiple locally trained models from clients are received almost
simultaneously as a group for aggregation, effectively reducing the staleness
of local models. At the same time, the overall training process remains
asynchronous, eliminating prolonged waiting periods from straggler clients.
Using diverse non-IID heterogeneous distributed datasets, we demonstrate that
FedCompass achieves faster convergence and higher accuracy than other
asynchronous algorithms while remaining more efficient than synchronous
algorithms when performing federated learning on heterogeneous clients.
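The core scheduling idea described above, assigning more local work to faster clients so that a group of clients finishes at roughly the same time, can be sketched as follows. This is an illustrative simplification, not the actual FedCompass implementation; the function name, the per-step timing inputs, and the step bounds are all assumptions for the example.

```python
# Illustrative sketch of computing-power-aware task assignment: give each
# client a number of local training steps inversely proportional to its
# measured per-step time, so all clients finish near a common deadline.
def assign_local_steps(step_times, target_wall_time, min_steps=1, max_steps=100):
    """step_times: dict of client_id -> measured seconds per local step.
    Returns dict of client_id -> number of local steps to assign."""
    assignments = {}
    for client, seconds_per_step in step_times.items():
        steps = int(target_wall_time / seconds_per_step)
        # Clamp so no client is idle or assigned an unbounded workload.
        assignments[client] = max(min_steps, min(max_steps, steps))
    return assignments

# Hypothetical clients with different per-step speeds (seconds per step).
clients = {"hospital_a": 0.5, "hospital_b": 2.0, "lab_c": 1.0}
print(assign_local_steps(clients, target_wall_time=20.0))
# -> {'hospital_a': 40, 'hospital_b': 10, 'lab_c': 20}
```

With these assignments, all three clients would take about 20 seconds of local training, so their models arrive at the server nearly simultaneously for group aggregation, which is the staleness-reduction effect the abstract describes.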
A case study for cloud based high throughput analysis of NGS data using the globus genomics system
Abstract: Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the "Globus Genomics" system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel, and it also helps meet the scale-out analysis needs of modern translational genomics research.
APPFLx: Providing Privacy-Preserving Cross-Silo Federated Learning as a Service
Cross-silo privacy-preserving federated learning (PPFL) is a powerful tool to
collaboratively train robust and generalized machine learning (ML) models
without sharing sensitive (e.g., healthcare or financial) local data. To ease
and accelerate the adoption of PPFL, we introduce APPFLx, a ready-to-use
platform that provides privacy-preserving cross-silo federated learning as a
service. APPFLx employs Globus authentication to allow users to easily and
securely invite trustworthy collaborators for PPFL, implements several
synchronous and asynchronous FL algorithms, streamlines the FL experiment
launch process, and enables tracking and visualizing the life cycle of FL
experiments, allowing domain experts and ML practitioners to easily orchestrate
and evaluate cross-silo FL under one platform. APPFLx is available online at
https://appflx.lin
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified.
We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing.
We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
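The unambiguous-description idea behind BDBags and Minids can be illustrated with a small standard-library sketch. This is not the actual BDBag or Minid tooling, only an approximation of the underlying concept: a BagIt-style checksum manifest that explicitly enumerates every dataset member, plus a content-derived identifier for the manifest. The function names are hypothetical.

```python
# Illustrative sketch (not the real BDBag/Minid libraries): describe a
# dataset's members explicitly via a checksum manifest, then derive a
# stable content identifier from that manifest.
import hashlib
import os

def payload_manifest(data_dir):
    """Return {relative_path: sha256_hex} for every file under data_dir,
    so the dataset's members are explicitly and unambiguously listed."""
    manifest = {}
    for root, _, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            manifest[os.path.relpath(path, data_dir)] = digest
    return manifest

def dataset_identifier(manifest):
    """A Minid-like concise identifier: hash of the sorted manifest lines.
    Any change to any member changes the identifier."""
    lines = "\n".join(f"{h}  {p}" for p, h in sorted(manifest.items()))
    return hashlib.sha256(lines.encode()).hexdigest()
```

Because the manifest lists each member with its checksum, errors of omission (a missing file) and commission (a corrupted or substituted file) become detectable by recomputing and comparing, which is the property the abstract highlights.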
Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services
Abstract: We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.
Reproducible big data science: A case study in continuous FAIRness
Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
Transcriptome-wide association analysis identifies candidate susceptibility genes for prostate-specific antigen levels in men without prostate cancer.
Deciphering the genetic basis of prostate-specific antigen (PSA) levels may improve their utility for prostate cancer (PCa) screening. Using genome-wide association study (GWAS) summary statistics from 95,768 PCa-free men, we conducted a transcriptome-wide association study (TWAS) to examine impacts of genetically predicted gene expression on PSA. Analyses identified 41 statistically significant (p < 0.05/12,192 = 4.10 × 10^-6) associations in whole blood and 39 statistically significant (p < 0.05/13,844 = 3.61 × 10^-6) associations in prostate tissue, with 18 genes associated in both tissues. Cross-tissue analyses identified 155 statistically significant (p < 0.05/22,249 = 2.25 × 10^-6) genes. Out of 173 unique PSA-associated genes across analyses, we replicated 151 (87.3%) in a TWAS of 209,318 PCa-free individuals from the Million Veteran Program. Based on conditional analyses, we found 20 genes (11 single tissue, nine cross-tissue) that were associated with PSA levels in the discovery TWAS that were not attributable to a lead variant from a GWAS. Ten of these 20 genes replicated, and two of the replicated genes had colocalization probability of >0.5: CCNA2 and HIST1H2BN. Six of the 20 identified genes are not known to impact PCa risk. Fine-mapping based on whole blood and prostate tissue revealed five protein-coding genes with evidence of causal relationships with PSA levels. Of these five genes, four exhibited evidence of colocalization and one was conditionally independent of previous GWAS findings. These results yield hypotheses that should be further explored to improve understanding of genetic factors underlying PSA levels.
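The significance thresholds quoted in the abstract are simple Bonferroni corrections: the alpha level divided by the number of genes tested in each analysis. A quick check of the arithmetic:

```python
# Bonferroni-corrected significance thresholds from the TWAS abstract:
# alpha / number of genes tested in each analysis.
alpha = 0.05
analyses = [
    ("whole blood", 12192),    # -> approx 4.10e-06
    ("prostate tissue", 13844),  # -> approx 3.61e-06
    ("cross-tissue", 22249),   # -> approx 2.25e-06
]
for label, n_tests in analyses:
    print(f"{label}: {alpha / n_tests:.2e}")
```

The printed values match the thresholds reported in the abstract, confirming that each is alpha = 0.05 divided by the per-analysis gene count.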
Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types.
Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.