FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices using a Computing Power Aware Scheduler
Cross-silo federated learning offers a promising solution to collaboratively
train robust and generalized AI models without compromising the privacy of
local datasets, e.g., in healthcare, finance, and scientific projects
that lack a centralized data facility. Nonetheless, because of the disparity of
computing resources among different clients (i.e., device heterogeneity),
synchronous federated learning algorithms suffer from degraded efficiency when
waiting for straggler clients. Similarly, asynchronous federated learning
algorithms experience degradation in the convergence rate and final model
accuracy on non-identically and independently distributed (non-IID)
heterogeneous datasets due to stale local models and client drift. To address
these limitations in cross-silo federated learning with heterogeneous clients
and data, we propose FedCompass, an innovative semi-asynchronous federated
learning algorithm with a computing power aware scheduler on the server side,
which adaptively assigns varying amounts of training tasks to different clients
using the knowledge of the computing power of individual clients. FedCompass
ensures that multiple locally trained models from clients are received almost
simultaneously as a group for aggregation, effectively reducing the staleness
of local models. At the same time, the overall training process remains
asynchronous, eliminating prolonged waiting periods from straggler clients.
Using diverse non-IID heterogeneous distributed datasets, we demonstrate that
FedCompass achieves faster convergence and higher accuracy than other
asynchronous algorithms while remaining more efficient than synchronous
algorithms when performing federated learning on heterogeneous clients.
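The core scheduling idea described above, assigning more local work to faster clients so that a group of clients finishes at roughly the same time, can be sketched as follows. This is an illustrative simplification, not the actual FedCompass implementation; the function name, the per-step timing inputs, and the step bounds are all assumptions for the example.

```python
# Illustrative sketch of computing-power-aware task assignment: give each
# client a number of local training steps inversely proportional to its
# measured per-step time, so all clients finish near a common deadline.
def assign_local_steps(step_times, target_wall_time, min_steps=1, max_steps=100):
    """step_times: dict of client_id -> measured seconds per local step.
    Returns dict of client_id -> number of local steps to assign."""
    assignments = {}
    for client, seconds_per_step in step_times.items():
        steps = int(target_wall_time / seconds_per_step)
        # Clamp so no client is idle or assigned an unbounded workload.
        assignments[client] = max(min_steps, min(max_steps, steps))
    return assignments

# Hypothetical clients with different per-step speeds (seconds per step).
clients = {"hospital_a": 0.5, "hospital_b": 2.0, "lab_c": 1.0}
print(assign_local_steps(clients, target_wall_time=20.0))
# -> {'hospital_a': 40, 'hospital_b': 10, 'lab_c': 20}
```

With these assignments, all three clients would take about 20 seconds of local training, so their models arrive at the server nearly simultaneously for group aggregation, which is the staleness-reduction effect the abstract describes.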
A case study for cloud based high throughput analysis of NGS data using the globus genomics system
Abstract: Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the "Globus Genomics" system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel, and it also helps meet the scale-out analysis needs of modern translational genomics research.
APPFLx: Providing Privacy-Preserving Cross-Silo Federated Learning as a Service
Cross-silo privacy-preserving federated learning (PPFL) is a powerful tool to
collaboratively train robust and generalized machine learning (ML) models
without sharing sensitive (e.g., healthcare or financial) local data. To ease
and accelerate the adoption of PPFL, we introduce APPFLx, a ready-to-use
platform that provides privacy-preserving cross-silo federated learning as a
service. APPFLx employs Globus authentication to allow users to easily and
securely invite trustworthy collaborators for PPFL, implements several
synchronous and asynchronous FL algorithms, streamlines the FL experiment
launch process, and enables tracking and visualizing the life cycle of FL
experiments, allowing domain experts and ML practitioners to easily orchestrate
and evaluate cross-silo FL under one platform. APPFLx is available online at
https://appflx.lin
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets
Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified.
We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing.
We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
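The unambiguous-description idea behind BDBags and Minids can be illustrated with a small standard-library sketch. This is not the actual BDBag or Minid tooling, only an approximation of the underlying concept: a BagIt-style checksum manifest that explicitly enumerates every dataset member, plus a content-derived identifier for the manifest. The function names are hypothetical.

```python
# Illustrative sketch (not the real BDBag/Minid libraries): describe a
# dataset's members explicitly via a checksum manifest, then derive a
# stable content identifier from that manifest.
import hashlib
import os

def payload_manifest(data_dir):
    """Return {relative_path: sha256_hex} for every file under data_dir,
    so the dataset's members are explicitly and unambiguously listed."""
    manifest = {}
    for root, _, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            manifest[os.path.relpath(path, data_dir)] = digest
    return manifest

def dataset_identifier(manifest):
    """A Minid-like concise identifier: hash of the sorted manifest lines.
    Any change to any member changes the identifier."""
    lines = "\n".join(f"{h}  {p}" for p, h in sorted(manifest.items()))
    return hashlib.sha256(lines.encode()).hexdigest()
```

Because the manifest lists each member with its checksum, errors of omission (a missing file) and commission (a corrupted or substituted file) become detectable by recomputing and comparing, which is the property the abstract highlights.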
Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services
Abstract: We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.
Reproducible big data science: A case study in continuous FAIRness
Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
Transcriptome-wide association analysis identifies candidate susceptibility genes for prostate-specific antigen levels in men without prostate cancer.
Deciphering the genetic basis of prostate-specific antigen (PSA) levels may improve their utility for prostate cancer (PCa) screening. Using genome-wide association study (GWAS) summary statistics from 95,768 PCa-free men, we conducted a transcriptome-wide association study (TWAS) to examine impacts of genetically predicted gene expression on PSA. Analyses identified 41 statistically significant (p < 0.05/12,192 = 4.10 × 10^-6) associations in whole blood and 39 statistically significant (p < 0.05/13,844 = 3.61 × 10^-6) associations in prostate tissue, with 18 genes associated in both tissues. Cross-tissue analyses identified 155 statistically significant (p < 0.05/22,249 = 2.25 × 10^-6) genes. Out of 173 unique PSA-associated genes across analyses, we replicated 151 (87.3%) in a TWAS of 209,318 PCa-free individuals from the Million Veteran Program. Based on conditional analyses, we found 20 genes (11 single tissue, nine cross-tissue) that were associated with PSA levels in the discovery TWAS that were not attributable to a lead variant from a GWAS. Ten of these 20 genes replicated, and two of the replicated genes had colocalization probability of >0.5: CCNA2 and HIST1H2BN. Six of the 20 identified genes are not known to impact PCa risk. Fine-mapping based on whole blood and prostate tissue revealed five protein-coding genes with evidence of causal relationships with PSA levels. Of these five genes, four exhibited evidence of colocalization and one was conditionally independent of previous GWAS findings. These results yield hypotheses that should be further explored to improve understanding of genetic factors underlying PSA levels.
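The significance thresholds quoted in the abstract are simple Bonferroni corrections: the alpha level divided by the number of genes tested in each analysis. A quick check of the arithmetic:

```python
# Bonferroni-corrected significance thresholds from the TWAS abstract:
# alpha / number of genes tested in each analysis.
alpha = 0.05
analyses = [
    ("whole blood", 12192),    # -> approx 4.10e-06
    ("prostate tissue", 13844),  # -> approx 3.61e-06
    ("cross-tissue", 22249),   # -> approx 2.25e-06
]
for label, n_tests in analyses:
    print(f"{label}: {alpha / n_tests:.2e}")
```

The printed values match the thresholds reported in the abstract, confirming that each is alpha = 0.05 divided by the per-analysis gene count.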
Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types.
Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.