
    Scalable and cost-effective NGS genotyping in the cloud

    Background: While next-generation sequencing (NGS) costs have plummeted in recent years, the cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at costs in the tens of dollars. Results: We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows, making optimal use of high-performance compute clusters. Here we show that the Amazon Web Services (AWS) implementation of GenomeKey via COSMOS provides fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Conclusions: Our systematic benchmarking reveals important new insights and considerations for optimizing the clinical turnaround of whole genome analysis and its workflow management, including strategic batching of individual genomes and efficient cluster resource configuration.
    Yassine Souilmi, Alex K. Lancaster, Jae-Yoon Jung, Ettore Rizzo, Jared B. Hawkins, Ryan Powles, Saaïd Amzazi, Hassan Ghazal, Peter J. Tonellato and Dennis P. Wall
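    As a concrete illustration of the batching strategy described above, the following Python sketch groups per-genome task chains into cluster-sized batches. The Task objects, stage names, and batching helper are hypothetical, invented for illustration; this is not the published COSMOS or GenomeKey code.

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        deps: list = field(default_factory=list)

    def genome_tasks(sample):
        """Per-sample chain: alignment -> duplicate marking -> variant calling."""
        align = Task(f"bwa_align[{sample}]")
        dedup = Task(f"mark_duplicates[{sample}]", deps=[align])
        call = Task(f"haplotype_caller[{sample}]", deps=[dedup])
        return [align, dedup, call]

    def batches(samples, size):
        """Strategic batching: group genomes so each batch saturates the cluster."""
        for i in range(0, len(samples), size):
            yield samples[i:i + size]

    samples = [f"sample_{i:02d}" for i in range(1, 9)]   # hypothetical sample IDs
    for group in batches(samples, size=4):
        tasks = [t for s in group for t in genome_tasks(s)]
        print(f"submitting batch of {len(group)} genomes ({len(tasks)} tasks)")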

    A Multi-Code Analysis Toolkit for Astrophysical Simulation Data

    The analysis of complex multiphysics astrophysical simulations presents a unique and rapidly growing set of challenges: reproducibility, parallelization, and vast increases in data size and complexity chief among them. In order to meet these challenges, and to open up new avenues for collaboration between users of multiple simulation platforms, we present yt (available at http://yt.enzotools.org/), an open source, community-developed astrophysical analysis and visualization toolkit. Analysis and visualization with yt are oriented around physically relevant quantities rather than quantities native to astrophysical simulation codes. While originally designed for handling Enzo's structured adaptive mesh refinement (AMR) data, yt has been extended to work with several different simulation methods and simulation codes including Orion, RAMSES, and FLASH. We report on its methods for reading, handling, and visualizing data, including projections, multivariate volume rendering, multi-dimensional histograms, halo finding, light cone generation, and topologically-connected isocontour identification. Furthermore, we discuss the underlying algorithms yt uses for processing and visualizing data, and its mechanisms for parallelization of analysis tasks.
    Comment: 18 pages, 6 figures, emulateapj format. Resubmitted to Astrophysical Journal Supplement Series with revisions from referee. yt can be found at http://yt.enzotools.org
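    The toolkit's orientation around physical quantities is visible directly in its public API. A minimal example using the modern yt interface (which postdates the API described in the paper; the dataset path is a placeholder for any supported simulation output):

    import yt

    ds = yt.load("DD0010/moving7_0010")      # placeholder Enzo dataset path
    ad = ds.all_data()
    # Derived quantities are addressed by physical field, not by code internals.
    print(ad.quantities.extrema(("gas", "density")))
    # Line-of-sight projection of gas density along the z axis.
    p = yt.ProjectionPlot(ds, "z", ("gas", "density"))
    p.save("density_projection.png")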

    Data Provenance and Management in Radio Astronomy: A Stream Computing Approach

    New approaches for data provenance and data management (DPDM) are required for mega-science projects like the Square Kilometre Array, which are characterized by extremely large data volumes and intense data rates and therefore demand innovative and highly efficient computational paradigms. In this context, we explore a stream-computing approach with an emphasis on the use of accelerators. In particular, we make use of a new generation of high-performance stream-based parallelization middleware known as InfoSphere Streams. Its viability for managing and ensuring the interoperability and integrity of signal processing data pipelines is demonstrated in radio astronomy. IBM InfoSphere Streams embraces the stream-computing paradigm: a shift from conventional data mining techniques (involving analysis of existing data in databases) towards real-time analytic processing. We discuss using InfoSphere Streams for effective DPDM in radio astronomy and propose a way in which InfoSphere Streams can be utilized for large antenna arrays. We present a case study, the InfoSphere Streams implementation of an autocorrelating spectrometer, and use this example to discuss the advantages of the stream-computing approach and the utilization of hardware accelerators.
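    The spectrometer case study rests on the Wiener-Khinchin theorem: the power spectrum is the Fourier transform of the signal's autocorrelation, so a streaming spectrometer can FFT each incoming block and integrate the squared magnitudes. The sketch below illustrates that computation in NumPy rather than in IBM's SPL/Streams API; the block size and test signal are invented for the example.

    import numpy as np

    N_CHAN = 1024                            # real voltage samples per block

    def integrate_spectra(blocks):
        """Fold a stream of voltage blocks into one integrated power spectrum."""
        accum = np.zeros(N_CHAN // 2 + 1)    # one bin per rfft output channel
        for block in blocks:
            accum += np.abs(np.fft.rfft(block)) ** 2   # |FFT|^2 per block
        return accum

    # Simulated antenna stream: white noise plus a narrowband tone in channel 100.
    rng = np.random.default_rng(0)
    tone = np.sin(2 * np.pi * 100 * np.arange(N_CHAN) / N_CHAN)
    stream = (rng.normal(size=N_CHAN) + tone for _ in range(64))
    print("peak channel:", int(integrate_spectra(stream)[1:].argmax()) + 1)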

    Architecture of the Grid Services Toolkit for Process Data Processing

    The Grid is a rapidly growing technology that promises easy access to vast computing resources, both hardware and software. As these resources become available, more and more scientific users are interested in benefiting from them. The main barrier to accessing the Grid today is that scientific users typically need substantial knowledge of Grid methods and technologies in addition to their own field of research. To bridge the gap between high-level scientific Grid users and low-level functions of Grid environments, the Grid Services Toolkit (GST) is being developed at the IPE. Aimed at simplifying and accelerating the development of parallelized scientific Grid applications, the GST is based on Web services extended by a rich client API. It is designed especially for the field of process data processing, providing database access and management, common methods of statistical data analysis, and project-specific methods.
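    To give a flavor of a rich client API layered over Web services, here is a hypothetical Python client sketch; the endpoint URL, service name, and parameters are invented for illustration and do not come from the GST itself.

    import json
    from urllib import request

    GST_ENDPOINT = "https://gst.example.org/services"   # placeholder URL

    def call_service(service, **params):
        """POST a JSON request to a Grid service and return the decoded reply."""
        req = request.Request(
            f"{GST_ENDPOINT}/{service}",
            data=json.dumps(params).encode(),
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as resp:
            return json.load(resp)

    # e.g. a common statistical analysis over archived process data:
    # stats = call_service("statistics/describe", dataset="run42", columns=["T", "p"])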

    CGHub: Kick-starting the Worldwide Genome Web

    The University of California, Santa Cruz (UCSC) is under contract with the National Cancer Institute (NCI) to construct and operate the Cancer Genomics Hub (CGHub), a nation-scale library and user portal for cancer genomics data. This contract covers growth of the library to 5 petabytes. The NCI programs that feed into the library currently produce about 20 terabytes of data each month. We discuss the receiver-driven file transfer mechanism Annai GeneTorrent (GT) for use with the library. Annai GT uses multiple TCP streams from multiple computers at the library site to parallelize genome downloads. We review our performance experience with the new transfer mechanism and also explain additions to the transfer protocol to support the security required in handling patient cancer genomics data.
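    The core idea of receiver-driven, multi-stream transfer can be sketched independently of the GeneTorrent protocol itself: the client opens several TCP connections, fetches disjoint byte ranges concurrently, and reassembles them. The HTTP range-request version below is a conceptual stand-in, not the Annai GT implementation.

    from concurrent.futures import ThreadPoolExecutor
    from urllib import request

    def fetch_range(url, start, end):
        """Fetch bytes [start, end] over a dedicated HTTP/TCP connection."""
        req = request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with request.urlopen(req) as resp:
            return start, resp.read()

    def parallel_download(url, size, streams=8):
        """Split [0, size) into ranges and pull them over `streams` connections."""
        chunk = -(-size // streams)                       # ceiling division
        ranges = [(i, min(i + chunk, size) - 1) for i in range(0, size, chunk)]
        buf = bytearray(size)
        with ThreadPoolExecutor(max_workers=streams) as pool:
            for start, data in pool.map(lambda r: fetch_range(url, *r), ranges):
                buf[start:start + len(data)] = data       # reassemble in place
        return bytes(buf)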

    The aDORe federation architecture: digital repositories at scale


    Forensicloud: An Architecture for Digital Forensic Analysis in the Cloud

    The amount of data that must be processed in current digital forensic examinations continues to rise. Both the volume and the diversity of data are obstacles to the timely completion of forensic investigations. Additionally, some law enforcement agencies do not have the resources to handle cases of even moderate size. To address these issues we have developed an architecture for a cloud-based distributed processing platform named Forensicloud. This architecture is designed to reduce the time taken to process digital evidence by leveraging the power of a high-performance computing platform and by adapting existing tools to operate within this environment. Forensicloud's Software-as-a-Service and Infrastructure-as-a-Service models allow investigators to use remote virtual environments for investigating digital evidence. These environments give investigators access to licensed and unlicensed tools they may not have had before, and allow some of these tools to be run on computing clusters.
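    Much of the speedup such a platform offers comes from distributing embarrassingly parallel evidence processing across many nodes. As a toy, single-machine stand-in (the case path is a placeholder, and this is not Forensicloud's actual code), the sketch below fans file hashing out across worker processes:

    import hashlib
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def sha256_of(path):
        """Stream one file through SHA-256 without loading it into memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return str(path), h.hexdigest()

    def hash_evidence(root):
        """Hash every file under root in parallel, returning a manifest."""
        files = [p for p in Path(root).rglob("*") if p.is_file()]
        with ProcessPoolExecutor() as pool:
            return dict(pool.map(sha256_of, files))

    # if __name__ == "__main__":
    #     manifest = hash_evidence("/cases/2014-017")    # placeholder case path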