1,764 research outputs found

    A Cloud Infrastructure for Optimization of a Massive Parallel Sequencing Workflow

    Massive Parallel Sequencing is a term used to describe several revolutionary approaches to DNA sequencing, the so-called Next Generation Sequencing technologies. These technologies generate millions of short sequence fragments in a single run and can be used to measure levels of gene expression and to identify novel splice variants of genes, allowing more accurate analysis. The proposed solution is novel in two respects: first, an optimization of the read mapping algorithm has been designed to parallelize processing; second, an architecture has been implemented that consists of a Grid platform composed of physical nodes, a Virtual platform composed of virtual nodes set up on demand, and a scheduler that integrates the two platforms.

    Integrating Nextflow and AWS for Large-Scale Genomic Analysis: A Hypothetical Case Study

    This article explores the combination of Nextflow and Amazon Web Services (AWS) to address the challenges inherent in large-scale genomic analysis. Focusing on a hypothetical case called "The Pacific Genome Atlas", it illustrates how a research organization could approach the sequencing and analysis of 10,000 genomes. Although the "Pacific Genome Atlas" is a fictional example used for illustrative purposes only, it highlights the real challenges associated with large genomic projects, such as handling huge volumes of data and the need for intensive computational analysis. Through the integration of Nextflow, a workflow management tool, with the AWS cloud infrastructure, we demonstrate how these challenges can be overcome, offering scalable, flexible and cost-effective solutions for genomic research. The adoption of modern technologies, such as those described in this article, is essential to advance the field of genomics and accelerate scientific discoveries. The present study has been funded by the AIR Genomics project (file number CCTT3/20/SA/0003) through the 2020 call for R&D Projects Oriented towards Excellence and Competitive Improvement of CCTT by the Institute of Business Competitiveness of Castilla y León and FEDER fund

    Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics

    While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the structural information they predict could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as those of prokaryotes is large, typically many thousands, which prohibits their application as a genome-wide structural systems biology tool. To make such tools practical, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly amenable to small genomes such as those of prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. © 2014 Anjani Ragothaman et al.

    High-performance integrated virtual environment (HIVE) tools and applications for big data analysis

    The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is built on a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.