A Cloud Infrastructure for Optimization of a Massive Parallel Sequencing Workflow
Massive Parallel Sequencing is a term used to describe several revolutionary approaches to DNA sequencing, the so-called Next Generation Sequencing technologies. These technologies generate millions of short sequence fragments in a single run and can be used to measure levels of gene expression and to identify novel splice variants of genes, allowing more accurate analysis. The proposed solution is novel in two respects. First, an optimization of the read mapping algorithm has been designed in order to parallelize processes. Second, an architecture has been implemented that consists of a Grid platform composed of physical nodes, a Virtual platform composed of virtual nodes set up on demand, and a scheduler that integrates the two platforms.
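The abstract does not describe how the read mapping step is parallelized. One common pattern, shown here only as a minimal sketch and not as the authors' algorithm, is to partition the reads into chunks and map each chunk independently. The function names, the exact-match toy mapper and the use of a thread pool are all assumptions made for illustration; in a real deployment each chunk would go to a grid or virtual node and a proper aligner would replace `str.find`.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(reads, n_chunks):
    """Partition the read list into at most n_chunks contiguous slices."""
    size = max(1, -(-len(reads) // n_chunks))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def map_chunk(chunk, reference):
    """Toy mapper (assumption): first exact-match position of each read, -1 if absent."""
    return [(read, reference.find(read)) for read in chunk]


def map_reads_parallel(reads, reference, n_workers=4):
    """Map each chunk in its own worker and concatenate results in input order."""
    chunks = split_into_chunks(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(map_chunk, chunks, [reference] * len(chunks))
        return [hit for chunk_hits in results for hit in chunk_hits]
```

Because each read is mapped independently of the others, the chunks need no communication, which is what makes this step straightforward to distribute across nodes.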
Integrating Nextflow and AWS for Large-Scale Genomic Analysis: A Hypothetical Case Study
This article explores the innovative combination of Nextflow and Amazon Web Services (AWS) to address the challenges inherent in large-scale genomic analysis. Focusing on a hypothetical case called "The Pacific Genome Atlas", it illustrates how a research organization could approach the sequencing and analysis of 10,000 genomes. Although the "Pacific Genome Atlas" is a fictional example used for illustrative purposes only, it highlights the real challenges associated with large genomic projects, such as handling huge volumes of data and the need for intensive computational analysis. Through the integration of Nextflow, a workflow management tool, with the AWS cloud infrastructure, we demonstrate how these challenges can be overcome, offering scalable, flexible and cost-effective solutions for genomic research. The adoption of modern technologies, such as those described in this article, is essential to advance the field of genomics and accelerate scientific discoveries. The present study has been funded by the AIR Genomics project (file number CCTT3/20/SA/0003) through the 2020 call for R&D Projects Oriented towards Excellence and Competitive Improvement of CCTT by the Institute of Business Competitiveness of Castilla y León and FEDER funds.
Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the structural information they predict could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as prokaryotes is large, typically many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. © 2014 Anjani Ragothaman et al.
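The trade-off between time-to-solution and cost-to-solution mentioned in this abstract can be illustrated with a small sketch. Nothing here comes from the paper: the `InstanceType` fields, the prices, the speedup factors and the simple linear cost model (runtime times hourly price, ignoring billing granularity and data transfer) are hypothetical values chosen only to show how the two metrics can favor different resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    hourly_price: float  # hypothetical price per hour
    speedup: float       # hypothetical throughput relative to a baseline machine


def time_to_solution(baseline_hours, inst):
    """Estimated wall-clock hours for a task on the given instance type."""
    return baseline_hours / inst.speedup


def cost_to_solution(baseline_hours, inst):
    """Estimated cost under a linear model: runtime multiplied by hourly price."""
    return time_to_solution(baseline_hours, inst) * inst.hourly_price


def pick_instance(baseline_hours, candidates, metric="cost"):
    """Return the candidate that minimizes the chosen metric ('cost' or 'time')."""
    score = cost_to_solution if metric == "cost" else time_to_solution
    return min(candidates, key=lambda inst: score(baseline_hours, inst))
```

With a slow, cheap instance and a fast, expensive one, the same task can be cheapest on the former and quickest on the latter, so the metric has to be fixed before resources are selected.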
High-performance integrated virtual environment (HIVE) tools and applications for big data analysis
The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is built on a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.