An evaluation of Galaxy and Ruffus-scripting workflows system for DNA-seq analysis
Magister Scientiae - MSc
Functional genomics determines the biological functions of genes on a global scale by
using large volumes of data obtained through techniques including next-generation
sequencing (NGS). The application of NGS in biomedical research is gaining in
momentum, and with its adoption becoming more widespread, there is an increasing
need for access to customizable computational workflows that can simplify, and offer
access to, computer-intensive analyses of genomic data. In this study, analysis
pipelines were designed and implemented in the Galaxy and Ruffus frameworks with a
view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework,
allows researchers to build a graphical NGS data analysis pipeline for accessible,
reproducible, and collaborative data-sharing. Ruffus, a UNIX command-line framework
used by bioinformaticians as a Python library for writing scripts in an object-oriented
style, allows a workflow to be built in terms of task dependencies and execution logic. In
this study, a dual data analysis technique was explored, focusing on a comparative
evaluation of the Galaxy and Ruffus frameworks used in composing analysis
pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the
analysis of Mycobacterium tuberculosis sequence data. Furthermore, this study aimed
to compare the Galaxy framework to Ruffus; preliminary analysis revealed that the
analysis pipeline in Galaxy displayed a higher percentage of load and store instructions,
whereas pipelines in Ruffus tended to be CPU bound and memory intensive. The
CPU usage, memory utilization, and runtime execution are graphically represented in
this study. Our evaluation suggests that workflow frameworks differ distinctly in
features ranging from ease of use, flexibility, and portability to architectural design.
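The task-dependency model that Ruffus exposes through its decorators can be sketched on a single machine with the Python standard library; the pipeline steps and dependency graph below are hypothetical illustrations, not the pipeline from the study.

```python
# Sketch of dependency-driven pipeline execution, the model Ruffus builds on.
# Task names and the dependency graph are illustrative only.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must complete before it runs.
dependencies = {
    "trim_reads": set(),
    "align_reads": {"trim_reads"},
    "call_variants": {"align_reads"},
    "annotate_variants": {"call_variants"},
}

def run_pipeline(deps):
    """Execute tasks in an order consistent with their dependencies."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        # A real pipeline task would invoke an external tool here.
        print(f"running {task}")
    return order

run_pipeline(dependencies)
```

In Ruffus itself, the same structure would be expressed by chaining decorated task functions, with the library inferring execution order from the declared dependencies.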
Framing Apache Spark in life sciences
Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing it. Indeed, such tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data, whether on on-premise HPC clusters or on cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.
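The data-parallel programming model that Spark distributes across a cluster can be illustrated with a single-machine sketch of the same transformation chain in plain Python; in PySpark, an equivalent chain would run over an RDD or DataFrame. The read counts and threshold here are made-up values for illustration.

```python
# Single-machine sketch of the map/filter/reduce chain that Spark
# distributes across cluster nodes; the data values are illustrative only.
from functools import reduce

# Hypothetical per-sample read counts from a sequencing run.
read_counts = [1200, 80, 5600, 40, 3100]

# Transformation chain: keep samples above a quality threshold,
# normalise to thousands of reads, then aggregate the result.
filtered = filter(lambda n: n >= 100, read_counts)
normalised = map(lambda n: n / 1000, filtered)
total = reduce(lambda a, b: a + b, normalised)

print(total)  # total reads (in thousands) across retained samples
```

Spark's contribution is that such chains are evaluated lazily and partitioned across machines, so the same logical program scales from a laptop to a cluster.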
The iPlant Collaborative: Cyberinfrastructure for Plant Biology
The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that aims to create an innovative, comprehensive, and foundational cyberinfrastructure in support of plant biology research (PSCIC, 2006). iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists in the use of cyberinfrastructure in research and education. Meeting humanity's projected demands for agricultural and forest products and the expectation that natural ecosystems be managed sustainably will require synergies from the application of information technologies. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input, and leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences. iPlant is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs. iPlant is sponsoring community-driven workshops addressing specific scientific questions via analysis tool integration and hypothesis testing. These workshops teach researchers how to add bioinformatics tools and/or datasets into the iPlant cyberinfrastructure, enabling plant scientists to perform complex analyses on large datasets without the need to master the command line or high-performance computational services.
High-performance integrated virtual environment (HIVE) tools and applications for big data analysis
The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute, and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools, and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is composed of a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.