An evaluation of Galaxy and Ruffus-scripting workflows system for DNA-seq analysis
Magister Scientiae - MSc
Functional genomics determines the biological functions of genes on a global scale by
using large volumes of data obtained through techniques including next-generation
sequencing (NGS). The application of NGS in biomedical research is gaining in
momentum, and with its adoption becoming more widespread, there is an increasing
need for access to customizable computational workflows that can simplify, and offer
access to, computer-intensive analyses of genomic data. In this study, analysis
pipelines were designed and implemented in the Galaxy and Ruffus frameworks with a
view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework,
allows researchers to build a graphical NGS data analysis pipeline for accessible,
reproducible, and collaborative data-sharing. Ruffus, a UNIX command-line framework
used by bioinformaticians as a Python library for writing scripts in an object-oriented
style, allows a workflow to be built in terms of task dependencies and execution logic. In
this study, a dual data analysis technique was explored, focusing on a comparative
evaluation of the Galaxy and Ruffus frameworks used in composing analysis
pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the
analysis of Mycobacterium tuberculosis sequence data. Furthermore, this study aimed
to compare the Galaxy framework to Ruffus; preliminary analysis revealed that the
analysis pipeline in Galaxy displayed a higher percentage of load and store instructions,
whereas pipelines in Ruffus tended to be CPU bound and memory intensive. The
CPU usage, memory utilization, and runtime execution are graphically represented in
this study. Our evaluation suggests that workflow frameworks differ distinctly in
features ranging from ease of use, flexibility, and portability to architectural design.
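The task-dependency model that Ruffus exposes through its decorators can be sketched on a single machine with the Python standard library; the pipeline steps and dependency graph below are hypothetical illustrations, not the pipeline from the study.

```python
# Sketch of dependency-driven pipeline execution, the model Ruffus builds on.
# Task names and the dependency graph are illustrative only.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must complete before it runs.
dependencies = {
    "trim_reads": set(),
    "align_reads": {"trim_reads"},
    "call_variants": {"align_reads"},
    "annotate_variants": {"call_variants"},
}

def run_pipeline(deps):
    """Execute tasks in an order consistent with their dependencies."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        # A real pipeline task would invoke an external tool here.
        print(f"running {task}")
    return order

run_pipeline(dependencies)
```

In Ruffus itself, the same structure would be expressed by chaining decorated task functions, with the library inferring execution order from the declared dependencies.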
Framing Apache Spark in life sciences
Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing it. Indeed, such tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data, whether on on-premise HPC clusters or on cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.
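The data-parallel programming model that Spark distributes across a cluster can be illustrated with a single-machine sketch of the same transformation chain in plain Python; in PySpark, an equivalent chain would run over an RDD or DataFrame. The read counts and threshold here are made-up values for illustration.

```python
# Single-machine sketch of the map/filter/reduce chain that Spark
# distributes across cluster nodes; the data values are illustrative only.
from functools import reduce

# Hypothetical per-sample read counts from a sequencing run.
read_counts = [1200, 80, 5600, 40, 3100]

# Transformation chain: keep samples above a quality threshold,
# normalise to thousands of reads, then aggregate the result.
filtered = filter(lambda n: n >= 100, read_counts)
normalised = map(lambda n: n / 1000, filtered)
total = reduce(lambda a, b: a + b, normalised)

print(total)  # total reads (in thousands) across retained samples
```

Spark's contribution is that such chains are evaluated lazily and partitioned across machines, so the same logical program scales from a laptop to a cluster.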
The iPlant Collaborative: Cyberinfrastructure for Plant Biology
The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that aims to create an innovative, comprehensive, and foundational cyberinfrastructure in support of plant biology research (PSCIC, 2006). iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists in the use of cyberinfrastructure in research and education. Meeting humanity's projected demands for agricultural and forest products and the expectation that natural ecosystems be managed sustainably will require synergies from the application of information technologies. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input, and leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences. iPlant is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs. iPlant is sponsoring community-driven workshops addressing specific scientific questions via analysis tool integration and hypothesis testing. These workshops teach researchers how to add bioinformatics tools and/or datasets into the iPlant cyberinfrastructure, enabling plant scientists to perform complex analyses on large datasets without the need to master the command line or high-performance computational services.
High-performance integrated virtual environment (HIVE) tools and applications for big data analysis
The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute, and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools, and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is composed of a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.