
    An evaluation of Galaxy and Ruffus-scripting workflows system for DNA-seq analysis

    Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, computationally intensive analyses of genomic data. In this study, analysis pipelines were designed and implemented in the Galaxy and Ruffus frameworks with a view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build NGS data analysis pipelines for accessible, reproducible, and collaborative data sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library for writing scripts in an object-oriented style, allows a workflow to be built in terms of task dependencies and execution logic. In this study, a dual data analysis approach was explored, focusing on a comparative evaluation of the Galaxy and Ruffus frameworks for composing analysis pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the analysis of Mycobacterium tuberculosis sequence data. Preliminary analysis revealed that the Galaxy pipeline executed a higher percentage of load and store instructions, whereas the Ruffus pipelines tended to be CPU bound and memory intensive. CPU usage, memory utilization, and execution runtime are presented graphically in this study. Our evaluation suggests that workflow frameworks differ markedly in their features, from ease of use, flexibility, and portability to architectural design.
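    Ruffus expresses a pipeline as ordinary Python functions whose file dependencies are declared with decorators, and the runner executes only out-of-date tasks. The sketch below illustrates that style; the file names, reference genome, and commands are placeholders and are not taken from the thesis pipeline.

    # Minimal Ruffus-style pipeline sketch; file names and commands are
    # illustrative placeholders, not the M. tuberculosis pipeline itself.
    import os
    from ruffus import transform, suffix, pipeline_run

    starting_files = ["sample1.fastq", "sample2.fastq"]

    @transform(starting_files, suffix(".fastq"), ".sam")
    def align_reads(input_file, output_file):
        # Align reads against a reference genome (placeholder command).
        os.system(f"bwa mem ref.fa {input_file} > {output_file}")

    @transform(align_reads, suffix(".sam"), ".bam")
    def sort_alignments(input_file, output_file):
        # Convert and sort alignments into BAM (placeholder command).
        os.system(f"samtools sort -o {output_file} {input_file}")

    if __name__ == "__main__":
        # Ruffus resolves the task graph and runs tasks whose outputs are
        # missing or older than their inputs, optionally in parallel.
        pipeline_run([sort_alignments], multiprocess=2)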

    e-Galform

    Advancing internet technologies and increasing computer processing and data transfer rates have allowed computers separated by large distances to communicate with each other and transfer amounts of data that would previously have been impractical. This has opened new opportunities, allowing university departments to share research and information via web servers and web browsers. In this thesis, I describe the development of e-Galform, an internet-based database application that allows scientists both within the University of Durham and at other universities across the globe to take advantage of Galform, a galaxy formation model developed by theoretical galaxy formation research staff at Durham. e-Galform features a web-based interface allowing users to use the capabilities of Galform without needing to understand the finer underlying technical and scientific complexities, whilst offering documentation to support further understanding. A user can extract tailored predictions from a library of pre-existing Galform runs using the e-Galform web interface. A further primary feature is the production of Galform data in a new and more verbose data format, VOTable, which may be used in other database applications and is expected to become a standardised data format for astronomical software globally. The VOTable format is under development primarily by the United States Virtual Observatory (US-VO). Rather than run the Galform simulation directly, e-Galform extracts requested galaxy properties by running an intermediate binary program (samplegals.exe) on a pre-generated Galform dataset. e-Galform is also configurable and extendible via in-built administrative facilities. The aim of these facilities is to allow users to extend the application to extract newly added galaxy properties as the underlying Galform model is extended, without requiring new code.
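    VOTable is an XML-based tabular format, so output produced in it can be consumed by standard astronomy libraries. As a rough illustration (the file name and the use of astropy are assumptions, not part of e-Galform), a downstream user could read such a file as follows.

    # Illustrative only: read a VOTable file such as one exported by e-Galform.
    # The file name below is an assumption made for this example.
    from astropy.io.votable import parse

    votable = parse("galform_output.xml")          # parse the VOTable document
    table = votable.get_first_table().to_table()   # convert to an astropy Table

    print(table.colnames)   # list the available galaxy properties
    print(table[:5])        # preview the first few rows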

    Computing environments for reproducibility: Capturing the 'Whole Tale'

    The act of sharing scientific knowledge is rapidly evolving away from traditional articles and presentations to the delivery of executable objects that integrate the data and computational details (e.g., scripts and workflows) upon which the findings rely. This envisioned coupling of data and process is essential to advancing science but faces technical and institutional barriers. The Whole Tale project aims to address these barriers by connecting computational, data-intensive research efforts with the larger research process, transforming the knowledge discovery and dissemination process into one where data products are united with research articles to create “living publications” or tales. The Whole Tale focuses on the full spectrum of science, empowering both users in the long tail of science and power users with demands for access to big data and compute resources. We report here on the design, architecture, and implementation of the Whole Tale environment.

    RepeatFS: A File System Providing Reproducibility Through Provenance and Automation

    Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which computational analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions, and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed two conceptual models and implemented them through RepeatFS, a file system that records, replicates, and verifies computational workflows with no alteration to the original methods. RepeatFS also provides provenance visualization and task automation. We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations, with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences.
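    RepeatFS captures provenance at the file-system level, so the analysed workflow itself is not modified. Purely as a conceptual sketch of the kind of record such a system maintains (this is not RepeatFS's implementation, format, or API), the snippet below runs a command and stores the command line plus checksums of its inputs and outputs, against which a later run could be verified.

    # Conceptual sketch of provenance capture and verification data
    # (not RepeatFS's actual mechanism; file names are placeholders).
    import hashlib
    import json
    import subprocess

    def sha256(path):
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()

    def run_with_provenance(command, inputs, outputs, record_path):
        subprocess.run(command, check=True)
        record = {
            "command": command,
            "inputs": {p: sha256(p) for p in inputs},
            "outputs": {p: sha256(p) for p in outputs},
        }
        with open(record_path, "w") as handle:
            json.dump(record, handle, indent=2)

    # Example usage (placeholder files):
    # run_with_provenance(["gzip", "-k", "reads.fastq"],
    #                     inputs=["reads.fastq"], outputs=["reads.fastq.gz"],
    #                     record_path="provenance.json")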

    Agile parallel bioinformatics workflow management using Pwrake

    Background: In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to meet the demand for workflow management, but such systems tend to be too heavyweight for actual bioinformatics practice. We observe that quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment, are often prioritized in scientific workflow management. These priorities align closely with the agile software development method, which iterates through development phases by trial and error. Here, we show the application of the scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, whose flexibility has been demonstrated in the astronomy domain. We therefore hypothesize that Pwrake also has advantages in actual bioinformatics workflows. Findings: We implemented Pwrake workflows to process next-generation sequencing data using the Genome Analysis Toolkit (GATK) and Dindel. The GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that, in practice, scientific workflow development iterates over two phases: the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two development phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate the modularity of the GATK and Dindel workflows. Conclusions: Pwrake enables agile management of scientific workflows in the bioinformatics domain. The internal domain-specific language design built on Ruby gives rakefiles the flexibility needed for writing scientific workflows. Furthermore, the readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows.
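    Pwrake workflows themselves are written as Ruby rakefiles; the sketch below is not Pwrake or Rake code but only illustrates, in Python, the separation the authors describe between a stable workflow definition and the run-specific parameters adjusted between iterations. The tool names and options are placeholders.

    # Pattern illustration only (not Pwrake/Rake code): keep run-specific
    # parameters apart from the task definitions so each can be iterated on
    # independently. Tool names and flags are placeholders.
    import subprocess

    # Parameter-adjustment phase: values expected to change between runs.
    PARAMS = {"reference": "ref.fa", "threads": 4}

    # Workflow-definition phase: the task structure stays fixed across runs.
    def align(sample, params):
        bam = sample.replace(".fastq", ".bam")
        subprocess.run(["aligner", "--ref", params["reference"],
                        "--threads", str(params["threads"]),
                        sample, "-o", bam], check=True)
        return bam

    def call_variants(bam, params):
        vcf = bam.replace(".bam", ".vcf")
        subprocess.run(["caller", "--ref", params["reference"],
                        bam, "-o", vcf], check=True)
        return vcf

    if __name__ == "__main__":
        for sample in ["sample1.fastq", "sample2.fastq"]:
            call_variants(align(sample, PARAMS), PARAMS)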

    DolphinNext: a distributed data processing platform for high throughput genomics

    BACKGROUND: The emergence of high-throughput technologies that produce vast amounts of genomic data, such as next-generation sequencing (NGS), is transforming biological research. The dramatic increase in the volume of data and the variety and continuous change of data processing tools, algorithms, and databases make analysis the main bottleneck for scientific discovery. The processing of high-throughput datasets typically involves many different computational programs, each of which performs a specific step in a pipeline. Given the wide range of applications and organizational infrastructures, there is a great need for highly parallel, flexible, portable, and reproducible data processing frameworks. Several platforms currently exist for the design and execution of complex pipelines; unfortunately, they lack the necessary combination of parallelism, portability, flexibility, and/or reproducibility required by the current research environment. To address these shortcomings, workflow frameworks that provide a platform to develop and share portable pipelines have recently arisen. We complement these new platforms by providing a graphical user interface to create, maintain, and execute complex pipelines. Such a platform simplifies robust and reproducible workflow creation for non-technical users and gives large organizations a robust platform on which to maintain pipelines. RESULTS: To simplify the development, maintenance, and execution of complex pipelines we created DolphinNext. DolphinNext facilitates building and deploying complex pipelines using a modular approach implemented in a graphical interface that relies on the powerful Nextflow workflow framework, by providing: (1) a drag-and-drop user interface that visualizes pipelines and allows users to create pipelines without familiarity with the underlying programming languages; (2) modules to execute and monitor pipelines in distributed computing environments such as high-performance clusters and/or the cloud; (3) reproducible pipelines with version tracking and stand-alone versions that can be run independently; (4) modular process design with process revisioning support to increase reusability and pipeline development efficiency; (5) pipeline sharing with GitHub and automated testing; and (6) extensive reports with R Markdown and Shiny support for interactive data visualization and analysis. CONCLUSION: DolphinNext is a flexible, intuitive, web-based data processing and analysis platform that enables creating, deploying, sharing, and executing complex Nextflow pipelines with extensive revisioning and interactive reporting to enhance reproducible results.

    Fine-Grained Workflow Interoperability in Life Sciences

    Recent decades have witnessed an exponential increase in available biological data due to advances in key technologies for the life sciences. Specialized computing resources and scripting skills are now required to deliver results in a timely fashion: desktop computers and monolithic approaches can no longer keep pace with either the growth of available biological data or the complexity of analysis techniques. Workflows offer an accessible way to counter this trend by facilitating the parallelization and distribution of computations. Given their structured and repeatable nature, workflows also provide a transparent process that satisfies the strict reproducibility standards required by the scientific method.
    One of the goals of our work is to assist researchers in accessing computing resources without the need for programming or scripting skills. To this end, we created a toolset able to integrate any command-line tool into workflow systems. Out of the box, our toolset supports two widely used workflow systems, and our modular design allows seamless additions to support further workflow engines. Recognizing the importance of early and robust workflow design, we also extended a well-established, desktop-based analytics platform that contains more than two thousand tasks (each a building block for a workflow), allows easy development of new tasks, and is able to integrate external command-line tools. We developed a converter plug-in that offers a user-friendly mechanism to execute workflows on distributed high-performance computing resources, an exercise that would otherwise require technical skills typically not associated with the average life scientist's profile. Our converter extension generates virtually identical versions of the same workflows, which can then be executed on more capable computing resources. That is, not only did we leverage the capacity of distributed high-performance resources and the convenience of a workflow engine designed for personal computers, but we also circumvented the computing limitations of personal computers and the steep learning curve associated with creating workflows for distributed environments. Our converter extension has immediate applications for researchers, and we showcase our results by means of three use cases relevant for life scientists: structural bioinformatics, immunoinformatics, and metabolomics.
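    As a rough, hedged illustration of wrapping an arbitrary command-line tool so that a workflow engine can treat it as a single node (this is not the authors' toolset, converter, or descriptor format; the descriptor fields, tool name, and flags are assumptions), a minimal wrapper might look like this.

    # Illustrative sketch of exposing a command-line tool as a workflow step.
    # Descriptor fields, tool name, and flags are assumptions for this example.
    import subprocess

    TOOL_DESCRIPTOR = {
        "binary": "blastp",
        "flags": {"query": "-query", "database": "-db", "output": "-out"},
    }

    def run_tool(descriptor, **arguments):
        # Build a command line from the descriptor and execute it.
        command = [descriptor["binary"]]
        for name, value in arguments.items():
            command += [descriptor["flags"][name], str(value)]
        subprocess.run(command, check=True)

    # A workflow engine could invoke the wrapper as one node, e.g.:
    # run_tool(TOOL_DESCRIPTOR, query="proteins.fasta",
    #          database="swissprot", output="hits.txt")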
    • 
