Search CORE

Reproducible and User-Controlled Software Environments in HPC with Guix

Author: C Boettiger
C Ruiz
E Jeanvoine
Luka Stanisic
M Gavish
PV Gorp
Publication date: 01/01/2015

Support teams of high-performance computing (HPC) systems often find themselves between a rock and a hard place: on one hand, they understandably administer these large systems conservatively, but on the other hand, they try to satisfy their users by deploying up-to-date tool chains as well as libraries and scientific software. HPC system users often have no guarantee that they will be able to reproduce results at a later point in time, even on the same system: software may have been upgraded, removed, or recompiled under their feet, and they have little hope of being able to reproduce the same software environment elsewhere. We present GNU Guix and the functional package management paradigm and show how it can improve reproducibility and sharing among researchers with representative use cases. Comment: 2nd International Workshop on Reproducibility in Parallel Computing (RepPar), Aug 2015, Vienna, Austria. http://reppar.org

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

MDC Repository

Reproducible genomics analysis pipelines with GNU Guix

Author: Akalin A.
Franke V.
Gosdschan A.
Osberg B.
Ronen J.
Uyar B.
Wreczycka K.
Wurmus R.
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 21/04/2018

In bioinformatics, as well as other computationally intensive research fields, there is a need for workflows that can reliably produce consistent output, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations or for the wider dissemination of workflows. Providing this type of reproducibility, however, is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines for the analysis of RNA-seq, ChIP-seq, Bisulfite-seq, and single-cell RNA-seq. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own datasets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pig

MDC Repository

PiGx: reproducible genomics analysis pipelines with GNU Guix

Author: Akalin A.
Franke V.
Gosdschan A.
Osberg B.
Ronen J.
Uyar B.
Wreczycka K.
Wurmus R.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 21/04/2018

In bioinformatics, as well as other computationally intensive research fields, there is a need for workflows that can reliably produce consistent output, from known sources, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations or for the wider dissemination of workflows. Providing this type of reproducibility and traceability, however, is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies with GNU Guix. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines called PiGx for the analysis of RNA-seq, ChIP-seq, Bisulfite-seq, and single-cell RNA-seq. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own datasets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pigx

MDC Repository

RCAS: an RNA centric annotation system for transcriptome-wide regions of interest

Author: Akalin A.
Ohler U.
Rajewsky N.
Uyar B.
Wurmus R.
Yusuf D.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 02/06/2017

In the field of RNA, the technologies for studying the transcriptome have created tremendous potential for deciphering the puzzles of RNA biology. Along with the excitement, the unprecedented volume of RNA-related omics data is creating great challenges in bioinformatics analyses. Here, we present the RNA Centric Annotation System (RCAS), an R package designed to ease the process of creating gene-centric annotations and analyses for the genomic regions of interest obtained from various RNA-based omics technologies. The design of RCAS is modular, which enables flexible usage and convenient integration with other bioinformatics workflows. RCAS is an R/Bioconductor package, but we also created graphical user interfaces, including a Galaxy wrapper and a stand-alone web service. The application of RCAS to published datasets shows that RCAS is not only able to reproduce published findings but also helps generate novel knowledge and hypotheses. The meta-gene profiles, gene-centric annotation, motif analysis and gene-set analysis provided by RCAS supply contextual knowledge that is necessary for understanding the functional aspects of different biological events that involve RNAs. In addition, the array of different interfaces and deployment options makes RCAS convenient for users of different levels. RCAS is available at http://bioconductor.org/packages/release/bioc/html/RCAS.html and http://rcas.mdc-berlin.de

MDC Repository

Language of ‘purely functional’ operating systems.

Author: Camille Akmut
Publication venue: 'Modern Language Association'
Publication date: 01/01/2022

Due to the multitude of terminologies brought on by the emergence of the “purely functional” approach in operating systems, such a document seemed warranted: some terms are brand new, others should be familiar but have been repurposed, while others yet, though established, have been replaced. Based on an extensive review of the existing literature, this language summary aims to be an entry point for researchers and others interested in this novel and active field. Its vocabulary will hopefully no longer be a hindrance to their various activities (theory or practice). *older re-uploa

Humanities Commons

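Contribution à la convergence d'infrastructure entre le calcul haute performance et le traitement de données à large échelle (Contribution to infrastructure convergence between high-performance computing and large-scale data processing)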

Author: Mercier Michael
Publication venue: HAL CCSD
Publication date: 01/07/2019

The amount of produced data, either in the scientific community or the commercial world, is constantly growing. The field of Big Data has emerged to handle large amounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of compute-intensive workloads. However, the HPC community is also facing an increasing need to process large amounts of data derived from high-definition sensors and large physics apparatus. The convergence of the two fields, HPC and Big Data, is currently taking place. In fact, the HPC community already uses Big Data tools, which are not always integrated correctly, especially at the level of the file system and the Resource and Job Management System (RJMS). In order to understand how we can leverage HPC clusters for Big Data usage, and what the challenges are for the HPC infrastructures, we have studied multiple aspects of the convergence. We initially provide a survey of software provisioning methods, with a focus on data-intensive applications. We contribute a new RJMS collaboration technique called BeBiDa, which is based on 50 lines of code, whereas similar solutions use at least 1000 times more. We evaluate this mechanism under real conditions and in a simulated environment with our simulator Batsim. Furthermore, we provide extensions to Batsim to support I/O, and showcase the development of a generic file system model along with a Big Data application model. This allows us to complement the BeBiDa real-conditions experiments with simulations while enabling us to study file system dimensioning and trade-offs. All the experiments and analyses of this work have been done with reproducibility in mind. Based on this experience, we propose to integrate the development workflow and data analysis in the reproducibility mindset, and give feedback on our experiences with a list of best practices.

PiGx: reproducible genomics analysis pipelines with GNU Guix

Author: Akalin Altuna
Franke Vedran
Gosdschan Alexander
Osberg Brendan
Ronen Jonathan
Uyar Bora
Wreczycka Katarzyna
Wurmus Ricardo
Publication date: 30/11/2018

In bioinformatics, as well as other computationally intensive research fields, there is a need for workflows that can reliably produce consistent output, from known sources, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations and for the wider dissemination of workflows. However, providing this type of reproducibility and traceability is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies with GNU Guix. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines called PiGx for the analysis of RNA sequencing, chromatin immunoprecipitation sequencing, bisulfite-treated DNA sequencing, and single-cell resolution RNA sequencing. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own datasets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. 
Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pigx Document type: Articl

Scipedia

Remote Management of Embedded Systems

Author: Malina Peter
Publication venue: Vysoké učení technické v Brně. Fakulta informačních technologií
Publication date: 01/01/2016

The capabilities of today's embedded devices are growing rapidly. Their performance allows them to run more complex applications in Internet of Things (IoT) environments. Complex applications tend to be error-prone and require continual updates. A system capable of updating a multitude of remote embedded devices was designed and implemented. The system is based on a study of existing solutions and on the requirements of the BeeeOn project, which is concerned with smart homes.

Digital library of Brno University of Technology

National Repository of Grey Literature

Reproductibilité et performance : pourquoi choisir ? (Reproducibility and performance: why choose?)

Author: Courtès Ludovic
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2022

Research processes often rely on high-performance computing (HPC), but HPC is often seen as antithetical to "reproducibility": one would have to choose between software that achieves high performance and software that can be deployed in a reproducible fashion. However, by giving up on reproducibility we would give up on verifiability, a foundation of the scientific process. How can we reconcile performance and reproducibility? This article looks at two performance-critical aspects in HPC: message passing (MPI) and CPU micro-architecture tuning. Engineering work that has gone into performance portability has already proved fruitful, but some areas remain unaddressed when it comes to CPU tuning. We propose package multi-versioning, a technique developed for GNU Guix, a tool for reproducible software deployment, and show that it allows us to implement CPU tuning without compromising on reproducibility and provenance tracking.

INRIA a CCSD electronic archive server