324 research outputs found

    Reproducible and User-Controlled Software Environments in HPC with Guix

    Support teams of high-performance computing (HPC) systems often find themselves between a rock and a hard place: on one hand, they understandably administer these large systems conservatively, but on the other hand, they try to satisfy their users by deploying up-to-date tool chains as well as libraries and scientific software. HPC system users often have no guarantee that they will be able to reproduce results at a later point in time, even on the same system: software may have been upgraded, removed, or recompiled under their feet, and they have little hope of being able to reproduce the same software environment elsewhere. We present GNU Guix and the functional package management paradigm and show how it can improve reproducibility and sharing among researchers with representative use cases. Comment: 2nd International Workshop on Reproducibility in Parallel Computing (RepPar), Aug 2015, Vienna, Austria. http://reppar.org

    Reproducible genomics analysis pipelines with GNU Guix

    In bioinformatics, as well as other computationally-intensive research fields, there is a need for workflows that can reliably produce consistent output, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations or for the wider dissemination of workflows. Providing this type of reproducibility, however, is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines for the analysis of RNA-seq, ChIP-seq, Bisulfite-seq, and single-cell RNA-seq. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own data sets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pig

    PiGx: reproducible genomics analysis pipelines with GNU Guix

    In bioinformatics, as well as other computationally-intensive research fields, there is a need for workflows that can reliably produce consistent output, from known sources, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations or for the wider dissemination of workflows. Providing this type of reproducibility and traceability, however, is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies with GNU Guix. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines called PiGx for the analysis of RNA-seq, ChIP-seq, Bisulfite-seq, and single-cell RNA-seq. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own data sets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pigx

    RCAS: an RNA centric annotation system for transcriptome-wide regions of interest

    In the field of RNA, technologies for studying the transcriptome have created tremendous potential for deciphering the puzzles of RNA biology. Along with the excitement, the unprecedented volume of RNA-related omics data is creating great challenges in bioinformatics analyses. Here, we present the RNA Centric Annotation System (RCAS), an R package designed to ease the process of creating gene-centric annotations and analyses for genomic regions of interest obtained from various RNA-based omics technologies. The design of RCAS is modular, which enables flexible usage and convenient integration with other bioinformatics workflows. RCAS is an R/Bioconductor package, but we also created graphical user interfaces, including a Galaxy wrapper and a stand-alone web service. The application of RCAS to published datasets shows that RCAS is not only able to reproduce published findings but also helps generate novel knowledge and hypotheses. The meta-gene profiles, gene-centric annotation, motif analysis and gene-set analysis provided by RCAS supply contextual knowledge that is necessary for understanding the functional aspects of different biological events that involve RNAs. In addition, the array of different interfaces and deployment options makes RCAS convenient for users at different levels. RCAS is available at http://bioconductor.org/packages/release/bioc/html/RCAS.html and http://rcas.mdc-berlin.de

    Language of ‘purely functional’ operating systems.

    Due to the multitude of terminologies brought on by the emergence of the “purely functional” approach in operating systems, such a document seemed warranted: some terms are brand new, others should be familiar but have been repurposed, while others yet, though established, have been replaced. Based on an extensive review of the existing literature, this language summary aims to be an entry point for researchers and others interested in this novel and active field. Its vocabulary will hopefully no longer be a hindrance to their various activities (theory or practice). *older re-uploa

    Contribution à la convergence d'infrastructure entre le calcul haute performance et le traitement de données à large échelle

    The amount of data produced, whether in the scientific community or the commercial world, is constantly growing. The field of Big Data has emerged to handle large amounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of compute-intensive workloads. However, the HPC community is also facing an increasing need to process large amounts of data derived from high-definition sensors and large physics apparatus. The convergence of the two fields (HPC and Big Data) is currently taking place. In fact, the HPC community already uses Big Data tools, which are not always integrated correctly, especially at the level of the file system and the Resource and Job Management System (RJMS). In order to understand how we can leverage HPC clusters for Big Data usage, and what the challenges are for HPC infrastructures, we have studied multiple aspects of the convergence. We first provide a survey of software provisioning methods, with a focus on data-intensive applications. We contribute a new RJMS collaboration technique called BeBiDa, which is based on 50 lines of code whereas similar solutions use at least 1000 times more. We evaluate this mechanism under real conditions and in a simulated environment with our simulator Batsim. Furthermore, we provide extensions to Batsim to support I/O, and showcase the development of a generic file system model along with a Big Data application model. This allows us to complement the BeBiDa real-conditions experiments with simulations while enabling us to study file system dimensioning and trade-offs. All the experiments and analysis in this work have been done with reproducibility in mind. Based on this experience, we propose to integrate the development workflow and data analysis into the reproducibility mindset, and give feedback on our experiences with a list of best practices.

    PiGx: reproducible genomics analysis pipelines with GNU Guix

    In bioinformatics, as well as other computationally intensive research fields, there is a need for workflows that can reliably produce consistent output, from known sources, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for controlled comparison between different observations and for the wider dissemination of workflows. However, providing this type of reproducibility and traceability is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which generally comes in various versions. Moreover, in many fields (bioinformatics being a prime example), these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. Here, we propose a principled approach for building analysis pipelines and managing their dependencies with GNU Guix. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines called PiGx for the analysis of RNA sequencing, chromatin immunoprecipitation sequencing, bisulfite-treated DNA sequencing, and single-cell resolution RNA sequencing. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond the use of the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own datasets or bioinformaticians seeking to automate all, or parts of, their analyses. In the long term, we hope our approach to reproducibility will serve as a blueprint for reproducible workflows in other areas. 
Our pipelines, along with their corresponding documentation and sample reports, are available at http://bioinformatics.mdc-berlin.de/pigx Document type: Articl

    Remote Management of Embedded Systems

    The capabilities of today's embedded devices are growing rapidly. Their performance allows them to run more complex applications in Internet of Things (IoT) environments. Complex applications tend to be error-prone and require continual updates. A system capable of updating a large number of remote embedded devices was designed and implemented. This system was created based on a study of existing solutions and the requirements of the BeeeOn project, which concerns itself with smart homes.

    Reproductibilité et performance : pourquoi choisir ?

    Research processes often rely on high-performance computing (HPC), but HPC is often seen as antithetical to "reproducibility": one would have to choose between software that achieves high performance and software that can be deployed in a reproducible fashion. However, by giving up on reproducibility we would give up on verifiability, a foundation of the scientific process. How can we reconcile performance and reproducibility? This article looks at two performance-critical aspects of HPC: message passing (MPI) and CPU micro-architecture tuning. Engineering work that has gone into performance portability has already proved fruitful, but some areas remain unaddressed when it comes to CPU tuning. We propose package multi-versioning, a technique developed for GNU Guix, a tool for reproducible software deployment, and show that it allows us to implement CPU tuning without compromising on reproducibility and provenance tracking.