37 research outputs found

    GO-Docker: Batch scheduling with containers

    Lightweight virtualization technologies have gained attention by offering performance and effective scalability across cloud and physical architectures. GO-Docker is a new open source batch scheduling tool that provides container support (Docker). It is based on proven technologies and tools to provide job isolation and custom images for user jobs. Its architecture scales to handle large configurations and gives end users easy access through a Web UI, CLI tools and an API for integration with external programs. Containers provide job isolation, preventing resource overlap, and ease management for cluster administrators. For the end user, they offer a choice of operating systems, pre-built configurations and possible root access to the container. Its plugin architecture eases the integration of new scheduling algorithms or other execution/control mechanisms. The software targets multi-user systems with central authentication (LDAP, ...) and shared storage (home directory, shared data, etc.) and manages Docker access for users, addressing security concerns with container access.
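    The plugin idea mentioned in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Job` fields, the `SchedulerPlugin` interface, the two policies and the `schedule` helper are hypothetical names and do not reproduce GO-Docker's actual plugin API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Job:
    """A pending job; the fields are illustrative, not GO-Docker's schema."""
    job_id: int
    user: str
    cpus: int
    submitted_at: int  # submission order or timestamp


class SchedulerPlugin(ABC):
    """Hypothetical plugin interface: decide the order of pending jobs."""

    @abstractmethod
    def order(self, pending):
        ...


class FifoScheduler(SchedulerPlugin):
    def order(self, pending):
        # Oldest submission first.
        return sorted(pending, key=lambda j: j.submitted_at)


class SmallestFirstScheduler(SchedulerPlugin):
    def order(self, pending):
        # Fewest requested CPUs first, ties broken by submission order.
        return sorted(pending, key=lambda j: (j.cpus, j.submitted_at))


# Registry of available policies; a new algorithm is added by registering a class.
PLUGINS = {"fifo": FifoScheduler, "smallest_first": SmallestFirstScheduler}


def schedule(pending, policy, free_cpus):
    """Select jobs in plugin order while they still fit in the free CPUs."""
    selected = []
    for job in PLUGINS[policy]().order(pending):
        if job.cpus <= free_cpus:
            selected.append(job)
            free_cpus -= job.cpus
    return selected
```

    Swapping the policy name changes which jobs are selected without touching the `schedule` loop, which is the property a plugin architecture is meant to provide.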

    The ReproGenomics Viewer: an integrative cross-species toolbox for the reproductive science community.

    We report the development of the ReproGenomics Viewer (RGV), a multi- and cross-species working environment for the visualization, mining and comparison of published omics data sets for the reproductive science community. The system currently embeds 15 published data sets related to gametogenesis from nine model organisms. Data sets have been curated and organized into broad categories including biological topics, technologies, species and publications. RGV's modular design for both organisms and genomic tools enables users to upload their own data and compare them, in a cross-species manner, with the data sets embedded in the system. The RGV is freely available at http://rgv.genouest.org

    Community-driven development for computational biology at Sprints, Hackathons and Codefests

    Background: Computational biology comprises a wide range of technologies and approaches. Multiple technologies can be combined to create more powerful workflows if the individuals contributing the data or providing tools for its interpretation can find mutual understanding and consensus. Much conversation and joint investigation are required in order to identify and implement the best approaches. Traditionally, scientific conferences feature talks presenting novel technologies or insights, followed up by informal discussions during coffee breaks. In multi-institution collaborations, in order to reach agreement on implementation details or to transfer deeper insights in a technology and practical skills, a representative of one group typically visits the other. However, this does not scale well when the number of technologies or research groups is large. Conferences have responded to this issue by introducing Birds-of-a-Feather (BoF) sessions, which offer an opportunity for individuals with common interests to intensify their interaction. However, parallel BoF sessions often make it hard for participants to join multiple BoFs and find common ground between the different technologies, and BoFs are generally too short to allow time for participants to program together. Results: This report summarises our experience with computational biology Codefests, Hackathons and Sprints, which are interactive developer meetings. They are structured to reduce the limitations of traditional scientific meetings described above by strengthening the interaction among peers and letting the participants determine the schedule and topics. These meetings are commonly run as loosely scheduled "unconferences" (self-organized identification of participants and topics for meetings) over at least two days, with early introductory talks to welcome and organize contributors, followed by intensive collaborative coding sessions. 
We summarise some prominent achievements of those meetings and describe differences in how these are organised, how their audience is addressed, and their outreach to their respective communities. Conclusions: Hackathons, Codefests and Sprints share a stimulating atmosphere that encourages participants to jointly brainstorm and tackle problems of shared interest in a self-driven proactive environment, as well as providing an opportunity for new participants to get involved in collaborative projects

    The BioMart community portal: an innovative alternative to large, centralized data repositories.

    The BioMart Community Portal (www.biomart.org) is a community-driven effort to provide a unified interface to biomedical databases that are distributed worldwide. The portal provides access to numerous database projects supported by 30 scientific organizations. It includes over 800 different biological datasets spanning genomics, proteomics, model organisms, cancer data, ontology information and more. All resources available through the portal are independently administered and funded by their host organizations. The BioMart data federation technology provides a unified interface to all the available data. The latest version of the portal comes with many new databases that have been created by our ever-growing community. It also comes with better support and extensibility for data analysis and visualization tools. A new addition to our toolbox, the enrichment analysis tool, is now accessible through graphical and web service interfaces. The BioMart Community Portal averages over one million requests per day. Building on this level of service and the wealth of information that has become available, the BioMart Community Portal has introduced a new, more scalable and cheaper alternative to the large data stores maintained by specialized organizations.

    An application suite based on the IFB Container as a Service platform

    IFB, the French Elixir Node, is a national service infrastructure which provides services and resources in bioinformatics [1]. IFB's goal is to offer scientific users and developers a scalable, flexible and user-friendly computation facility associated with a large storage capacity, as needed for current life science data processing. To analyze heterogeneous biological data, bioinformaticians require hundreds of different specialized software packages, including well-established tools as well as research prototypes. In addition, this software is used alone or in workflows, from GUIs or command lines, for production, tests or development. Providing an updated and complete set of tools therefore requires huge resources. To offer an efficient service for this expected diversity of usages, we propose a software architecture and a cloud model which bring solutions for tool packaging, rapid deployment and multiple-channel software distribution. We describe here the set of technical components that we built to enable a Container as a Service (CaaS) model adapted to a bioinformatics academic cloud facility.
    BioShaDock. BioShaDock [2] is the community-based container registry for bioinformatics of the French Institute of Bioinformatics. It focuses on reproducibility in bioinformatics tools or pipelines using Docker containers. Containers are automatically built in the background with security scans and metadata extraction. Metadata can include general information but also ontology terms. The BioShaDock registry already provides a large catalog of tools coming directly from users or from projects like Bioconda or Debian. The registry is open source and can be used by anyone; it is accessible by any Docker or rkt client. Computer scientists and bioinformaticians can more easily disseminate their programs and find potential users using a dedicated domain-centric Docker registry. There is a wide range of possible uses for container registries in bioinformatics: repositories managed at a community level, based on tools embedded in containers, allow users to exchange and replicate data analyses.
    GO-Docker. GO-Docker [3] is a batch computing/cluster management tool using Docker as an execution/isolation system. It is dedicated to containers and has both a command line client and a web front end. It uses Docker Swarm and Apache Mesos and is compatible with Google Kubernetes. A common concern regarding container solutions for cloud or HPC relates to potential security issues. First of all, recall that Docker relies on the Linux kernel cgroups feature, which can be used to isolate resource usage by users. Furthermore, we implemented SSL certificate and LDAP authentication in the GO-Docker REST API, required before granting access to the job scheduler that manages the nodes where containers can be run. In addition, depending on the facility audience and exposure, an even safer solution can be obtained by using virtualized computation nodes. Developers used to the command line can exploit the GO-Docker CLI, which emulates classical scheduler commands. GO-Docker has a rich REST API used by clients. The clients (Python or Java) can be used in scripts or SaaS front ends.
    Galaxy to Docker. Galaxy is a widely adopted user-friendly web front end for biological data processing. It provides powerful functionalities to enhance data analysis accessibility and reproducibility. It is currently well suited to the integration of existing command line tools and offers a large collection of bioinformatics software. However, integrating each piece of software requires the manual off-line creation of an XML descriptor and sometimes additional wrappers: it is still a technical and time-consuming task. We propose to bypass this limitation by enabling the direct execution of a command line within any ad hoc container from a trusted repository like BioShaDock, using the GO-Docker Python API. This Galaxy to Docker component makes it possible to create and use new "on demand tools" in a Galaxy instance without being an administrator and without any coding. Accordingly, advanced users can easily and quickly include custom developments in their data analysis pipelines. This results in a more flexible Galaxy environment.
    D4WP. The D4 workflow portal (D4WP) [4] is an advanced, developer-oriented SaaS environment for rapid tool and workflow design. It allows online graphical workflow and component authoring. Any command line tool or script is quickly captured and integrated using a full WYSIWYG approach. All workflow component dependencies can be defined as containers using a URI syntax. In this way a re-executable and self-contained workflow specification can be produced. D4WP integrates a GO-Docker scheduler API. From a single specification, code generation can be used to target different languages to maximize potential workflow usage and dissemination. Current developments focus on Galaxy tool generation and Common Workflow Language export.
    The presented software components allow the creation of reproducible and flexible data analysis environments for different audiences (end users and developers) and multiple purposes (production data analysis, benchmarks, workflows, tool and method development, dissemination, article publishing). All tools embedded in containers, made available in BioShaDock and scheduled with GO-Docker are directly usable in Galaxy, D4WP and the command line. We think that such an architecture limits deployment overhead and software integration cost and therefore accelerates the transfer of bioinformatics research output to production computation facilities. In a context of massive biological data production, the CaaS model offers interesting prospects. For instance, when data movement is limited by network capacity, deploying the whole CaaS environment on data production nodes may be a pragmatic solution. Furthermore, the suite of software components presented here is developed to fit the long-term objective of creating a federation of interoperable clouds. Future work will include dissemination-related features and compatibility and standardization efforts.
    References
    1. IFB cloud: The academic cloud of the French Institute of Bioinformatics. http://www.france-bioinformatique.fr/
    2. Moreews F, Sallou O, Ménager H et al. BioShaDock: a community driven bioinformatics shared Docker-based tools registry. F1000Research 2015.
    3. Sallou O, Monjeaud C. GO-Docker: Batch scheduling with containers. IEEE Cluster 2015. 2015.
    4. Moreews F. Design and share data analysis workflows. Application to bioinformatics intensive treatments. Thesis, université de Rennes 1. 2015. http://workflow.genouest.or

    Hadoopizer : a cloud environment for bio-informatics data analysis

    Biology is evolving into a big data science, particularly with the new sequencing technologies which have emerged in recent years. Cloud computing appears to be one of the answers to the rapidly increasing volume of bioinformatics data. Here we present a private cloud environment deployed on the GenOuest bioinformatics platform. After an overview of the software publicly available for bioinformatics analyses in the cloud, we present a new framework, Hadoopizer, a generic tool for the parallelisation of bioinformatics analyses in the cloud using the MapReduce paradigm. These developments are available online at this address: http://genocloud.genouest.or
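    As a reminder of what the MapReduce paradigm involves, here is a minimal single-process Python sketch that counts k-mers in a set of reads. It is only an illustration of the map, shuffle and reduce steps; the function names and the k-mer counting task are assumptions chosen for the example and are not taken from Hadoopizer.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every window of length k."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence over all reads."""
    mapped = chain.from_iterable(map_kmers(r, k) for r in reads)
    return reduce_counts(shuffle(mapped))
```

    In a real cluster the map calls run in parallel on separate chunks of the input and the shuffle moves data between nodes; here all three steps run in one process so the data flow stays visible.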

    Recherche d'instances de motifs expressifs avec Logol. Application à la modélisation d'événements de frameshift -1

    Current pattern matching tools fall short of the actual modelling needs of people analysing genome structures, which clearly demonstrates the need for higher-level languages to describe and search for these structures in genomic sequences. New tools are needed to build more expressive models of families of biological sequences on the basis of their content and structure. This article presents Logol, a new application designed to perform pattern matching in possibly large sequences with realistic biological motifs. Logol consists of both a language for describing patterns and the associated parser for effectively scanning sequences (RNA, DNA or protein) with such motifs. The language, based on a high-level grammatical formalism, can express flexible patterns (with mispairings and indels) composed of both sequential and structural elements (such as repeats or pseudoknots). A web page on the GenOuest BioInformatics Platform http://www.genouest.org/ gives access to the Logol application. It includes an interface for graphically drawing the motif model and an interface to display the resulting matches within the targeted pattern. Logol is presented through an illustrative application using a quite intricate motif model: the detection of -1 ribosomal frameshifting events in messenger RNA sequences.
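    To make "flexible patterns with substitutions and indels" concrete, the following Python sketch finds approximate occurrences of a plain string pattern using a standard dynamic-programming scan (semi-global edit distance). It is not Logol's grammar or parser, which also handle structural elements; the function name and its parameters are assumptions for illustration.

```python
def approximate_matches(pattern, text, max_errors):
    """Return (end, errors) pairs where text[:end] ends with a substring
    matching `pattern` within `max_errors` edits.

    An edit is a substitution, an insertion or a deletion.
    """
    m = len(pattern)
    # prev[i]: fewest edits to match pattern[:i] against some substring
    # ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= max_errors:
            hits.append((end, cur[m]))
        prev = cur
    return hits
```

    With `max_errors=0` the scan reports only exact occurrences; raising the threshold also reports positions where the motif appears with a substituted, inserted or deleted base.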

    Seqcrawler: biological data indexing and browsing platform.

    BACKGROUND: Seqcrawler takes its roots in software like SRS or Lucegene. It provides an indexing platform to ease the search of data and metadata in biological banks and it can scale to face the current flow of data. While many biological bank search tools are available on the Internet, mainly provided by large organizations to search their own data, there is a lack of free and open source solutions to browse one's own set of data with a flexible query system and able to scale from a single computer to a cloud system. A personal index platform will help labs and bioinformaticians to search their metadata but also to build a larger information system with custom subsets of data. RESULTS: The software is scalable from a single computer to a cloud-based infrastructure. It has been successfully tested in a private cloud with 3 index shards (pieces of the index) hosting ~400 million sequence entries (whole GenBank, UniProt, PDB and others) for a total size of 600 GB in a fault-tolerant architecture (high availability). It has also been successfully integrated with software that adds extra metadata from BLAST results to enhance the user's result analysis. CONCLUSIONS: Seqcrawler provides a complete open source search and store solution for labs or platforms needing to manage large amounts of data and metadata with a flexible and customizable web interface. All components (search engine, visualization and data storage), though independent, share a common and coherent data system that can be queried with a simple HTTP interface. The solution scales easily and can also provide a high availability infrastructure.
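    The idea of splitting an index into shards and querying them together can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions: the `Shard` and `ShardedIndex` classes, the CRC32-based placement and the whitespace tokenizer are hypothetical and do not describe Seqcrawler's actual search engine or storage layer.

```python
import zlib
from collections import defaultdict


class Shard:
    """One piece of the index: maps each term to the ids of matching entries."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())


class ShardedIndex:
    def __init__(self, n_shards=3):
        self.shards = [Shard() for _ in range(n_shards)]

    def _shard_for(self, doc_id):
        # Stable hash, so a given entry always lands on the same shard.
        return self.shards[zlib.crc32(doc_id.encode()) % len(self.shards)]

    def add(self, doc_id, text):
        self._shard_for(doc_id).add(doc_id, text)

    def search(self, term):
        # Scatter the query to every shard, then merge the partial results.
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return sorted(hits)
```

    Because each entry lives on exactly one shard, adding shards spreads the index over more machines, while a query still sees the whole collection by merging the per-shard answers.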
