
    A dataflow platform for applications based on Linked Data

    Modern software applications increasingly benefit from accessing the multifarious and heterogeneous Web of Data, thanks to the use of web APIs and Linked Data principles. In previous work, the authors proposed a platform for developing applications that consume Linked Data in a declarative and modular way. This paper describes in detail the functional language the platform gives access to, which is based on SPARQL (the standard query language for Linked Data) and on the dataflow paradigm. The language features interactive and meta-programming capabilities so that complex modules/applications can be developed. By adopting a declarative style, it favours the development of modules that can be reused in a variety of specific execution contexts.
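    A minimal sketch of the idea, not the paper's actual platform or language: each dataflow module below wraps a SPARQL query or a transformation as a lazy Python generator, so stages can be composed declaratively. The endpoint, query, and function names are illustrative assumptions.

```python
# Sketch only (not the paper's language): a SPARQL query against a public
# endpoint treated as one reusable dataflow source, whose output stream
# feeds a downstream transformation module.
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

def sparql_source(endpoint, query):
    """Dataflow source: run a SPARQL query, yield one binding dict per row."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    for row in client.query().convert()["results"]["bindings"]:
        yield {var: cell["value"] for var, cell in row.items()}

def select_var(stream, var):
    """Dataflow transform: project a single variable out of each binding."""
    for binding in stream:
        yield binding[var]

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Linked_data> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""
# Wiring modules together declaratively: each stage only names its input.
labels = select_var(sparql_source("https://dbpedia.org/sparql", QUERY), "label")
print(list(labels))
```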

    Armadillo 1.1: An Original Workflow Platform for Designing and Conducting Phylogenetic Analysis and Simulations

    In this paper we introduce Armadillo v1.1, a novel workflow platform dedicated to designing and conducting phylogenetic studies, including comprehensive simulations. A number of important phylogenetic and general bioinformatics tools have been included in the first software release. As Armadillo is an open-source project, it allows scientists to develop their own modules as well as to integrate existing computer applications. Using our workflow platform, different complex phylogenetic tasks can be modeled and presented in a single workflow without any prior knowledge of programming techniques. The first version of Armadillo was successfully used by professors of bioinformatics at Université du Québec à Montréal during graduate computational biology courses taught in 2010–11. The program and its source code are freely available at <http://www.bioinfo.uqam.ca/armadillo>.

    PRIMAGE project: predictive in silico multiscale analytics to support childhood cancer personalised evaluation empowered by imaging biomarkers

    PRIMAGE is one of the largest and most ambitious research projects dealing with medical imaging, artificial intelligence and cancer treatment in children. It is a 4-year European Commission-financed project with 16 European partners in the consortium, including the European Society for Paediatric Oncology, two imaging biobanks, and three prominent European paediatric oncology units. The project is constructed as an observational in silico study involving high-quality anonymised datasets (imaging, clinical, molecular, and genetic) for the training and validation of machine learning and multiscale algorithms. The open cloud-based platform will offer precise clinical assistance for phenotyping (diagnosis), treatment allocation (prediction), and patient endpoints (prognosis), based on the use of imaging biomarkers, tumour growth simulation, advanced visualisation of confidence scores, and machine-learning approaches. The decision-support prototype will be constructed and validated on two paediatric cancers: neuroblastoma and diffuse intrinsic pontine glioma. External validation will be performed on data recruited from independent collaborating centres. Final results will be available to the scientific community at the end of the project, ready for translation to other malignant solid tumours.

    A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines

    Background: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts.

    Results: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats).

    Conclusions: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy can also be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
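    PaPy's own API is not reproduced here, but the core notion the abstract describes, nested higher-order map functions evaluated lazily on pooled resources, can be sketched with the standard library alone; the stage functions and chunk sizes below are illustrative assumptions.

```python
# Plain-Python sketch of the dataflow idea behind PaPy (not PaPy's API):
# a pipeline as nested lazy maps evaluated on a process pool, where the
# chunksize argument tunes the parallelism / memory (laziness) trade-off.
from multiprocessing import Pool

def parse(record):
    """Stage 1: raw record -> normalised sequence string."""
    return record.strip().upper()

def gc_content(seq):
    """Stage 2: sequence -> GC fraction."""
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    records = ["acgtacgt\n", "ggccggcc\n", "attaatta\n"]
    with Pool(processes=2) as pool:
        # imap is lazy: items stream through both stages in batches
        # (chunksize) instead of being materialised all at once.
        parsed = pool.imap(parse, records, chunksize=1)
        gcs = pool.imap(gc_content, parsed, chunksize=1)
        for fraction in gcs:
            print(f"GC fraction: {fraction:.2f}")
```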

    The building and application of a semantic platform for an e-research society

    This thesis reviews the area of e-Research (the use of electronic infrastructure to support research) and considers how the insight gained from the development of social networking sites in the early 21st century might assist researchers in using this infrastructure. In particular it examines the myExperiment project, a website for e-Research that allows users to upload, share and annotate workflows and associated files, using a social networking framework. This Virtual Organisation (VO) supports many of the attributes required to allow a community of users to come together to build an e-Research society.

    The main focus of the thesis is how the emerging society that is developing out of myExperiment could use Semantic Web technologies to provide users with a significantly richer representation of their research and research processes, to better support reproducible research. One of the initial major contributions was building an ontology for myExperiment. Through this it became possible to build an API for generating and delivering this richer representation, and an interface for querying it. With this richer representation in place, it has been possible to follow Linked Data principles to link up with other projects that publish this type of representation. Doing so has allowed additional data to be provided to the user and has begun to set the data produced by myExperiment in context. The way the myExperiment project has gone about this task, and its consideration of how changes may affect existing users, is another major contribution of this thesis.

    Adding a semantic representation to an emergent e-Research society like myExperiment has given it the potential to support additional applications, in particular Research Objects: encapsulations of a scientist's research or research process that support reproducibility. The insight gained by adding a semantic representation to myExperiment has allowed this thesis to contribute towards the design of the architecture for these Research Objects, which use similar Semantic Web technologies.

    The myExperiment ontology has been designed so that it can be aligned with other ontologies. Scientific discourse, the collaborative argumentation of different claims and hypotheses, supported by evidence from experiments, to construct, confirm or disprove theories, requires the capability to represent experiments carried out in silico. This thesis discusses how, as part of the HCLS Scientific Discourse subtask group, the myExperiment ontology has begun to be aligned with other scientific discourse ontologies to provide this capability. It also compares this alignment of ontologies with the architecture for Research Objects.

    This thesis also examines how myExperiment's Linked Data, and that of other projects, can be used in the design of novel interfaces. As a theoretical exercise, it considers how this Linked Data might be used to support a question-answering system that would allow users to query myExperiment's data in a more efficient and user-friendly way.

    It concludes by reviewing all the steps undertaken to provide a semantic platform for an emergent e-Research society, to facilitate the sharing of research and its processes in support of reproducible research. It assesses their contribution to enhancing the features provided by myExperiment, as well as e-Research as a whole. It considers how the contributions of this thesis could be extended to produce additional tools that will allow researchers to make greater use of the rich data now available, in a way that enhances their research process rather than significantly changing it or adding extra workload.
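    As a rough illustration of what querying such a semantic representation looks like, the sketch below runs a SPARQL query for workflow titles. The endpoint URL and the mebase: ontology terms are assumptions in the spirit of the project and may not match the live (or still existing) service.

```python
# Illustrative only: a SPARQL query over a myExperiment-style Linked Data
# representation. The endpoint URL and the mebase: ontology terms are
# ASSUMPTIONS for this sketch; the real service may differ or be retired.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://rdf.myexperiment.org/sparql"   # assumed historical endpoint
QUERY = """
PREFIX mebase: <http://rdf.myexperiment.org/ontologies/base/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?workflow ?title WHERE {
  ?workflow a mebase:Workflow ;
            dcterms:title ?title .
} LIMIT 10
"""

client = SPARQLWrapper(ENDPOINT)
client.setQuery(QUERY)
client.setReturnFormat(JSON)
# Print each workflow URI with its human-readable title.
for row in client.query().convert()["results"]["bindings"]:
    print(row["workflow"]["value"], "-", row["title"]["value"])
```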

    Robust workflows for science and engineering


    Design considerations for workflow management systems use in production genomics research and the clinic

    The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach to, and systematic evaluation of, key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL (and some of their executors), along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, both run locally, on an HPC cluster, and in the cloud. This allowed the four WfMSs to be evaluated in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, and ease of development, along with adoption and usage in research labs and healthcare settings. This article attempts to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry- and wet-lab scientists, the choice is also governed by collaborations and adoption within large consortia, and by the technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations in tools and utilities for other purposes, such as big data technologies, interoperability, and provenance.
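    No single system's syntax is reproduced here; as a neutral sketch of the core job every WfMS in the comparison performs, the code below models a variant-calling-style pipeline as an explicit task DAG and executes the stages in dependency order. Stage names and commands are hypothetical placeholders.

```python
# Engine-agnostic sketch of the core WfMS job: represent analysis stages
# as a DAG and run each stage only after its predecessors finish. The
# stage names and shell commands are hypothetical, not any real pipeline.
from graphlib import TopologicalSorter   # Python 3.9+
import subprocess

TASKS = {
    "align":  {"after": [],        "cmd": "echo aligning reads"},
    "sort":   {"after": ["align"], "cmd": "echo sorting alignments"},
    "call":   {"after": ["sort"],  "cmd": "echo calling variants"},
    "report": {"after": ["call"],  "cmd": "echo writing report"},
}

# TopologicalSorter takes {node: predecessors} and yields a valid order.
order = TopologicalSorter({name: spec["after"] for name, spec in TASKS.items()})
for name in order.static_order():
    print(f"[workflow] running stage: {name}")
    subprocess.run(TASKS[name]["cmd"], shell=True, check=True)
```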

    Technical Report: CSVM Ecosystem

    The CSVM format is derived from the CSV format and allows the storage of tabular data with a limited but extensible amount of metadata. This approach can help computer scientists because all the information needed to subsequently use the data is included in the CSVM file; it is particularly well suited to handling RAW data in many scientific fields and to serving as a canonical format. The use of CSVM has shown that it greatly facilitates: data management independently of databases; data exchange; the integration of RAW data into dataflows or calculation pipelines; and the search for best practices in RAW data management. The efficiency of this format is closely related to its plasticity: a generic frame is given for all kinds of data, and CSVM parsers make no interpretation of data types. That task is done by the application layer, so the same format and the same parser code can be used for many purposes. This document presents implementations of the CSVM format accumulated over ten years in different laboratories. Some programming examples are also shown: a Python toolkit for reading, manipulating, and querying the format is available. A first specification of the format (CSVM-1) is now defined, as well as some derivatives such as CSVM dictionaries used for data interchange. CSVM is an Open Format and could serve as a support for Open Data and for the long-term conservation of RAW or unpublished data.
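    The exact CSVM-1 syntax is defined in the report itself; the sketch below assumes a simplified layout (a '#'-prefixed key: value metadata block followed by an ordinary CSV body) purely to show the division of labour the abstract describes: the parser interprets no types, and the application layer gives the strings meaning.

```python
# Simplified sketch of a CSVM-style reader. The real CSVM-1 syntax is
# specified in the report; here we ASSUME a '#'-prefixed "key: value"
# metadata header followed by an ordinary CSV body. As the abstract says,
# the parser does no type interpretation: every cell stays a string, and
# the application layer decides what metadata and values mean.
import csv
import io

SAMPLE = """\
# format: CSVM-1 (assumed layout)
# instrument: spectrometer-42
# units: nm, counts
wavelength,intensity
400,12.5
410,13.1
"""

def read_csvm(text):
    metadata, body = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line.lstrip("# ").partition(":")
            metadata[key.strip()] = value.strip()
        else:
            body.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(body))))
    return metadata, rows   # all values are raw strings, by design

meta, rows = read_csvm(SAMPLE)
print(meta["instrument"])                  # application layer reads metadata
print([r["wavelength"] for r in rows])     # ...and interprets the values
```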