
    On the construction of decentralised service-oriented orchestration systems

    Modern science relies on workflow technology to capture, process, and analyse data obtained from scientific instruments. Scientific workflows are precise descriptions of experiments in which multiple computational tasks are coordinated based on the dataflows between them. Orchestrating scientific workflows presents a significant research challenge: they are typically executed so that all data pass through a centralised computer server known as the engine, which causes unnecessary network traffic and leads to a performance bottleneck. These workflows are commonly composed of services that perform computation over geographically distributed resources, and involve the management of dataflows between them. Centralised orchestration is clearly not a scalable approach for coordinating services dispersed across distant geographical locations. This thesis presents a scalable decentralised service-oriented orchestration system that relies on a high-level data coordination language for the specification and execution of workflows. This system's architecture consists of distributed engines, each of which is responsible for executing part of the overall workflow. It exploits parallelism in the workflow by decomposing it into smaller sub-workflows, and determines the most appropriate engines to execute them using computation placement analysis. This permits the workflow logic to be distributed closer to the services providing the data for execution, which reduces the overall data transfer in the workflow and improves its execution time. This thesis provides an evaluation of the presented system, which concludes that decentralised orchestration provides scalability benefits over centralised orchestration and improves the overall performance of executing a service-oriented workflow.
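The placement idea in this abstract can be sketched in a few lines: group tasks by the site hosting the service that feeds them, so each sub-workflow runs on an engine near its data rather than routing everything through one central engine. This is a minimal illustration under assumed names (`partition_workflow`, the task and site labels), not the thesis's actual implementation.

```python
# Hypothetical sketch of computation placement: assign each workflow task
# to the site of the service that produces its input data, so dataflow
# stays local instead of passing through a central orchestration engine.
from collections import defaultdict

def partition_workflow(tasks, service_location):
    """Group tasks into sub-workflows keyed by the site of their input service.

    tasks: list of (task_name, input_service) pairs
    service_location: maps a service name to the site hosting it
    """
    sub_workflows = defaultdict(list)
    for task, service in tasks:
        site = service_location[service]   # place the task near its data source
        sub_workflows[site].append(task)
    return dict(sub_workflows)

tasks = [("align", "genome_db"), ("filter", "genome_db"), ("plot", "stats_svc")]
locations = {"genome_db": "site-A", "stats_svc": "site-B"}
print(partition_workflow(tasks, locations))
# {'site-A': ['align', 'filter'], 'site-B': ['plot']}
```

Each resulting group would then be handed to the engine at that site; only the (smaller) results cross between sites.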

    Interactive Visual Analysis of Networked Systems: Workflows for Two Industrial Domains

    We report on a first study of interactive visual analysis of networked systems. Working with ABB Corporate Research and Ericsson Research, we have created workflows which demonstrate the potential of visualization in the domains of industrial automation and telecommunications. By a workflow in this context, we mean a sequence of visualizations and the actions for generating them. Visualizations can be any images that represent properties of the data sets analyzed, and actions typically either change the selection of data visualized or change the visualization through a choice of technique or a change of parameters.

    Methods for enhancing the reproducibility of biomedical research findings using electronic health records.

    BACKGROUND: The ability of external investigators to reproduce published scientific findings is critical for the evaluation and validation of biomedical research by the wider community. However, a substantial proportion of health research using electronic health records (EHR), data collected and generated during clinical care, is potentially not reproducible, mainly because the implementation details of most data preprocessing, cleaning, phenotyping and analysis approaches are not systematically made available or shared. With the complexity, volume and variety of electronic health record data sources made available for research steadily increasing, it is critical to ensure that scientific findings from EHR data are reproducible and replicable by researchers. Reporting guidelines, such as RECORD and STROBE, have set a solid foundation by recommending a series of items for researchers to include in their research outputs. Researchers, however, often lack the technical tools and methodological approaches to actuate such recommendations in an efficient and sustainable manner. RESULTS: In this paper, we review and propose a series of methods and tools utilized in adjacent scientific disciplines that can be used to enhance the reproducibility of research using electronic health records and enable researchers to report analytical approaches in a transparent manner. Specifically, we discuss the adoption of scientific software engineering principles and best practices such as test-driven development, source code revision control systems, literate programming and the standardization and re-use of common data management and analytical approaches. CONCLUSION: The adoption of such approaches will enable scientists to systematically document and share EHR analytical workflows and increase the reproducibility of biomedical research using such complex data sources.
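To make the test-driven development recommendation concrete, here is an illustrative unit-tested phenotyping rule. The function name, diagnosis codes and threshold are hypothetical examples, not taken from the paper; the point is that the tests document the phenotype definition and catch silent changes to it.

```python
# Illustrative only: a unit-tested EHR phenotyping rule, in the spirit of
# the test-driven development practice the paper recommends. The ICD-10
# codes and HbA1c threshold below are hypothetical examples.

def has_diabetes_phenotype(codes, hba1c):
    """Flag a record as matching the phenotype if it carries an ICD-10
    E11* diagnosis code or an HbA1c at or above 6.5%."""
    return any(c.startswith("E11") for c in codes) or hba1c >= 6.5

# The tests double as a precise, shareable phenotype specification.
assert has_diabetes_phenotype(["E11.9"], hba1c=5.0)       # code match
assert has_diabetes_phenotype([], hba1c=6.5)              # lab match
assert not has_diabetes_phenotype(["I10"], hba1c=5.4)     # no match
```

Kept under revision control alongside the analysis code, such tests let external investigators re-run the exact phenotype definition used in a study.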

    The BioLighthouse: Reusable Software Design for Bioinformatics

    Advances in next-generation sequencing have accelerated the field of microbiology by making accessible a wealth of information about microbiomes. Unfortunately, microbiome experiments are among the least reproducible in terms of bioinformatics. Software tools are often poorly documented, under-maintained, and commonly have arcane dependencies requiring significant time investment to configure them correctly. Microbiome studies are multidisciplinary efforts, but communication and knowledge discrepancies make accessibility, reproducibility, and transparency of computational workflows difficult. The BioLighthouse uses Ansible roles, playbooks, and modules to automate the configuration and execution of bioinformatics workflows. The roles and playbooks act as virtual laboratory notebooks by documenting the provenance of a bioinformatics workflow. The BioLighthouse was tested for platform dependence and data-scale dependence with a microbial profiling pipeline consisting of Cutadapt, FLASH2, and DADA2. The pipeline was tested on 3 canola root and soil microbiome datasets with differing orders of magnitude of data: 1 sample, 10 samples, and 100 samples. Each dataset was processed by The BioLighthouse with 10 unique parameter sets, and outputs were compared across 8 computing environments for a total of 240 pipeline runs. Outputs after each step in the pipeline were tested for identity using the Linux diff command to ensure reproducible results. Testing of The BioLighthouse suggested no platform or data-scale dependence. To provide an easy way of maintaining environment reproducibility in user-space, Conda and the channel Bioconda were used for virtual environments and software dependencies when configuring bioinformatics tools. The BioLighthouse provides a framework for developers to make their tools accessible to the research community, for bioinformaticians to build bioinformatics workflows, and for the broader research community to consume these tools at a high level while knowing the tools will execute as intended.
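The cross-environment identity check described above, comparing each step's output across computing environments with diff, can be sketched as follows. This is not BioLighthouse code; the function names are ours, and hashing is used here as a stand-in for a byte-level diff.

```python
# Sketch (not BioLighthouse code): verify that a pipeline step produced
# byte-identical output in every computing environment, analogous to the
# paper's use of the Linux diff command after each pipeline step.
import hashlib

def digest(data: bytes) -> str:
    """Content fingerprint; equal digests imply equal bytes in practice."""
    return hashlib.sha256(data).hexdigest()

def outputs_identical(outputs):
    """True when every environment produced exactly the same bytes."""
    return len({digest(data) for data in outputs}) == 1

# Outputs of one step collected from three environments:
env_outputs = [b"ASV_table_v1", b"ASV_table_v1", b"ASV_table_v1"]
print(outputs_identical(env_outputs))  # True
```

Running such a check after every step localises any divergence to the first non-identical stage, which is why per-step comparison is more informative than comparing only final outputs.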

    Ontology of core data mining entities

    In this article, we present OntoDM-core, an ontology of core data mining entities. OntoDM-core defines the most essential data mining entities in a three-layered ontological structure comprising a specification, an implementation and an application layer. It provides a representational framework for the description of mining structured data, and in addition provides taxonomies of datasets, data mining tasks, generalizations, data mining algorithms and constraints, based on the type of data. OntoDM-core is designed to support a wide range of applications/use cases, such as semantic annotation of data mining algorithms, datasets and results; annotation of QSAR studies in the context of drug discovery investigations; and disambiguation of terms in text mining. The ontology has been thoroughly assessed following the practices in ontology engineering, is fully interoperable with many domain resources and is easy to extend.
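The three-layer structure mentioned in the abstract can be illustrated with a small sketch: a specification-layer entity (an algorithm), an implementation-layer entity realising it, and an application-layer entity recording one execution on a dataset. The class and attribute names here are our own illustration, not OntoDM-core's actual class labels.

```python
# Hypothetical illustration of a three-layered ontology structure
# (specification / implementation / application); the names below are
# invented for illustration and are not OntoDM-core identifiers.
from dataclasses import dataclass

@dataclass
class AlgorithmSpecification:        # specification layer
    name: str
    task: str                        # e.g. the data mining task addressed

@dataclass
class AlgorithmImplementation:       # implementation layer
    implements: AlgorithmSpecification
    software: str                    # concrete executable realisation

@dataclass
class AlgorithmApplication:          # application layer
    executes: AlgorithmImplementation
    dataset: str                     # the data the implementation was run on

spec = AlgorithmSpecification("C4.5", task="classification")
impl = AlgorithmImplementation(spec, software="weka.classifiers.trees.J48")
run = AlgorithmApplication(impl, dataset="iris")
print(run.executes.implements.name)  # C4.5
```

Separating the layers this way lets one algorithm specification be linked to many implementations, and each implementation to many recorded applications.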