
    BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments

    Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. The framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific-domain perspectives, using queries to a provenance and annotation database; some of these queries are available as pre-built features of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the case studies' execution time by up to 98%. We also show how applying machine learning techniques can enrich the analysis process.
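    The query layer described above is easy to picture with a small sketch. The snippet below shows the kind of aggregation such a web application might abstract; the SQLite schema (a task_execution table with per-task timestamps) is a hypothetical illustration, not BioWorkbench's actual provenance model.

```python
import sqlite3

# Minimal sketch of a provenance query of the kind the web application
# abstracts. The schema (a `task_execution` table with workflow, activity,
# and start/end timestamps) is assumed for illustration only.
conn = sqlite3.connect("provenance.db")

query = """
SELECT workflow_name,
       activity_name,
       COUNT(*)                   AS invocations,
       SUM(end_time - start_time) AS total_seconds,
       AVG(end_time - start_time) AS avg_seconds
FROM task_execution
GROUP BY workflow_name, activity_name
ORDER BY total_seconds DESC;
"""

# Report where each workflow spends its time, activity by activity.
for workflow, activity, n, total, avg in conn.execute(query):
    print(f"{workflow}/{activity}: {n} runs, {total:.0f}s total, {avg:.1f}s avg")

conn.close()
```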

    Technical Report: A Trace-Based Performance Study of Autoscaling Workloads of Workflows in Datacenters

    To improve customer experience, datacenter operators offer support for simplifying application and resource management. For example, running workloads of workflows on behalf of customers is desirable, but requires increasingly sophisticated autoscaling policies, that is, policies that dynamically provision resources for the customer. Although selecting and tuning autoscaling policies is a challenging task for datacenter operators, relatively few studies so far investigate the performance of autoscaling for workloads of workflows. Complementing previous knowledge, in this work we propose the first comprehensive performance study in the field. Using trace-based simulation, we compare state-of-the-art autoscaling policies across multiple application domains, workload arrival patterns (e.g., burstiness), and system utilization levels. We further investigate the interplay between autoscaling and regular allocation policies, and the complexity cost of autoscaling. Our quantitative study focuses not only on traditional performance metrics and on state-of-the-art elasticity metrics, but also on time- and memory-related autoscaling-complexity metrics. Our main results give strong, quantitative evidence about previously unreported operational behavior, for example, that autoscaling policies perform differently across application domains, and by how much they differ.
    Comment: Technical report for the CCGrid 2018 submission "A Trace-Based Performance Study of Autoscaling Workloads of Workflows in Datacenters".
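    The core idea of trace-based autoscaling simulation can be sketched in a few lines: replay a demand trace, let a policy adjust supply each tick, and accumulate elasticity-style metrics. The trace, the threshold policy, and the metrics below are illustrative assumptions, not the policies or metrics evaluated in the study.

```python
# Toy trace-driven autoscaling simulation with a simple threshold policy.
demand_trace = [3, 5, 9, 14, 8, 4, 2, 6, 12, 7]  # eligible tasks per tick (assumed)

def threshold_policy(demand, supply, scale_up_at=0.8, scale_down_at=0.3):
    """Scale up when utilization is high, down when it is low."""
    utilization = demand / max(supply, 1)
    if utilization > scale_up_at:
        return supply + 2            # provision two more machines
    if utilization < scale_down_at:
        return max(supply - 1, 1)    # release one, keep at least one
    return supply

supply = 4
over, under = 0, 0  # accumulated over-/under-provisioning, in machine-ticks
for demand in demand_trace:
    supply = threshold_policy(demand, supply)
    over += max(supply - demand, 0)
    under += max(demand - supply, 0)

print(f"final supply={supply}, over-provisioned={over}, under-provisioned={under}")
```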

    A Taxonomy of Workflow Management Systems for Grid Computing

    With the advent of Grid and application technologies, scientists and engineers are building ever more complex applications to manage and process large data sets and to execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies the various approaches to building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art Grid workflow systems, but also identifies areas that need further research.
    Comment: 29 pages, 15 figures.
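    A common abstract model shared by many of the systems such a taxonomy covers is the workflow as a directed acyclic graph (DAG) of tasks, which the scheduler dispatches in dependency order. The sketch below illustrates that model; the task names and dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A workflow as a DAG: each task maps to the set of tasks it depends on.
# The pipeline below (fetch -> align/qc -> assemble -> annotate/report)
# is a hypothetical example, not one from the surveyed systems.
workflow = {
    "fetch":    set(),
    "align":    {"fetch"},
    "qc":       {"fetch"},
    "assemble": {"align"},
    "annotate": {"assemble"},
    "report":   {"assemble", "qc"},
}

# A Grid scheduler would dispatch each ready task to a remote resource;
# here we simply print one valid execution order.
for task in TopologicalSorter(workflow).static_order():
    print("execute:", task)
```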

    Online platform for building, testing and deploying predictive models

    Machine Learning (ML) and Artificial Intelligence (AI) models have traditionally been built and deployed manually on a single machine, using tools such as R or Weka. Times are changing: in the era of real-time services and big data, these methods are becoming obsolete, as they severely limit the applicability and deployability of ML. Companies such as Microsoft, Amazon, and Google have tried to mitigate this problem by developing MLaaS (Machine Learning as a Service) solutions, online platforms capable of scaling and automating the development of predictive models. Although some ML platforms available in the cloud enable users to develop and deploy ML processes, they are not suitable for rapidly prototyping and deploying predictive models, as several complex steps must be completed before users can begin: configuring environments, setting up accounts, and overcoming a steep learning curve. This research project presents MLINO, a concept for an online platform that allows users to rapidly prototype and deploy basic ML processes in an intuitive and easy way. Although the implementation of the prototype was not optimal, due to software and infrastructure limitations, a series of experiments demonstrated that its final performance was satisfactory. When benchmarking the devised solution against Microsoft Azure ML, the results showed that the MLINO tool is easier to use and takes less time to build and deploy a basic predictive model.
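    The build-then-deploy cycle that such a platform automates reduces, at its core, to training a model, serializing it, and loading it in a serving process. The sketch below shows that cycle with scikit-learn; the dataset, model choice, and file name are illustrative assumptions, not MLINO's actual implementation.

```python
# Minimal sketch of the "build" and "deploy" steps a platform like MLINO
# automates, using scikit-learn's bundled iris dataset for illustration.
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Build": train the model and serialize it as a deployable artifact.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# "Deploy": a serving process loads the artifact and answers requests.
with open("model.pkl", "rb") as f:
    served = pickle.load(f)
print("held-out accuracy:", served.score(X_test, y_test))
```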

    Concurrent software architectures for exploratory data analysis

    Decades ago, increasing volumes of data made manual analysis obsolete and prompted the use of computational tools with interactive user interfaces and a rich palette of data visualizations. Yet their classic, desktop-based architectures can no longer cope with the ever-growing size and complexity of data. Next-generation systems for exploratory data analysis will be built on client–server architectures, which already run concurrent software for data analytics but are not tailored for engaged, interactive analysis of data and models. In exploratory data analysis, the key is the responsiveness of the system and the prompt construction of interactive visualizations that can guide users to uncover interesting data patterns. In this study, we review current software architectures for distributed data analysis and propose a list of features to be included in the next generation of frameworks for exploratory data analysis. The new generation of tools will need to address integrated data storage and processing, fast prototyping of data analysis pipelines supported by machine-proposed analysis workflows, preemptive analysis of data, interactivity, and user interfaces for intelligent data visualizations. These systems will rely on a mixture of concurrent software architectures to meet the challenge of seamlessly integrating exploratory data interfaces on the client side with the management of concurrent data mining procedures on the servers.
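    One concurrency pattern behind this responsiveness argument is running analyses in a server-side worker pool while the client stays interactive, cancelling a stale request when the user refines a query before its result arrives. The sketch below illustrates that pattern; the analyze function is a stand-in for a real data mining procedure.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(dataset_id):
    # Stand-in for an expensive data mining procedure on the server.
    return sum(i * i for i in range(10_000)) + dataset_id

executor = ThreadPoolExecutor(max_workers=4)

pending = executor.submit(analyze, dataset_id=1)
# The user refines the query before the first result arrives, so the
# stale analysis is dropped (cancel succeeds only if it has not started).
pending.cancel()
pending = executor.submit(analyze, dataset_id=2)

print("result:", pending.result())  # the client polls or awaits asynchronously
executor.shutdown()
```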

    A proposed case for the cloud software engineering in security

    This paper presents the Cloud Software Engineering in Security (CSES) proposal, which combines the benefits of good software engineering processes with those of security. As the existing literature does not yet offer such a proposal for Cloud security, we use Business Process Modeling Notation (BPMN) to illustrate the concept of CSES across its design, implementation, and test phases. BPMN can be used to raise alarms for protecting Cloud security in a real-world scenario in real time. Results from BPMN simulations show that a long execution time of 60 hours is required to protect the real-time security of 2 petabytes (PB) of data. When the data is not in use, the simulations show that the execution time for securing all data falls off rapidly. We demonstrate a proposal for dealing with Cloud security and aim to improve its current performance for Big Data.
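    As a quick sanity check on the scale of that figure, 60 hours for 2 PB implies a sustained processing rate of roughly 9 GB/s, as the short calculation below shows. The decimal byte convention (10^15 bytes per PB) is an assumption.

```python
# Back-of-the-envelope throughput implied by the reported figure:
# protecting 2 PB in 60 hours.
data_bytes = 2 * 10**15   # 2 PB, assuming decimal petabytes
seconds = 60 * 3600       # 60 hours

throughput = data_bytes / seconds
print(f"~{throughput / 10**9:.1f} GB/s sustained")  # ~9.3 GB/s
```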