410 research outputs found

    ProvMark: A Provenance Expressiveness Benchmarking System

    System-level provenance is of widespread interest for applications such as security enforcement and information protection. However, testing the correctness or completeness of provenance capture tools is challenging and currently done manually. In some cases there is not even a clear consensus about what behavior is correct. We present an automated tool, ProvMark, that uses an existing provenance system as a black box and reliably identifies the provenance graph structure recorded for a given activity, by a reduction to subgraph isomorphism problems handled by an external solver. ProvMark is a beginning step in the much-needed area of testing and comparing the expressiveness of provenance systems. We demonstrate ProvMark's usefulness in comparing three capture systems with different architectures and distinct design philosophies. Comment: To appear, Middleware 201

    A Markov model for inferring flows in directed contact networks

    Directed contact networks (DCNs) are a particularly flexible and convenient class of temporal networks, useful for modeling and analyzing the transfer of discrete quantities in communications, transportation, epidemiology, etc. Transfers modeled by contacts typically underlie flows that associate multiple contacts based on their spatiotemporal relationships. To infer these flows, we introduce a simple inhomogeneous Markov model associated to a DCN and show how it can be effectively used for data reduction and anomaly detection through an example of kernel-level information transfers within a computer. Comment: 12 page

    Analysing system behaviour by automatic benchmarking of system-level provenance

    Provenance is a term that originates in the art world, where it denotes the documented history of a piece of art from its creation to its current status, including storage locations, ownership, and purchase prices. It has a very similar meaning in data processing and computer science, where it serves as the lineage of data, supporting either reproducibility or the tracing of activities at runtime. As in art, provenance in computer science describes how a piece of data was created, passed around, modified, and brought to its current state. It also records who is responsible for particular activities, along with other related information, and so acts as metadata on components in a computer environment. Because provenance aims to record all information related to some data, its size is generally proportional to the amount of data processing that took place; it tends to be large and hard to analyse. Moreover, not all of the collected information is useful for every purpose. For example, if we just want to trace all previous owners of a file, then all the storage location information may be ignored. To capture useful information without handling a large amount of data, researchers and developers have built different provenance recording tools that record only the information needed by particular applications, using different means and mechanisms throughout the system. This yields a lighter set of information for analysis, but it also results in non-standard provenance information, and general users may not have a clear view of which tools are better for which purposes. For example, if we want to determine, for security analysis, whether certain action sequences have been performed in a process and who is accountable for them, we do not know which tools can be trusted to provide the correct information. The tools are also hard to compare, as there are few common standards. With this need in mind, this thesis concentrates on providing an automated system, ProvMark, to benchmark the tools. This helps to show the strengths and weaknesses of their provenance results in different scenarios. It also allows tool developers to verify their tools and allows end users to compare the tools on equal terms and choose a suitable one for their purpose. As a whole, benchmarking the expressiveness of the tools in different scenarios shows which provenance tool is the right choice for a specific use.

    A Survey on Array Storage, Query Languages, and Systems

    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention. Comment: 44 page

    Design considerations for workflow management systems use in production genomics research and the clinic

    The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article tries to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.

    A survey of general-purpose experiment management tools for distributed systems

    In the field of large-scale distributed systems, experimentation is particularly difficult. The studied systems are complex, often nondeterministic and unreliable, software is plagued with bugs, and the experiment workflows are unclear and hard to reproduce. These obstacles led many independent researchers to design tools to control their experiments, boost productivity and improve quality of scientific results. Despite much research in the domain of distributed systems experiment management, the current fragmentation of efforts calls for a general analysis. We therefore propose to build a framework to uncover missing functionality of these tools, enable meaningful comparisons between them and find recommendations for future improvements and research. The contribution of this paper is twofold. First, we provide an extensive list of features offered by general-purpose experiment management tools dedicated to distributed systems research on real platforms. We then use it to assess existing solutions and compare them, outlining possible future paths for improvements.

    The Vadalog System: Datalog-based Reasoning for Knowledge Graphs

    Over the past years, there has been a resurgence of Datalog-based systems in the database community as well as in industry. In this context, it has been recognized that to handle the complex knowledge-based scenarios encountered today, such as reasoning over large knowledge graphs, Datalog has to be extended with features such as existential quantification. Yet, Datalog-based reasoning in the presence of existential quantification is in general undecidable. Many efforts have been made to define decidable fragments. Warded Datalog+/- is a very promising one, as it captures PTIME complexity while allowing ontological reasoning. Yet so far, no implementation of Warded Datalog+/- was available. In this paper we present the Vadalog system, a Datalog-based system for performing complex logic reasoning tasks, such as those required in advanced knowledge graphs. The Vadalog system is Oxford's contribution to the VADA research programme, a joint effort of the universities of Oxford, Manchester and Edinburgh and around 20 industrial partners. As the main contribution of this paper, we illustrate the first implementation of Warded Datalog+/-, a high-performance Datalog+/- system utilizing an aggressive termination control strategy. We also provide a comprehensive experimental evaluation. Comment: Extended version of VLDB paper <https://doi.org/10.14778/3213880.3213888>

    Special Issue on High-Level Declarative Stream Processing

    Stream processing as an information processing paradigm has been investigated by various research communities within computer science and appears in various applications: realtime analytics, online machine learning, continuous computation, ETL operations, and more. The special issue on "High-Level Declarative Stream Processing" investigates the declarative aspects of stream processing, a topic undergoing intense study. It is published in the Open Journal of Web Technologies (OJWT) (www.ronpub.com/ojwt). This editorial provides an overview of the aims and scope of the special issue and of the accepted papers.