22 research outputs found

    High-Fidelity Provenance:Exploring the Intersection of Provenance and Security

    Get PDF
    In the past 25 years, the World Wide Web has disrupted the way news are disseminated and consumed. However, the euphoria for the democratization of news publishing was soon followed by scepticism, as a new phenomenon emerged: fake news. With no gatekeepers to vouch for it, the veracity of the information served over the World Wide Web became a major public concern. The Reuters Digital News Report 2020 cites that in at least half of the EU member countries, 50% or more of the population is concerned about online fake news. To help address the problem of trust on information communi- cated over the World Wide Web, it has been proposed to also make available the provenance metadata of the information. Similar to artwork provenance, this would include a detailed track of how the information was created, updated and propagated to produce the result we read, as well as what agents—human or software—were involved in the process. However, keeping track of provenance information is a non-trivial task. Current approaches, are often of limited scope and may require modifying existing applications to also generate provenance information along with thei regular output. This thesis explores how provenance can be automatically tracked in an application-agnostic manner, without having to modify the individual applications. We frame provenance capture as a data flow analysis problem and explore the use of dynamic taint analysis in this context. Our work shows that this appoach improves on the quality of provenance captured compared to traditonal approaches, yielding what we term as high-fidelity provenance. We explore the performance cost of this approach and use deterministic record and replay to bring it down to a more practical level. Furthermore, we create and present the tooling necessary for the expanding the use of using deterministic record and replay for provenance analysis. The thesis concludes with an application of high-fidelity provenance as a tool for state-of-the art offensive security analysis, based on the intuition that software too can be misguided by "fake news". This demonstrates that the potential uses of high-fidelity provenance for security extend beyond traditional forensics analysis

    Analysing system behaviour by automatic benchmarking of system-level provenance

    Get PDF
    Provenance is a term originating from the work of art. It aims to provide a chain of information of a piece of arts from its creation to the current status. It records all the historic information relating to this piece of art, including the storage locations, ownership, buying prices, etc. until the current status. It has a very similar definition in data processing and computer science. It is used as the lineage of data in computer science to provide either reproducibility or tracing of activities happening in runtime for a different purpose. Similar to the provenance used in art, provenance used in computer science and data processing field describes how a piece of data was created, passed around, modified, and reached the current state. Also, it provides information on who is responsible for certain activities and other related information. It acts as metadata on components in a computer environment. As the concept of provenance is to record all related information of some data, the size of provenance itself is generally proportional to the amount of data processing that took place. It generally tends to be a large set of data and is hard to analyse. Also, in the provenance collecting process, not all information is useful for all purposes. For example, if we just want to trace all previous owners of a file, then all the storage location information may be ignored. To capture useful information and without needing to handle a large amount of information, researchers and developers develop different provenance recording tools that only record information needed by particular applications with different means and mechanisms throughout the systems. This action allows a lighter set of information for analysis but it results in non-standard provenance information and general users may not have a clear view on which tools are better for some purposes. For example, if we want to identify if certain action sequences have been performed in a process and who is accountable for these actions for security analysis, we have no idea which tools should be trusted to provide the correct set of information. Also, it is hard to compare the tools as there is not much common standard around. With the above need in mind, this thesis concentrate on providing an automated system ProvMark to benchmark the tools. This helps to show the strengths and weaknesses of their provenance results in different scenarios. It also allows tool developers to verify their tools and allows end-users to compare the tools at the same level to choose a suitable one for the purpose. As a whole, the benchmarking based on the expressiveness of the tools on different scenarios shows us the right choice of provenance tools on specific usage

    Yavaa: supporting data workflows from discovery to visualization

    Get PDF
    Recent years have witness an increasing number of data silos being opened up both within organizations and to the general public: Scientists publish their raw data as supplements to articles or even standalone artifacts to enable others to verify and extend their work. Governments pass laws to open up formerly protected data treasures to improve accountability and transparency as well as to enable new business ideas based on this public good. Even companies share structured information about their products and services to advertise their use and thus increase revenue. Exploiting this wealth of information holds many challenges for users, though. Oftentimes data is provided as tables whose sheer endless rows of daunting numbers are barely accessible. InfoVis can mitigate this gap. However, offered visualization options are generally very limited and next to no support is given in applying any of them. The same holds true for data wrangling. Only very few options to adjust the data to the current needs and barely any protection are in place to prevent even the most obvious mistakes. When it comes to data from multiple providers, the situation gets even bleaker. Only recently tools emerged to search for datasets across institutional borders reasonably. Easy-to-use ways to combine these datasets are still missing, though. Finally, results generally lack proper documentation of their provenance. So even the most compelling visualizations can be called into question when their coming about remains unclear. The foundations for a vivid exchange and exploitation of open data are set, but the barrier of entry remains relatively high, especially for non-expert users. This thesis aims to lower that barrier by providing tools and assistance, reducing the amount of prior experience and skills required. It covers the whole workflow ranging from identifying proper datasets, over possible transformations, up until the export of the result in the form of suitable visualizations

    Detached Provenance Analysis

    Get PDF
    Data provenance is the research field of the algorithmic derivation of the source and processing history of data. In this work, the derivation of Where- and Why-provenance in sub-cell-level granularity is pursued for a rich SQL dialect. For example, we support the provenance analysis for individual elements of nested rows and/or arrays. The SQL dialect incorporates window functions and correlated subqueries. We accomplish this goal using a novel method called detached provenance analysis. This method carries out a SQL-level rewrite of any user query Q, yielding (Q1, Q2). Employing two queries facilitates a low-invasive provenance analysis, i.e. both queries can be evaluated using an unmodified DBMS as backend. The queries implement a split of responsibilities: Q1 carries out a runtime analysis and Q2 derives the actual data provenance. One drawback of this method is that a synchronization overhead between Q1 and Q2 is induced. Experiments quantify the overheads based on the TPC-H benchmark and the PostgreSQL DBMS. A second set of experiments carried out in row–level granularity compares our approach with the PERM approach (as described by B. Glavic et al.). The aggregated results show that basic queries (typically, a single SFW expression with aggregations) perform slightly better in the PERM approach while complex queries (nested SFW expressions and correlated subqueries) perform considerably better in our approach
    corecore