    Secure Time-Aware Provenance for Distributed Systems

    Get PDF
    Operators of distributed systems often need to answer forensic questions in order to perform a variety of management tasks, including fault detection, system debugging, accountability enforcement, and attack analysis. In this dissertation, we present Secure Time-Aware Provenance (STAP), a novel approach that provides the fundamental functionality required to answer such forensic questions – the capability to “explain” the existence (or change) of a certain distributed system state at a given time in a potentially adversarial environment. This dissertation makes the following contributions. First, we propose the STAP model, which explicitly represents time and state changes. The STAP model allows consistent and complete explanations of system state (and changes) in dynamic environments. Second, we show that it is both possible and practical to maintain and query provenance efficiently and scalably in a distributed fashion, where provenance maintenance and querying are modeled as recursive continuous queries over distributed relations. Third, we present security extensions that allow operators to reliably query provenance information in adversarial environments. Our extensions incorporate tamper-evident properties that guarantee eventual detection of compromised nodes that lie or falsely implicate correct nodes. Finally, the proposed research results in a proof-of-concept prototype, which includes a declarative query language for specifying a range of useful provenance queries, an interactive exploration tool, and a distributed provenance engine that lets operators analyze their distributed systems. We discuss the applicability of this tool in several use cases, including Internet routing, overlay routing, and cloud data processing.
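
    The following is a minimal, illustrative sketch in Python of the time-aware provenance idea described above (explaining a state at a given time by walking its derivation backwards); the tuple names, rule labels, and API are hypothetical and are not STAP's actual query language or engine.

        # Hypothetical sketch: each derived tuple records the rule, the input
        # tuples, and the logical time of derivation; explain() recursively
        # prints the derivation that supports a state as of a given time.
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class ProvNode:
            tuple_id: str          # e.g. "route(a,c)"
            time: int              # logical time of derivation
            rule: str = "base"     # deriving rule, or "base" for inserted facts
            inputs: tuple = ()     # tuple_ids of the supporting tuples

        def explain(graph, tuple_id, at_time, depth=0):
            """Print the derivation tree of tuple_id as of logical time at_time."""
            candidates = [n for n in graph
                          if n.tuple_id == tuple_id and n.time <= at_time]
            if not candidates:
                print("  " * depth + f"{tuple_id}: no derivation by t={at_time}")
                return
            node = max(candidates, key=lambda n: n.time)
            print("  " * depth + f"{node.tuple_id} @t={node.time} via {node.rule}")
            for inp in node.inputs:
                explain(graph, inp, node.time, depth + 1)

        graph = [
            ProvNode("link(a,b)", 1),
            ProvNode("link(b,c)", 2),
            ProvNode("route(a,c)", 3, rule="r1: route :- link, link",
                     inputs=("link(a,b)", "link(b,c)")),
        ]
        explain(graph, "route(a,c)", at_time=5)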

    Extending graph homomorphism and simulation for real life graph matching

    Get PDF
    Among the vital problems in a variety of emerging applications is the graph matching problem: determining whether two graphs are similar and, if so, finding all the valid matches in one graph for the other, based on specified metrics. Traditional graph matching approaches are mostly based on graph homomorphism and isomorphism, falling short of capturing both structural and semantic similarity in real-life applications. Moreover, it is preferable yet difficult to find all matches with high accuracy over complex graphs. Worse still, graph structures in real-life applications are constantly modified. In response to these challenges, this thesis presents a series of approaches for efficiently solving graph matching problems over both static and dynamic real-life graphs. Firstly, the thesis extends graph homomorphism and subgraph isomorphism, respectively, by mapping edges from one graph to paths in another and by measuring the semantic similarity of nodes. Graph similarity is then measured by metrics based on these extensions. Several optimization problems for graph matching based on the new metrics are studied, and approximation algorithms with provable guarantees on match quality are developed. Secondly, although extended in the above work, graph matching is still defined in terms of functions, which cannot capture more meaningful matches and is usually hard to compute. In response to this, the thesis proposes a class of graph patterns in which an edge denotes connectivity in a data graph within a predefined number of hops. In addition, the thesis defines graph pattern matching based on a notion of bounded simulation, an extension of graph simulation. With this revision, graph pattern matching becomes solvable in cubic time rather than being intractable, and such an algorithm is provided. Thirdly, real-life graphs often bear multiple edge types. In response to this, the thesis further extends and generalizes the proposed revisions of graph simulation to a more powerful case: a novel set of reachability queries and graph pattern queries, constrained by a subclass of regular path expressions. Several fundamental problems of these queries are studied: containment, equivalence and minimization. The enriched reachability queries do not increase the complexity of these problems, as shown by the corresponding algorithms. Moreover, graph pattern queries can be evaluated in cubic time, and two such algorithms are proposed. Finally, real-life graphs are frequently updated with small changes. The thesis investigates incremental algorithms for graph pattern matching defined in terms of graph simulation, bounded simulation and subgraph isomorphism. Besides establishing complexity bounds, the thesis provides an experimental study verifying that these incremental algorithms significantly outperform their batch counterparts in response to small changes, using both real-life and synthetic data.
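
    As a concrete reference point, the sketch below implements plain graph simulation (the baseline notion that bounded simulation extends) as a fixpoint computation in Python; the graphs and labels are made up for illustration, and this is not the thesis's optimized cubic-time algorithm.

        # Illustrative sketch of plain graph simulation: iteratively shrink each
        # pattern node's candidate set until every remaining data node can match
        # all outgoing pattern edges.
        def graph_simulation(pattern_adj, pattern_label, data_adj, data_label):
            """Return, for each pattern node, the set of data nodes that simulate it."""
            sim = {u: {v for v in data_label if data_label[v] == pattern_label[u]}
                   for u in pattern_label}
            changed = True
            while changed:
                changed = False
                for u in pattern_adj:
                    for v in list(sim[u]):
                        # every pattern edge u -> u2 needs a matching data edge v -> w
                        ok = all(any(w in sim[u2] for w in data_adj.get(v, ()))
                                 for u2 in pattern_adj[u])
                        if not ok:
                            sim[u].discard(v)
                            changed = True
            return sim

        # Pattern: a person connected to a city; data graph has two persons,
        # only one of which is linked to a city.
        pattern_adj = {"p": ["c"], "c": []}
        pattern_label = {"p": "person", "c": "city"}
        data_adj = {1: [3], 2: [], 3: []}
        data_label = {1: "person", 2: "person", 3: "city"}
        print(graph_simulation(pattern_adj, pattern_label, data_adj, data_label))
        # {'p': {1}, 'c': {3}}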

    i3MAGE: Incremental, Interactive, Inter-Model Mapping Generation

    Full text link
    Data integration is a highly important prerequisite for most enterprise data analyses. While hard in general, a particular concern is the human effort required to design a global integration schema, author queries against that schema, and create mappings that connect data sources with the global schema. Ontology-based data integration (OBDI), which employs ontologies as a target model, reduces the effort for schema design and usage. On the other hand, it requires mappings that are particularly difficult to create. Architects who work with OBDI hence need systems to support the process of mapping development. One key type of tooling to support mapping development is automatic or semi-automatic generation of mapping suggestions. While many such tools exist in the wider sphere of data integration, few are built to work in the case of OBDI, where the inter-model gap between relational input schemata and a target ontology has to be bridged. Among those that support OBDI at all, none so far are fully optimized for this specific case by performing a truly inter-model matching while also leveraging distinct but corresponding aspects of both models. We propose i3MAGE, an approach and a system for automatic and semi-automatic generation of mappings in OBDI. The system is built on generic inter-model matching, and it is optimized in various ways for matching relational source schemata to target ontology schemata. To be truly semi-automatic in every respect, i3MAGE works both incrementally, building mappings pay-as-you-go, and interactively, in exchange with a human user. We introduce a specialized benchmark and evaluate i3MAGE against a number of other approaches. In addition, we provide examples where i3MAGE can be deployed in holistic data integration environments.
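
    To illustrate the kind of inter-model matching step such a system automates, the toy sketch below scores relational table names against ontology class names by lexical similarity and emits mapping suggestions; the names, threshold, and scoring function are illustrative assumptions, not i3MAGE's actual matching algorithm.

        # Toy sketch of one inter-model matching step: suggest an ontology class
        # for each relational table based on a simple string-similarity score.
        from difflib import SequenceMatcher

        def similarity(a, b):
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def suggest_class_mappings(tables, classes, threshold=0.6):
            """Yield (table, ontology_class, score) suggestions above a threshold."""
            for table in tables:
                best = max(classes, key=lambda c: similarity(table, c))
                score = similarity(table, best)
                if score >= threshold:
                    yield table, best, round(score, 2)

        tables = ["employees", "dept", "project_members"]
        classes = ["Employee", "Department", "Project"]
        for suggestion in suggest_class_mappings(tables, classes):
            print(suggestion)   # e.g. ('employees', 'Employee', 0.94)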

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
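
    The canonical word-count example below sketches the MapReduce programming model in plain Python: map emits (key, value) pairs, the shuffle groups them by key, and reduce aggregates each group; a real framework would distribute these phases across a cluster.

        # Word count in the MapReduce style, simulated in a single process.
        from itertools import groupby

        def map_phase(document):
            for word in document.split():
                yield word.lower(), 1          # emit (key, value) pairs

        def reduce_phase(word, counts):
            yield word, sum(counts)            # aggregate all values for one key

        documents = ["the quick brown fox", "the lazy dog", "the fox"]
        # map, then shuffle (sort/group intermediate pairs by key), then reduce
        intermediate = sorted(pair for doc in documents for pair in map_phase(doc))
        result = dict(out
                      for key, group in groupby(intermediate, key=lambda kv: kv[0])
                      for out in reduce_phase(key, (v for _, v in group)))
        print(result)   # {'brown': 1, 'dog': 1, 'fox': 2, ...}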

    Personalizable Knowledge Integration

    Get PDF
    Large repositories of data are used daily as knowledge bases (KBs) feeding computer systems that support decision-making processes, such as in medical or financial applications. Unfortunately, the larger a KB is, the harder it is to ensure its consistency and completeness. The problem of handling KBs of this kind has been studied in the AI and databases communities, but most approaches focus on computing answers locally to the KB, assuming there is some single, epistemically correct solution. It is important to recognize that for some applications, as part of the decision-making process, users consider far more knowledge than that which is contained in the knowledge base, and that sometimes inconsistent data may help in directing reasoning; for instance, inconsistency in taxpayer records can serve as evidence of possible fraud. Thus, the handling of this type of data needs to be context-sensitive, creating a synergy with the user in order to build useful, flexible data management systems. Inconsistent and incomplete information is ubiquitous and presents a substantial problem when trying to reason about the data: how can we derive an adequate model of the world, from the point of view of a given user, from a KB that may be inconsistent or incomplete? In this thesis we argue that in many cases users need to bring their application-specific knowledge to bear in order to inform the data management process. Therefore, we provide different approaches to handle, in a personalized fashion, some of the most common issues that arise in knowledge management. Specifically, we focus on (1) inconsistency management in relational databases, general knowledge bases, and a special kind of knowledge base designed for news reports; (2) management of incomplete information in the form of different types of null values; and (3) answering queries in the presence of uncertain schema matchings. We allow users to define policies to manage both inconsistent and incomplete information in their application, in a way that takes into account both the user's knowledge of the problem and their attitude to error/risk. Using the frameworks and tools proposed here, users can specify when and how they want to manage or resolve the issues that arise due to inconsistency and incompleteness in their data, in the way that best suits their needs.
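
    The toy sketch below illustrates the general idea of a user-defined inconsistency-management policy: two sources disagree on a value, and the user's policy, rather than the system, decides how the conflict is handled. The policies, field names, and data are hypothetical and do not reproduce the thesis's formal frameworks.

        # Hypothetical example: conflicting income figures for one taxpayer,
        # resolved according to whichever policy the user selects.
        def resolve(values, policy):
            if policy == "prefer_newest":
                return max(values, key=lambda v: v["timestamp"])["income"]
            if policy == "flag_for_review":
                return ("CONFLICT", sorted(v["income"] for v in values))
            raise ValueError(f"unknown policy: {policy}")

        records = [
            {"source": "employer_filing", "income": 85000, "timestamp": 2023},
            {"source": "self_report",     "income": 42000, "timestamp": 2024},
        ]
        print(resolve(records, "prefer_newest"))     # 42000
        print(resolve(records, "flag_for_review"))   # ('CONFLICT', [42000, 85000])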

    Open City Data Pipeline

    Get PDF
    Statistical data about cities, regions and countries is collected for various purposes and by various institutions. Yet, while access to high-quality and recent such data is crucial both for decision makers and for the public, all too often such collections of data remain isolated and not re-usable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and republish this data in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular, extensible, always up-to-date fashion; (ii) we use both machine learning techniques and ontological reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to e.g. DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC 2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly helps to improve the quality of our missing-value imputations; secondly, we arguably show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality.
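
    The sketch below illustrates the enrichment idea of combining equational background knowledge with a machine-learning fallback for missing-value imputation; the indicator names and the equation (population density = population / area) are illustrative assumptions rather than the pipeline's actual vocabulary or models.

        # Hypothetical sketch: impute a missing indicator from an equation over
        # other indicators when possible, otherwise fall back to a learned model.
        def impute_density(city, model=None):
            # equational knowledge: population_density = population / area_km2
            if city.get("population") is not None and city.get("area_km2"):
                return city["population"] / city["area_km2"]
            if model is not None:
                # ML fallback, e.g. a regressor trained on other indicators
                return model.predict([city["features"]])[0]
            return None

        vienna = {"population": 1_900_000, "area_km2": 414.6}
        print(round(impute_density(vienna)))   # ~4583 inhabitants per km2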

    Gridfields: Model-Driven Data Transformation in the Physical Sciences

    Get PDF
    Scientists' ability to generate and store simulation results is outpacing their ability to analyze them via ad hoc programs. We observe that these programs exhibit an algebraic structure that can be used to facilitate reasoning and improve performance. In this dissertation, we present a formal data model that exposes this algebraic structure, then implement the model, evaluate it, and use it to express, optimize, and reason about data transformations in a variety of scientific domains. Simulation results are defined over a logical grid structure that allows a continuous domain to be represented discretely in the computer. Existing approaches for manipulating these gridded datasets are incomplete. The performance of SQL queries that manipulate large numeric datasets is not competitive with that of specialized tools, and the up-front effort required to deploy a relational database makes them unpopular for dynamic scientific applications. Tools for processing multidimensional arrays can only capture regular, rectilinear grids. Visualization libraries accommodate arbitrary grids, but no algebra has been developed to simplify their use and afford optimization. Further, these libraries are data dependent: physical changes to data characteristics break user programs. We adopt the grid as a first-class citizen, separating topology from geometry and structure from data. Our model is agnostic with respect to dimension, uniformly capturing, for example, particle trajectories (1-D), sea-surface temperatures (2-D), and blood flow in the heart (3-D). Equipped with data, a grid becomes a gridfield. We provide operators for constructing, transforming, and aggregating gridfields that admit algebraic laws useful for optimization. We implement the model by analyzing several candidate data structures and incorporating their best features. We then show how to deploy gridfields in practice by injecting the model as middleware between heterogeneous, ad hoc file formats and a popular visualization library. In this dissertation, we define, develop, implement, evaluate and deploy a model of gridded datasets that accommodates a variety of complex grid structures and a variety of complex data products. We evaluate the applicability and performance of the model using datasets from oceanography, seismology, and medicine, and conclude that our model-driven approach offers significant advantages over the status quo.
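
    A minimal sketch of the gridfield idea follows: topology (cells) is kept separate from the data bound to it, and operators such as restrict are algebraic in that they take and return gridfields. The class and operator shown are illustrative, not the dissertation's actual operator set or implementation.

        # Illustrative gridfield: a grid's cells plus attribute values bound to
        # them, with one algebraic operator that returns a new gridfield.
        class GridField:
            def __init__(self, cells, data):
                self.cells = list(cells)      # topology: cell identifiers
                self.data = dict(data)        # attribute values bound to cells

            def restrict(self, predicate):
                """Keep only the cells whose bound value satisfies the predicate."""
                kept = [c for c in self.cells if predicate(self.data[c])]
                return GridField(kept, {c: self.data[c] for c in kept})

        # sea-surface temperature bound to a tiny 2-D grid of cells
        sst = GridField(cells=[(0, 0), (0, 1), (1, 0), (1, 1)],
                        data={(0, 0): 14.2, (0, 1): 15.1, (1, 0): 18.4, (1, 1): 19.0})
        warm = sst.restrict(lambda t: t > 15.0)
        print(warm.cells)   # [(0, 1), (1, 0), (1, 1)]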

    Graph-Based Weakly-Supervised Methods for Information Extraction & Integration

    Get PDF
    The variety and complexity of potentially related data resources available for querying --- webpages, databases, data warehouses --- has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within the Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most current IE and II methods, which could potentially be applied to the problem of integration across sources, require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input, via weakly-supervised methods for IE and II. In particular, we argue that graph-based representation of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration. Within IE, we focus on the problem of assigning semantic classes to entities. First, we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve the performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase the recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases. Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. Finally, we demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema, which has been necessary for most previous systems.
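
    The sketch below shows a plain label-propagation iteration in the spirit of (but much simpler than) Adsorption and MAD: seed nodes are clamped to their class labels, and every other node repeatedly averages its neighbors' label distributions over a weighted graph. The graph, seeds, and classes are made-up examples.

        # Simplified label propagation over a weighted undirected graph.
        def propagate(edges, seeds, classes, iters=10):
            nodes = {n for e in edges for n in e[:2]}
            labels = {n: dict(seeds.get(n, {})) for n in nodes}
            for _ in range(iters):
                new = {}
                for n in nodes:
                    if n in seeds:                   # clamp seed nodes to their labels
                        new[n] = dict(seeds[n])
                        continue
                    neigh = [(v, w) for u, v, w in edges if u == n] + \
                            [(u, w) for u, v, w in edges if v == n]
                    total = sum(w for _, w in neigh) or 1.0
                    new[n] = {c: sum(w * labels[v].get(c, 0.0) for v, w in neigh) / total
                              for c in classes}
                labels = new
            return labels

        # entities linked to shared context nodes; two entities are seeded
        edges = [("apple", "fruit_ctx", 1.0), ("banana", "fruit_ctx", 1.0),
                 ("paris", "city_ctx", 1.0)]
        seeds = {"apple": {"FRUIT": 1.0}, "paris": {"CITY": 1.0}}
        print(propagate(edges, seeds, classes=["FRUIT", "CITY"])["banana"])
        # roughly {'FRUIT': 0.97, 'CITY': 0.0}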