    A Novel Data Lineage Model for Critical Infrastructure and a Solution to a Special Case of the Temporal Graph Reachability Problem

    Rapid and accurate damage assessment is crucial to minimizing downtime in critical infrastructure. Dependence on modern technology requires fast, consistent techniques that prevent damage from spreading while also minimizing the impact of damage on system users. One technique that assists in assessment is data lineage, which traces the history of dependencies for data items. The goal of this thesis is to present a novel model and an algorithm that use data lineage to assess damage quickly and accurately. The model operates as a directed graph, with vertices representing data items and edges representing dependencies. Data is additionally grouped into multiple layers, which allows faster partial damage assessment: lower layers of the graph consist of more granular data items, while higher layers consist of containers of lower-layer items. By assessing a higher layer, one can immediately conclude that certain portions of the system are undamaged, and those portions may resume operation. In practice, graph creation is a front-loaded operation that allows immediate action at the time of damage assessment. Depending on the system, the graph will often be cyclic, which makes standard assessment computationally slow. Because our graph tracks the time of each dependency, it operates as a subclass of temporal graph, that is, a graph that changes over time. By exploiting unique properties of this subclass, our algorithm's running time depends almost solely on the number of edges. Put together, the model runs quickly, frees up undamaged portions of the system during assessment, and finds the minimum amount of damage that must be manually assessed.
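
    The assessment step can be made concrete with a short sketch. The following is a minimal, illustrative implementation of time-respecting damage propagation over a temporal dependency graph, not the thesis's actual code; it assumes timestamps strictly increase along any dependency chain (each derived item is written after its inputs are).

    class TemporalLineageGraph:
        def __init__(self):
            # (src, dst, time): dst was derived from src at `time`
            self.edges = []

        def add_dependency(self, src, dst, time):
            self.edges.append((src, dst, time))

        def potentially_damaged(self, damaged_item, damage_time):
            # Single pass over time-sorted edges: an item becomes tainted when
            # it depends on a tainted item at or after the moment that item was
            # tainted. O(E log E) to sort, then O(E) to scan, so the cost is
            # governed almost entirely by the number of edges.
            tainted = {damaged_item: damage_time}
            for src, dst, t in sorted(self.edges, key=lambda e: e[2]):
                if src in tainted and t >= tainted[src]:
                    if dst not in tainted or t < tainted[dst]:
                        tainted[dst] = t  # earliest time dst could be affected
            return set(tainted)

    Every item outside the returned set can be released back into operation immediately; only the returned items need manual assessment.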

    Computing candidate keys of relational operators for optimizing rewrite-based provenance computation: key property module

    Data provenance provides information about the origin of data and has long attracted the attention of the database community. It has proven essential for a wide range of use cases, from debugging data and queries to probabilistic databases. Different techniques exist for computing the data provenance of a query. However, even sophisticated database optimizers are usually incapable of producing an efficient execution plan for provenance computations because of their inherent complexity and unusual structure. In this work, I develop the key property module, part of the heuristic optimization techniques for rewrite-based provenance systems, to address this problem, and I present an implementation of this module in the GProM provenance middleware system. The key property stores the set of candidate keys for the output relation of a relational algebra operator. This property is important for evaluating the preconditions of many heuristic rewrite rules applied by GProM, e.g., rules that reduce the number of duplicate removal operators in a query. To complete this work, I provide an experimental evaluation which confirms that this property is extremely useful for improving the performance of game provenance computations.
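
    As a rough illustration of what the key property computes, here is a bottom-up inference pass over a relational algebra tree. The rules and the dictionary encoding of operators are a simplified sketch of the idea, not GProM's actual module.

    def infer_keys(op):
        # op is a dict: {"kind": ..., "child": ..., "attrs": [...], "keys": [...]}
        if op["kind"] == "table":
            # base relations contribute their schema-declared candidate keys
            return {frozenset(k) for k in op["keys"]}
        if op["kind"] == "selection":
            # filtering rows never invalidates a candidate key
            return infer_keys(op["child"])
        if op["kind"] == "projection":
            kept = set(op["attrs"])
            # a key survives only if all of its attributes are retained
            return {k for k in infer_keys(op["child"]) if k <= kept}
        if op["kind"] == "dupelim":
            # after duplicate removal, the full attribute set is itself a key
            return infer_keys(op["child"]) | {frozenset(op["attrs"])}
        raise NotImplementedError(op["kind"])

    A rewrite rule can then test its precondition cheaply: for example, a duplicate removal operator is redundant whenever infer_keys of its input is non-empty, since a candidate key guarantees the rows are already distinct.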

    Cloud Services Brokerage for Mobile Ubiquitous Computing

    Companies are increasingly adopting Mobile Cloud Computing (MCC) to deliver enterprise services efficiently to users (or consumers) on their personalized devices. MCC enables mobile devices (e.g., smartphones, tablets, notebooks, and smart watches) to access virtualized services such as software applications, servers, storage, and network services over the Internet. With the advancement and diversity of the mobile landscape, there has been a growing trend of a single user owning multiple mobile devices. This paradigm of supporting a single user or consumer accessing multiple services from n devices is referred to as Ubiquitous Cloud Computing (UCC) or Personal Cloud Computing. In the UCC era, consumers expect application and data consistency across their multiple devices in real time. However, this expectation can be hindered by intermittent loss of connectivity in wireless networks, user mobility, and peak load demands. Hence, this dissertation presents an architectural framework called Cloud Services Brokerage for Mobile Ubiquitous Cloud Computing (CSB-UCC), which ensures soft real-time and reliable consumption of services on users' multiple devices. The CSB-UCC acts as an application middleware broker that connects the n devices of users to multi-cloud services. The system determines the multi-cloud services from a user's subscriptions, and the n devices are determined through device registration on the broker. Preliminary evaluations of the system show that the following are achieved: 1) high scalability through a distributed architecture of the brokerage service, 2) soft real-time application synchronization for a consistent user experience through an enhanced mobile-to-cloud proximity-based access technique, 3) reliable recovery from system failure through transactional reassignment of services to active nodes, and 4) a transparent audit trail through access-level and context-centric provenance.
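
    A hypothetical sketch of the broker's core bookkeeping, registering devices, recording subscriptions, and fanning a service update out to all of a user's devices, is shown below; the names and structure are illustrative, not taken from the CSB-UCC implementation.

    from collections import defaultdict

    class Broker:
        def __init__(self):
            self.devices = defaultdict(set)        # user -> registered device ids
            self.subscriptions = defaultdict(set)  # user -> subscribed services

        def register_device(self, user, device_id):
            self.devices[user].add(device_id)

        def subscribe(self, user, service):
            self.subscriptions[user].add(service)

        def sync(self, user, service, payload, send):
            # Fan a service update out to every registered device of `user`,
            # keeping application state consistent across the user's n devices.
            if service not in self.subscriptions[user]:
                return
            for device_id in self.devices[user]:
                send(device_id, service, payload)  # e.g., push over a device channel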

    Towards the Efficient Use of Fine-Grained Provenance in Data Science Applications

    Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this need, my goal is to expedite the use of fine-grained provenance in applications within both the database and machine learning domains, which are ubiquitous in contemporary data science pipelines. In applications from the database domain, I focus on the problem of data citation and provide two types of solutions, rewriting-based and provenance-based, to generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information. In applications from the ML domain, the first problem considered is incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, which incrementally update ML models constructed by SGD/GD methods using provenance information collected during the training phase on the full dataset, before the deletion requests. The second ML application explores how to clean label uncertainties in the ML training dataset more efficiently and cheaply. To address this problem, I propose a solution, CHEF, which reduces the cost and overhead at each phase of the label-cleaning pipeline while maintaining overall model performance. I also propose initial ideas for removing some of the assumptions used in these solutions to extend them to more general scenarios.
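
    PrIU and DeltaGrad themselves handle SGD/GD-trained models; the toy sketch below only illustrates why caching training-time state makes deletions cheap, using a model with a closed form (ordinary least squares) in place of gradient-based training.

    import numpy as np

    class IncrementalOLS:
        def __init__(self, X, y):
            # cache sufficient statistics during the initial fit
            self.XtX = X.T @ X
            self.Xty = X.T @ y

        def params(self):
            return np.linalg.solve(self.XtX, self.Xty)

        def delete(self, X_del, y_del):
            # downdate the cached statistics for a small batch of deleted rows;
            # the cost depends on the deleted batch, not the full training set
            self.XtX -= X_del.T @ X_del
            self.Xty -= X_del.T @ y_del

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
    model = IncrementalOLS(X, y)
    model.delete(X[:3], y[:3])   # remove three training samples
    w = model.params()           # matches refitting from scratch on X[3:], y[3:]

    PrIU and DeltaGrad obtain analogous savings for gradient-based training by caching per-iteration information (parameters and gradients) as provenance and correcting it after a deletion, rather than retraining on the remaining data.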

    On Provenance Minimization

    Provenance information has proven very effective in capturing the computational process performed by queries and has been used extensively as input to many advanced data management tools (e.g., view maintenance, trust assessment, or query answering in probabilistic databases). We study here the core of provenance information, namely the part of the provenance that appears in the computation of every query equivalent to the given one. This provenance core is informative, as it describes the part of the computational process that is inherent to the query. It is also useful as a compact input to the data management tools mentioned above. We study algorithms that, given a query, compute an equivalent query that realizes the core provenance for all tuples in its result, and we study these algorithms for queries of varying expressive power. Finally, we observe that, in general, one would not want to require database systems to evaluate a specific query that realizes the core provenance, but instead to be able to find, possibly off-line, the core provenance of a given tuple in the output (computed by an arbitrary equivalent query) without rewriting the query. We provide algorithms for such direct computation of the core provenance.
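
    For conjunctive queries, a classical route to a query that realizes a minimal provenance is to compute the core of the query body via homomorphisms. The sketch below is illustrative only (it handles variables-only atoms, no constants) and is not the paper's algorithm.

    def homomorphism(src_atoms, dst_atoms, fixed):
        # Backtracking search for a homomorphism that maps every atom of
        # src_atoms into dst_atoms while fixing the head variables in `fixed`.
        def extend(mapping, atoms):
            if not atoms:
                return True
            (pred, args), rest = atoms[0], atoms[1:]
            for dpred, dargs in dst_atoms:
                if dpred != pred or len(dargs) != len(args):
                    continue
                m = dict(mapping)
                if all(m.setdefault(a, d) == d for a, d in zip(args, dargs)):
                    if extend(m, rest):
                        return True
            return False
        return extend({v: v for v in fixed}, list(src_atoms))

    def minimize(body, head_vars):
        # Drop atoms while the original body still maps into the reduced one;
        # the reduced query is then equivalent and touches fewer tuples.
        body = list(body)
        changed = True
        while changed:
            changed = False
            for atom in body:
                candidate = [a for a in body if a is not atom]
                if homomorphism(body, candidate, head_vars):
                    body, changed = candidate, True
                    break
        return body

    # Q(x) :- R(x,y), R(x,z) minimizes to a single atom
    print(minimize([("R", ("x", "y")), ("R", ("x", "z"))], {"x"}))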