7 research outputs found

    Leistungsmessung und Leistungsbewertung von NoSQL-Datenbanken (Performance Measurement and Performance Evaluation of NoSQL Databases)

    The aim of this master's thesis is to give an overview of the different database types and of performance analyses. The comparative literature study deals with a young field of research and looks in particular at non-relational NoSQL databases, which have become increasingly popular in recent years and offer some advantages over relational databases. But what can the concrete database implementations deliver for different data models, and which test setup suits which deployment requirements? The thesis first defines criteria for evaluating performance and examines the experimental procedures of various researchers. An important focus is on assessing and ensuring the comparability of measurement methods and results. In addition to the methodological approach, the thesis discusses the YCSB framework, an important tool with which performance measurements of NoSQL databases can be implemented.

    On Pattern Mining in Graph Data to Support Decision-Making

    In recent years, graph data models have become increasingly important in both research and industry. Their core is a generic data structure of things (vertices) and connections among those things (edges). Rich graph models such as the property graph model promise extraordinary analytical power because relationships can be evaluated without knowledge of a domain-specific database schema. This dissertation studies the use of graph models for data integration and data mining of business data. Although a typical company's business data implicitly describes a graph, it is usually stored in multiple relational databases. Therefore, we propose the first semi-automated approach to transform data from multiple relational databases into a single graph whose vertices represent domain objects and whose edges represent their mutual relationships. This transformation is the basis of our conceptual framework BIIIG (Business Intelligence with Integrated Instance Graphs). We further proposed a graph-based approach to data integration, which is executed after the transformation. In established data mining approaches, interrelated input data is mostly represented by tuples of measure values and dimension values. In the context of graphs, these values must be attached to the graph structure, and aggregated measure values are graph attributes. Since the latter was not supported by any existing model, we proposed the use of collections of property graphs. They act as the data structure of the novel Extended Property Graph Model (EPGM). The model supports vertices and edges that may appear in different graphs as well as graph properties. Furthermore, we proposed operators that benefit from this data structure, for example, graph-based aggregation of measure values. A primitive operation of graph pattern mining is frequent subgraph mining (FSM). However, existing algorithms provided no support for directed multigraphs.
We extended the popular gSpan algorithm to overcome this limitation. Some patterns might not be frequent while their generalizations are. Generalized graph patterns can be mined by attaching vertices to taxonomies. We proposed a novel approach to Generalized Multidimensional Frequent Subgraph Mining (GM-FSM), in particular the first solution to generalized FSM that supports not only directed multigraphs but also multiple dimensional taxonomies. In scenarios that compare patterns of different categories, e.g., fraud or not, FSM is not sufficient since pattern frequencies may differ by category. Moreover, determining all pattern frequencies without frequency pruning is not an option due to the computational complexity of FSM. Thus, we developed Characteristic Subgraph Mining (CSM), an FSM extension that extracts patterns that are characteristic of a specific category according to a user-defined interestingness function. Parts of this work were done in the context of GRADOOP, a framework for distributed graph analytics. To make the primitive operation of frequent subgraph mining available to this framework, we developed Distributed In-Memory gSpan (DIMSpan), a frequent subgraph miner that is tailored to the characteristics of shared-nothing clusters and distributed dataflow systems. Finally, we present the results of use case evaluations conducted in cooperation with a large-scale enterprise. This includes a report of practical experiences gained in implementing and applying the proposed algorithms.
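The abstract describes the EPGM data structure as a collection of property graphs whose vertices and edges may belong to several graphs, with aggregated measure values stored as graph properties. A minimal sketch of that idea might look as follows. All class, field, and function names here (`Element`, `Edge`, `LogicalGraph`, `aggregate`) and the sample data are hypothetical and chosen for illustration; they are not the API of GRADOOP or of the dissertation.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A vertex (or base of an edge); `graphs` holds the ids of all graphs it belongs to."""
    id: str
    label: str
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)


@dataclass
class Edge(Element):
    """An edge is an element with a source and a target vertex id."""
    source: str = ""
    target: str = ""


@dataclass
class LogicalGraph:
    """A graph in the collection; it carries its own properties."""
    id: str
    label: str
    properties: dict = field(default_factory=dict)


def aggregate(graph, vertices, key, out_key):
    """Sum one vertex measure over the members of `graph` and store it as a graph property."""
    graph.properties[out_key] = sum(
        v.properties.get(key, 0) for v in vertices if graph.id in v.graphs
    )
    return graph


# Hypothetical business data: invoice_b and customer belong to both graphs.
invoice_a = Element("v1", "Invoice", {"revenue": 100}, {"g1"})
invoice_b = Element("v2", "Invoice", {"revenue": 250}, {"g1", "g2"})
customer = Element("v3", "Customer", {}, {"g1", "g2"})
billed_to = Edge("e1", "billedTo", {}, {"g1"}, source="v1", target="v3")

g1 = LogicalGraph("g1", "BusinessTransaction")
g2 = LogicalGraph("g2", "BusinessTransaction")
vertices = [invoice_a, invoice_b, customer]
aggregate(g1, vertices, "revenue", "totalRevenue")
aggregate(g2, vertices, "revenue", "totalRevenue")
```

Storing graph membership on each element, rather than copying elements into each graph, is one simple way to let a vertex appear in several graphs; other layouts are possible.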

    Deletion of content in large cloud storage systems

    This thesis discusses the practical implications and challenges of providing secure deletion of data in cloud storage systems. Secure deletion is a desirable functionality to some users, but a requirement to others. The term secure deletion describes the practice of deleting data in such a way that it cannot be reconstructed later, even by forensic means. This work discusses the practice of secure deletion as well as existing methods that are used today. When moving from traditional on-site data storage to cloud services, these existing methods are no longer applicable. For this reason, it presents the concept of cryptographic deletion and points out the challenge behind implementing it in a practical way. A discussion of related work in the areas of data encryption and cryptographic deletion shows that a research gap exists in applying cryptographic deletion in an efficient, practical way to cloud storage systems. The main contribution of this thesis, the Key-Cascade method, solves this issue by providing an efficient data structure for managing large numbers of encryption keys. Secure deletion is practiced today by individuals and organizations who need to protect the confidentiality of data after it has been deleted. It is mostly achieved by means of physical destruction or overwriting in local hard disks or large storage systems. However, these traditional methods of overwriting data or destroying media are not suited to large, distributed, and shared cloud storage systems. The known concept of cryptographic deletion describes storing encrypted data in an untrusted storage system, while keeping the key in a trusted location. Given that the encryption is effective, secure deletion of the data can now be achieved by securely deleting the key. Whether encryption is an acceptable protection mechanism must be decided either by the legislature or by the customers themselves.
This depends on whether cryptographic deletion is done to satisfy legal requirements or customer requirements. The main challenge in implementing cryptographic deletion lies in the granularity of the delete operation. Storage encryption providers today either require deleting the master key, which deletes all stored data, or require expensive copy and re-encryption operations. In the literature, a few constructions can be found that provide optimized key management. The contributions of this thesis, found in the Key-Cascade method, expand on those findings and describe data structures and operations for implementing efficient cryptographic deletion in a cloud object store. This thesis discusses the conceptual aspects of the Key-Cascade method as well as its mathematical properties. To enable production use of a Key-Cascade implementation, it presents multiple extensions to the concept. These extensions improve performance and usability and also enable frictionless integration into existing applications. With SDOS, the Secure Delete Object Store, a working implementation of the concepts and extensions is given. Its design as an API proxy is unique among the existing cryptographic deletion systems and allows integration into existing applications without the need to modify them. The results of performance evaluations conducted with SDOS show that cryptographic deletion is feasible in practice. With MCM, the Micro Content Management system, this thesis also presents a larger demonstrator system for SDOS. MCM provides insight into how SDOS can be integrated into and deployed as part of a cloud data management application.
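The principle of cryptographic deletion described above (ciphertext in untrusted storage, key in a trusted location, deletion by destroying the key) can be sketched in a few lines. This is an illustration under loud assumptions: `toy_cipher` is a hash-based XOR keystream used only so the example runs with the standard library, and it is not a secure cipher; the flat two-level key layout and the names `ObjectStore`, `put`, `get`, and `secure_delete` are hypothetical and are not the Key-Cascade structure or the SDOS API.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR `data` with a SHA-256 counter keystream (toy stand-in, NOT a secure cipher).

    Applying it twice with the same key and nonce returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ObjectStore:
    """Only `master` is assumed to live in a trusted location."""

    def __init__(self):
        self.master = secrets.token_bytes(32)  # trusted
        self.wrapped = {}  # object id -> object key encrypted under master (untrusted)
        self.blobs = {}    # object id -> encrypted object data (untrusted)

    def put(self, oid: str, plaintext: bytes) -> None:
        obj_key = secrets.token_bytes(32)
        self.blobs[oid] = toy_cipher(obj_key, oid.encode(), plaintext)
        self.wrapped[oid] = toy_cipher(self.master, oid.encode(), obj_key)

    def get(self, oid: str) -> bytes:
        obj_key = toy_cipher(self.master, oid.encode(), self.wrapped[oid])
        return toy_cipher(obj_key, oid.encode(), self.blobs[oid])

    def secure_delete(self, oid: str) -> None:
        # Unwrap the surviving object keys, rotate the master key, and re-wrap them.
        # The deleted object's key is never re-wrapped, so old copies of its
        # ciphertext and wrapped key become unreadable once the old master is gone.
        survivors = {
            o: toy_cipher(self.master, o.encode(), w)
            for o, w in self.wrapped.items()
            if o != oid
        }
        self.master = secrets.token_bytes(32)
        self.wrapped = {
            o: toy_cipher(self.master, o.encode(), k) for o, k in survivors.items()
        }


store = ObjectStore()
store.put("invoice", b"customer data")
store.put("report", b"quarterly figures")
stale_wrapped_key = store.wrapped["invoice"]  # a copy an attacker might have kept
store.secure_delete("invoice")
```

In this flat layout every surviving key is re-wrapped on each delete, although no object data is re-encrypted. That per-delete cost is the granularity problem the abstract points to, which motivates a more efficient key-management structure.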

    Clustering Approaches for Multi-source Entity Resolution

    Entity Resolution (ER) or deduplication aims at identifying entities, such as specific customer or product descriptions, in one or several data sources that refer to the same real-world entity. ER is of key importance for improving data quality and has a crucial role in data integration and querying. The previous generation of ER approaches focused on integrating records from two relational databases or performing deduplication within a single database. Nevertheless, in the era of Big Data the number of available data sources is increasing rapidly. Therefore, large-scale data mining or querying systems need to integrate data obtained from numerous sources. For example, in online digital libraries or e-shops, publications or products are incorporated from a large number of archives or suppliers across the world or within a specified region or country to provide a unified view for the user. This process requires data consolidation from numerous heterogeneous data sources, most of which are evolving. As the number of sources grows, data heterogeneity and velocity as well as the variance in data quality increase. Therefore, multi-source ER, i.e. finding matching entities in an arbitrary number of sources, is a challenging task. Previous efforts for matching and clustering entities between multiple sources (> 2) mostly treated all sources as a single source. This approach cannot use metadata or provenance information to enhance the integration quality, and it leads to poor results because differences in the quality of the sources are ignored. The conventional ER pipeline consists of blocking, pair-wise matching of entities, and classification. To meet the new needs and requirements, holistic clustering approaches that are capable of scaling to many data sources are needed.
The holistic clustering-based ER should further overcome the restriction of pairwise linking of entities by making the process capable of grouping entities from multiple sources into clusters. The clustering step aims at removing false links while adding missing true links across sources. Additionally, incremental clustering and repairing approaches need to be developed to cope with the ever-increasing number of sources and new incoming entities. To this end, we developed novel clustering and repairing schemes for multi-source entity resolution. The approaches are capable of grouping entities from multiple clean (duplicate-free) sources, as well as handling data from an arbitrary combination of clean and dirty sources. The clustering schemes developed specifically for multi-source ER can obtain superior results compared to general-purpose clustering algorithms. Additionally, we developed incremental clustering and repairing methods in order to handle evolving sources. The proposed incremental approaches are capable of incorporating new sources as well as new entities from existing sources. The more sophisticated approach is able to repair previously determined clusters, and consequently yields improved quality and a reduced dependency on the insert order of the new entities. To ensure scalability, parallel variants of all approaches are implemented on top of the Apache Flink framework, which is a distributed processing engine. The proposed methods have been integrated in a new end-to-end ER tool named FAMER (FAst Multi-source Entity Resolution system). The FAMER framework comprises Linking and Clustering components encompassing both batch and incremental ER functionalities. The output of the Linking component is recorded as a similarity graph where each vertex represents an entity and each edge maintains the similarity relationship between two entities. This similarity graph is the input of the Clustering component.
Overall, the comprehensive comparative evaluations show that the proposed clustering and repairing approaches for both batch and incremental ER achieve high quality while maintaining scalability.
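To make the idea of clustering a similarity graph from clean (duplicate-free) sources concrete, the following sketch merges entities greedily by descending similarity and refuses any merge that would place two entities of the same source in one cluster. It is a simple illustration only, not one of the clustering schemes of the thesis or of FAMER; the function name `cluster_clean_sources`, the threshold value, and the sample data are assumptions.

```python
def cluster_clean_sources(entities, links, threshold=0.5):
    """Greedy source-aware clustering of a similarity graph.

    entities: dict mapping entity id -> source name
    links:    list of (entity_a, entity_b, similarity) edges
    Returns a list of clusters (sets of entity ids).
    """
    parent = {e: e for e in entities}                # union-find forest
    sources = {e: {s} for e, s in entities.items()}  # cluster root -> sources covered

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Strongest links first, so weaker conflicting links are the ones dropped.
    for a, b, sim in sorted(links, key=lambda link: -link[2]):
        if sim < threshold:
            break
        ra, rb = find(a), find(b)
        if ra == rb or sources[ra] & sources[rb]:
            # Already clustered together, or the merge would put two entities
            # from one duplicate-free source into the same cluster.
            continue
        parent[rb] = ra
        sources[ra] |= sources.pop(rb)

    clusters = {}
    for e in entities:
        clusters.setdefault(find(e), set()).add(e)
    return sorted(clusters.values(), key=sorted)


# Hypothetical entities from three sources A, B, C.
entities = {"a1": "A", "b1": "B", "c1": "C", "b2": "B"}
links = [("a1", "b1", 0.9), ("b1", "c1", 0.8), ("a1", "b2", 0.7), ("c1", "b2", 0.3)]
clusters = cluster_clean_sources(entities, links)
```

Here the link between `a1` and `b2` is treated as a false link and dropped, because `a1` is already clustered with `b1` and both `b1` and `b2` come from source B.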

    Large Scale Qualitative Spatio-Temporal Reasoning

    This thesis considers qualitative spatio-temporal reasoning (QSTR), a branch of artificial intelligence that is concerned with qualitative spatial and temporal relations between entities. Despite QSTR being an active area of research for many years, there has been comparatively little work looking at large-scale qualitative spatio-temporal reasoning, that is, reasoning using hundreds of thousands or millions of relations. The big data phenomenon of recent years means there is now a requirement for QSTR implementations that will scale effectively and reason using large-scale datasets. However, existing reasoners are limited in their scalability, so new approaches to QSTR are needed. This thesis considers whether parallel distributed programming techniques can be used to address the challenges of large-scale QSTR. Specifically, this thesis presents the first in-depth investigation of adapting QSTR techniques to work in a distributed environment. This has resulted in a large-scale qualitative spatial reasoner, ParQR, which has been evaluated by comparing it with existing reasoners and alternative approaches to large-scale QSTR. ParQR has been shown to outperform existing solutions, reasoning using far larger datasets than previously possible. The thesis then considers a specific application of large-scale QSTR, querying knowledge graphs. This has two parts: first, integrating large-scale complex spatial datasets to generate an enhanced knowledge graph that can support qualitative spatial reasoning, and second, adapting parallel, distributed QSTR techniques to implement a query answering system for spatial knowledge graphs. The query engine that has been developed is able to provide solutions to a variety of spatial queries. It has been evaluated and shown to provide more comprehensive query results than quantitative-only techniques.