
    Flexibility in Data Management

    With the ongoing expansion of information technology, new fields of application requiring data management emerge virtually every day. In our knowledge culture, increasing amounts of data and a workforce organized in more creativity-oriented ways are also radically changing traditional fields of application and calling established assumptions about data management into question. For instance, investigative analytics and agile software development move towards a very agile and flexible handling of data. As the primary facilitators of data management, database systems have to reflect and support these developments. However, traditional database management technology, in particular relational database systems, is built on assumptions of relatively stable application domains. The need to model all data up front in a prescriptive database schema has earned relational database management systems a reputation among developers as inflexible, dated, and cumbersome to work with. Nevertheless, relational systems still dominate the database market: they are a proven, standardized, and interoperable technology, well known in IT departments with a workforce of experienced and trained developers and administrators. This thesis aims at resolving the growing contradiction between the popularity and omnipresence of relational systems in companies and their increasingly bad reputation among developers. It adapts relational database technology towards more agility and flexibility. We envision a descriptive, schema-comes-second relational database system, which is entity-oriented instead of schema-oriented; descriptive rather than prescriptive. The thesis provides four main contributions: (1) a flexible relational data model, which frees relational data management from having a prescriptive schema; (2) autonomous physical entity domains, which partition self-descriptive data according to their schema properties for better query performance; (3) a freely adjustable storage engine, which allows adapting the physical data layout to the properties of the data and of the workload; and (4) a self-managed indexing infrastructure, which autonomously collects and adapts index information in the presence of dynamic workloads and evolving schemas. The flexible relational data model is the thesis' central contribution: it describes the functional appearance of the descriptive schema-comes-second relational database system. The other three contributions improve components in the architecture of database management systems to increase the query performance and manageability of descriptive schema-comes-second relational database systems. We are confident that these four contributions can help pave the way to a more flexible future for relational database management technology.
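    To make the schema-comes-second idea concrete, the following Python sketch shows a store that accepts entities without any up-front schema and derives a descriptive schema, grouped by attribute signature, from the data itself. All names here are illustrative assumptions for exposition, not the thesis' actual interface.

```python
# A minimal sketch of a "schema-comes-second" store (hypothetical names,
# not the thesis' actual API): entities are inserted without any
# prescriptive schema, and a descriptive schema is derived afterwards.
from collections import defaultdict

class FlexibleStore:
    def __init__(self):
        # Entities grouped by attribute signature, loosely mirroring the
        # idea of physical entity domains partitioned by schema properties.
        self.domains = defaultdict(list)

    def insert(self, entity: dict):
        # No up-front schema check: any set of attributes is accepted.
        signature = frozenset(entity.keys())
        self.domains[signature].append(entity)

    def describe_schema(self):
        # The schema is descriptive: it reports what the data looks like,
        # rather than prescribing what it must look like.
        return {tuple(sorted(sig)): len(rows)
                for sig, rows in self.domains.items()}

store = FlexibleStore()
store.insert({"name": "Ada", "born": 1815})
store.insert({"name": "Alan", "born": 1912, "field": "CS"})
print(store.describe_schema())
# {('born', 'name'): 1, ('born', 'field', 'name'): 1}
```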

    A scalable analysis framework for large-scale RDF data

    With the growth of the Semantic Web, the availability of RDF datasets from multiple domains as Linked Data has taken the corpora of this web to a terabyte scale and challenges modern knowledge storage and discovery techniques. Research and engineering on RDF data management systems is a very active area, with many standalone systems being introduced. However, as the size of RDF data increases, such single-machine approaches meet performance bottlenecks, in terms of both data loading and querying, due to the limited parallelism inherent to symmetric multi-threaded systems and the limited available system I/O and memory. Although several approaches for distributed RDF data processing have been proposed, along with clustered versions of more traditional approaches, their techniques are limited by the trade-off they exploit between loading complexity and query efficiency in the presence of big RDF data. This thesis therefore introduces a scalable analysis framework for processing large-scale RDF data, which focuses on various techniques to reduce inter-machine communication, computation, and load imbalance so as to achieve fast data loading and querying on distributed infrastructures. The first part of this thesis focuses on the study of RDF store implementation and parallel hashing for big data processing. (1) A system-level investigation of RDF store implementation has been conducted on the basis of a comparative analysis of the runtime characteristics of a representative set of RDF stores. The detailed time cost and system consumption are measured for data loading and querying so as to provide insight into different triple store implementations as well as an understanding of performance differences between platforms. (2) A high-level structured parallel hashing approach over distributed memory is proposed and theoretically analyzed. The detailed performance of hashing implementations using different lock-free strategies has been characterized through extensive experiments, allowing system developers to make a more informed choice for the implementation of their high-performance analytical data processing systems. The second part of this thesis proposes three main techniques for fast processing of large RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding algorithm that avoids unnecessary disk-space consumption and reduces the computational complexity of query execution; the presented implementation achieves notable speedups compared to the state-of-the-art method as well as excellent scalability. (2) Several novel parallel join algorithms that efficiently handle skew over large data during query processing; the approaches achieve good load balancing and have been demonstrated to be faster than state-of-the-art techniques in both theoretical and experimental comparisons. (3) A two-tier dynamic indexing approach for processing SPARQL queries, which keeps loading times low and decreases or, in some instances, removes inter-machine data movement for subsequent queries that contain the same graph patterns. The results demonstrate that this design can load data at least an order of magnitude faster than a clustered store operating in RAM while remaining within an interactive range for query processing, and it even outperforms current systems for various queries.
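    As a rough illustration of the dictionary encoding idea, the sketch below hash-partitions RDF terms so that each partition can assign integer IDs independently, with no cross-partition coordination; the thesis' actual algorithm, ID layout, and parallel runtime are not reproduced here, and all names are assumptions.

```python
# A minimal sketch of hash-partitioned dictionary encoding for RDF terms.
# Each term is owned by exactly one worker (chosen by hash), so workers
# could assign IDs in parallel without coordination. Simulated here in a
# single process for clarity; this is not the thesis' implementation.
from zlib import crc32

NUM_WORKERS = 4

def encode(triples):
    # One term dictionary per (simulated) worker.
    dictionaries = [dict() for _ in range(NUM_WORKERS)]

    def term_id(term):
        w = crc32(term.encode()) % NUM_WORKERS            # owning worker
        local = dictionaries[w].setdefault(term, len(dictionaries[w]))
        return w + local * NUM_WORKERS                    # globally unique ID

    encoded = [tuple(term_id(t) for t in triple) for triple in triples]
    return encoded, dictionaries

encoded, dicts = encode([
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:alice"),
])
print(encoded)  # each distinct term maps to one integer ID, reused on repeats
```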

    Cypher: An Evolving Query Language for Property Graphs

    The Cypher property graph query language is an evolving language, originally designed and implemented as part of the Neo4j graph database, and currently used by several commercial database products and researchers. We describe Cypher 9, the first version of the language governed by the openCypher Implementers Group. We first introduce the language by example and describe its uses in industry. We then provide a formal semantic definition of the core read-query features of Cypher, including its variant of the property graph data model and its "ASCII art" graph pattern matching mechanism for expressing subgraphs of interest to an application. We compare the features of Cypher to other property graph query languages and describe extensions, at an advanced stage of development, which will form part of Cypher 10, turning the language into a compositional language that supports graph projections and multiple named graphs.
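    For readers unfamiliar with the "ASCII art" pattern syntax, the sketch below pairs a standard Cypher pattern with a toy Python evaluation over an in-memory property graph. The evaluator is purely illustrative and says nothing about how Cypher implementations actually execute queries.

```python
# The ASCII-art pattern (a:Person)-[:KNOWS]->(b:Person) in the Cypher query
#   MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name
# denotes all directed KNOWS edges between Person nodes. Below, a toy
# property graph and a naive evaluation of that pattern (illustrative only).

nodes = {
    1: {"labels": {"Person"}, "name": "Alice"},
    2: {"labels": {"Person"}, "name": "Bob"},
}
edges = [(1, "KNOWS", 2)]  # (source node, relationship type, target node)

result = [
    (nodes[src]["name"], nodes[dst]["name"])
    for (src, rel, dst) in edges
    if rel == "KNOWS"
    and "Person" in nodes[src]["labels"]
    and "Person" in nodes[dst]["labels"]
]
print(result)  # [('Alice', 'Bob')]
```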

    An Approach to Data Partitioning in the Cloud Based on Affinity Relations in Graphs

    Advisor: Prof. Dr. Carmem Satie Hara. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defended: Curitiba, 21/07/2014. Includes references.
    The new challenges in data management have been referred to as Big Data. This term is related to an increasing number of applications characterized by generating data of a variety of types and huge volume, and by requiring high-velocity processing. At the same time, cloud computing technologies are transforming the operational and economic aspects of computing, mainly due to the introduction of infrastructures for deploying scalable services. Cloud platforms have been successfully applied to support data management and address Big Data challenges through a database service in the cloud (DaaS, Database as a Service). One approach to scaling applications that process massive amounts of information is to fragment huge datasets and allocate them across distributed data servers. In this context, the main problem is to apply a partitioning scheme that maximizes local query processing and avoids the cost of message passing among servers. Besides this problem, data variety and the ever-increasing volume of cloud datastores pose new challenges to traditional partitioning approaches. This thesis provides a new partitioning approach to scale query processing on cloud datastores. In order to minimize the cost of distributed queries, we apply heuristics based on workload data to identify the affinity among data items and cluster the most correlated data on the same server. We tackle data partitioning as a twofold problem. First, data fragmentation defines storage units with strongly correlated items. Then, data allocation collocates fragments that share correlated items. Data replication is applied to cluster related data as much as possible; however, data redundancy is controlled throughout the process. We focus on graph models given by the RDF and XML formats in order to support data variety. Our main contribution is a partitioning strategy defined over a summarized view of the dataset, given as a database schema. The result of the process is a set of partitioning templates, which can be used to partition an existing dataset as well as to maintain the partitioning when new data that conform to the schema and the workload are inserted into the dataset. This approach is suitable for dealing with the increasing volume of data in cloud datastores. Experimental results show that the proposed solution is effective in improving query performance in cloud datastores compared to related approaches.
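    The affinity heuristic can be pictured with a small sketch: count how often pairs of items are co-accessed in the workload, then greedily merge the most affine pairs into shared fragments. The code below is an illustrative simplification under assumed names, not the fragmentation and allocation algorithms of the thesis.

```python
# A toy sketch of workload-driven affinity fragmentation: items that are
# frequently accessed together land in the same fragment (assumed,
# simplified heuristic for illustration only).
from collections import Counter
from itertools import combinations

def fragment(workload, threshold):
    # workload: list of queries, each a set of accessed data items.
    affinity = Counter()
    for query in workload:
        for a, b in combinations(sorted(query), 2):
            affinity[(a, b)] += 1          # co-access count = affinity

    fragment_of = {}
    fragments = []
    # Greedily merge the most affine pairs; pairs where both items are
    # already placed are skipped.
    for (a, b), count in affinity.most_common():
        if count < threshold:
            break
        if a not in fragment_of and b not in fragment_of:
            fragments.append({a, b})
            fragment_of[a] = fragment_of[b] = fragments[-1]
        elif a in fragment_of and b not in fragment_of:
            fragment_of[a].add(b)
            fragment_of[b] = fragment_of[a]
        elif b in fragment_of and a not in fragment_of:
            fragment_of[b].add(a)
            fragment_of[a] = fragment_of[b]
    # Items never co-accessed above the threshold become singletons.
    items = set().union(*workload)
    fragments += [{x} for x in items if x not in fragment_of]
    return fragments

print(fragment([{"u1", "p1", "p2"}, {"u1", "p1"}, {"u2", "p3"}], threshold=2))
# u1 and p1 share a fragment; the remaining items become singletons.
```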