18 research outputs found

    Characterizing approximate-matching dependencies in formal concept analysis with pattern structures

    Functional dependencies (FDs) provide valuable knowledge on the relations between attributes of a data table. A functional dependency holds when the values of an attribute can be determined by another. It has been shown that FDs can be expressed in terms of partitions of tuples that are in agreement w.r.t. the values taken by some subsets of attributes. To extend the use of FDs, several generalizations have been proposed. In this work, we study approximate-matching dependencies that generalize FDs by relaxing the constraints on the attributes, i.e. agreement is based on a similarity relation rather than on equality. Such dependencies are attracting attention in the database field since they relax the crisp notion of FDs, extending their application to many different fields, such as data quality, data mining, behavior analysis, data cleaning or data partitioning, among others. We show that these dependencies can be formalized in the framework of Formal Concept Analysis (FCA) using a previous formalization introduced for standard FDs. Our new results state that, starting from the conceptual structure of a pattern structure, and generalizing the notion of relation between tuples, approximate-matching dependencies can be characterized as implications in a pattern concept lattice. We finally show how to use basic FCA algorithms to construct a pattern concept lattice that entails these dependencies after a slight and tractable binarization of the original data.
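
To make the relaxed notion of agreement concrete, here is a minimal sketch (not from the paper) of checking whether a dependency X -> Y holds approximately over a small table: two tuples "agree" on an attribute when their values are within a similarity threshold rather than strictly equal. The table, attribute names, and per-attribute thresholds are illustrative assumptions.

```python
# Minimal sketch: checking an approximate-matching dependency X -> Y,
# where tuple agreement uses a similarity relation instead of equality.
# The table and the numeric tolerances below are illustrative assumptions.

from itertools import combinations

table = [
    {"city": "Paris", "zip": 75001, "temp": 12.0},
    {"city": "Paris", "zip": 75002, "temp": 12.4},
    {"city": "Lyon",  "zip": 69001, "temp": 15.1},
]

def similar(a, b, tol):
    """Two values 'agree' if they are within tolerance (0 means strict equality)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= tol
    return a == b

def holds(table, lhs, rhs, tol):
    """lhs -> rhs holds if every pair of tuples that agrees on all lhs
    attributes also agrees on all rhs attributes."""
    for t1, t2 in combinations(table, 2):
        if all(similar(t1[a], t2[a], tol.get(a, 0)) for a in lhs):
            if not all(similar(t1[a], t2[a], tol.get(a, 0)) for a in rhs):
                return False
    return True

# city -> temp holds when temperatures within 0.5 degrees count as agreeing.
print(holds(table, ["city"], ["temp"], tol={"temp": 0.5}))  # True
print(holds(table, ["city"], ["temp"], tol={}))             # False: the strict FD fails
```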

    Discovering Foreign Keys on Web Tables with the Crowd

    The foreign-key relationship is one of the most important constraints between two tables. Previous work has focused on detecting inclusion dependencies (INDs) or foreign keys in relational databases. Discovering foreign-key relationships is clearly helpful for analyzing and integrating data in web tables. However, because of the poor quality of web tables, it is difficult to discover foreign keys with existing techniques based on checking basic integrity constraints. In this paper, we propose a hybrid human-machine framework to detect foreign keys on web tables. After discovering candidates and evaluating their confidence of being true foreign keys with a machine algorithm, we verify those candidates by leveraging the power of the crowd. To reduce the monetary cost, a dynamic task-selection technique based on conflict detection and inclusion dependencies is proposed, which eliminates redundant tasks and assigns the most valuable tasks to workers. Additionally, to help workers complete tasks more effectively and efficiently, a sampling strategy is applied to minimize the number of tuples posed to the crowd. We conducted extensive experiments on real-world datasets, and the results show that our framework noticeably improves foreign-key detection accuracy on web tables at lower monetary and time cost.
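
As a rough illustration of the machine side of such a pipeline, the sketch below scores a foreign-key candidate by how much of one column is included in another (an approximate inclusion dependency); dirty web tables rarely satisfy a strict IND, so a partial-inclusion score is used to decide which candidates are worth asking the crowd about. The tables, column names, and threshold are illustrative assumptions, not the paper's algorithm.

```python
# Sketch: scoring a foreign-key candidate between two web tables by
# approximate inclusion dependency (fraction of values covered).
# Tables, column names, and the 0.8 threshold are illustrative assumptions.

def inclusion_score(fk_values, pk_values):
    """Fraction of non-null candidate FK values that appear in the referenced column."""
    fk = [v for v in fk_values if v is not None]
    pk = set(pk_values)
    if not fk:
        return 0.0
    return sum(1 for v in fk if v in pk) / len(fk)

orders_city = ["Paris", "Lyon", "Paris", "Berlin", None]
cities_name = ["Paris", "Lyon", "Marseille"]

score = inclusion_score(orders_city, cities_name)
print(f"inclusion score = {score:.2f}")   # 0.75: dirty data, not a strict IND

# Candidates above a confidence threshold would be sent to crowd workers
# for verification; the rest are pruned to save monetary cost.
CROWD_THRESHOLD = 0.8
print("send to crowd" if score >= CROWD_THRESHOLD else "prune candidate")
```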

    Flexibility in Data Management

    With the ongoing expansion of information technology, new fields of application requiring data management emerge virtually every day. In our knowledge culture, increasing amounts of data and a workforce organized in more creativity-oriented ways also radically change traditional fields of application and question established assumptions about data management. For instance, investigative analytics and agile software development move towards a very agile and flexible handling of data. As the primary facilitators of data management, database systems have to reflect and support these developments. However, traditional database management technology, in particular relational database systems, is built on assumptions of relatively stable application domains. The need to model all data up front in a prescriptive database schema earned relational database management systems the reputation among developers of being inflexible, dated, and cumbersome to work with. Nevertheless, relational systems still dominate the database market. They are a proven, standardized, and interoperable technology, well known in IT departments with a workforce of experienced and trained developers and administrators. This thesis aims at resolving the growing contradiction between the popularity and omnipresence of relational systems in companies and their increasingly bad reputation among developers. It adapts relational database technology towards more agility and flexibility. We envision a descriptive schema-comes-second relational database system, which is entity-oriented instead of schema-oriented; descriptive rather than prescriptive. The thesis provides four main contributions: (1) a flexible relational data model, which frees relational data management from having a prescriptive schema; (2) autonomous physical entity domains, which partition self-descriptive data according to their schema properties for better query performance; (3) a freely adjustable storage engine, which allows adapting the physical data layout to the properties of the data and of the workload; and (4) a self-managed indexing infrastructure, which autonomously collects and adapts index information under the presence of dynamic workloads and evolving schemas. The flexible relational data model is the thesis' central contribution. It describes the functional appearance of the descriptive schema-comes-second relational database system. The other three contributions improve components in the architecture of database management systems to increase the query performance and the manageability of descriptive schema-comes-second relational database systems. We are confident that these four contributions can help pave the way to a more flexible future for relational database management technology.
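
To give a flavour of what "schema-comes-second" means in practice, the following sketch stores self-descriptive records without declaring a schema up front and then derives a descriptive schema from the attributes that actually occur. It is an assumption-laden illustration of the idea, not the thesis' data model or storage engine.

```python
# Sketch: entity-oriented, schema-comes-second storage.
# Records are self-descriptive; a descriptive schema is derived afterwards
# from the attributes that actually occur. Purely illustrative.

from collections import defaultdict

store = []  # entities are inserted without a prescriptive schema

store.append({"name": "Alice", "email": "alice@example.org"})
store.append({"name": "Bob", "phone": "+49 151 0000000"})
store.append({"name": "Carol", "email": "carol@example.org", "phone": None})

def describe(entities):
    """Derive a descriptive schema: per attribute, how often it occurs and which types appear."""
    schema = defaultdict(lambda: {"count": 0, "types": set()})
    for entity in entities:
        for attr, value in entity.items():
            schema[attr]["count"] += 1
            schema[attr]["types"].add(type(value).__name__)
    return dict(schema)

for attr, info in describe(store).items():
    print(attr, info)
# The derived schema describes the data as it is, instead of prescribing
# what the data must look like before any record can be inserted.
```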

    SPARQL query optimization over distributed RDF databases

    Advisor: Profa. Dra. Carmem Satie Hara. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 07/04/2017. Includes references: f. 83-85.
    Abstract: RDF has been used by many applications due to its simplicity and flexibility in data modeling. Due to the huge volume of RDF data that exists nowadays, many distributed query processing approaches have been proposed aiming to ensure scalability for these applications. In general, these approaches propose data distribution methods promoting distributed and parallel SPARQL query processing. However, while distribution may provide storage scalability, it may also incur high communication costs for processing queries. This work presents a parallel and distributed query processing approach that aims to minimize the communication cost. The approach explores the existence of data allocation patterns (PAs) for data distribution, provided by a controlled data distribution method, that determine how RDF triples should be grouped and stored on the same server. Fragments of the RDF datastore follow a given allocation pattern. The approach generates execution plans based on this distribution model, making possible the choice between two communication strategies for query processing: get-frag and send-result. With the get-frag strategy, a server requests remote servers to send fragments that contain data required by a query. The send-result strategy, on the other hand, forwards intermediate results to other servers to continue the query processing. These strategies are combined in a method, called 2ways, that chooses the adequate communication strategy whenever queries traverse fragment boundaries. The choice of the communication strategy is based on the number of requests and the volume of the data to be transmitted. Experimental results show that our proposed technique effectively reduces the communication cost and improves the response time for processing SPARQL queries on a distributed RDF datastore. Finally, considering that RDF datasets are dynamic and may be updated by delete/insert operations, this work extends the query processing approach considering that not all newly inserted data may conform to the predefined allocation patterns. We define a special-purpose type of PA, called PaOverflow, for storing data that cannot be categorized by existing PAs. Consequently, the PaOverflow must be considered in query planning and processing. An initial experimental study shows that, as expected, adopting the PaOverflow can increase the response time for processing queries with the proposed approach. Keywords: RDF, SPARQL, Distributed Query Processing, Query Optimization.
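
The choice between the two communication strategies can be pictured as a simple cost comparison. The sketch below picks get-frag or send-result at a fragment boundary by comparing the estimated size of the remote fragment against the size of the intermediate results that would otherwise be shipped; the cost model and the numbers are illustrative assumptions, not the thesis' actual estimator.

```python
# Sketch: choosing a communication strategy when a query crosses a fragment
# boundary, in the spirit of the 2ways method. The cost model (bytes to
# transfer plus a fixed per-message overhead) is an illustrative assumption.

MESSAGE_OVERHEAD = 1.0  # relative cost of one request/response round trip

def choose_strategy(fragment_size, intermediate_size):
    """Return 'get-frag' if fetching the remote fragment is cheaper than
    shipping the current intermediate results to the remote server."""
    get_frag_cost    = MESSAGE_OVERHEAD + fragment_size      # pull the fragment here
    send_result_cost = MESSAGE_OVERHEAD + intermediate_size  # push partial results there
    return "get-frag" if get_frag_cost <= send_result_cost else "send-result"

# Small intermediate result, large remote fragment: ship the results.
print(choose_strategy(fragment_size=10_000, intermediate_size=200))   # send-result
# Large intermediate result, small remote fragment: pull the fragment.
print(choose_strategy(fragment_size=300, intermediate_size=50_000))   # get-frag
```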

    Distributed query evaluation and reasoning for the RDF model in peer-to-peer networks

    With the interest in Semantic Web applications rising rapidly, the Resource Description Framework (RDF) and its accompanying vocabulary description language, RDF Schema (RDFS), have become one of the most widely used data models for representing and integrating structured information on the Web. With the vast amount of available RDF data sources on the Web increasing rapidly, there is an urgent need for RDF data management. In this thesis, we focus on distributed RDF data management in peer-to-peer (P2P) networks. More specifically, we present results that advance the state of the art in distributed RDF query processing and reasoning in P2P networks. We fully design and implement a P2P system, called Atlas, for the distributed query processing and reasoning of RDF and RDFS data. Atlas is built on top of distributed hash tables (DHTs), a commonly used case of P2P networks. Initially, we study RDFS reasoning algorithms on top of DHTs. We design and develop distributed forward and backward chaining algorithms, as well as an algorithm which works in a bottom-up fashion using the magic sets transformation technique. We study theoretically the correctness of our reasoning algorithms and prove that they are sound and complete. We also provide a comparative study of our algorithms both analytically and experimentally. In the experimental part of our study, we obtain measurements in the realistic large-scale distributed environment of PlanetLab as well as in the more controlled environment of a local cluster. Moreover, we propose algorithms for SPARQL query processing and optimization over RDF(S) databases stored on top of distributed hash tables. We fully implement and evaluate a DHT-based optimizer. The goal of the optimizer is to minimize the time for answering a query as well as the bandwidth consumed during the query evaluation. The optimization algorithms use selectivity estimates to determine the chosen query plan. Our algorithms and techniques have been extensively evaluated in a local cluster.
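
As a toy illustration of the forward chaining that the thesis distributes over a DHT, the sketch below saturates a small set of triples under two standard RDFS entailment rules (subclass transitivity and type propagation). It is a centralized, in-memory simplification; the data and the rule subset are illustrative assumptions.

```python
# Toy forward chaining over two RDFS entailment rules:
#   rdfs11: (c1 subClassOf c2), (c2 subClassOf c3) => (c1 subClassOf c3)
#   rdfs9:  (x type c1), (c1 subClassOf c2)        => (x type c2)
# Centralized, in-memory sketch; Atlas runs such rules distributed over a DHT.

TYPE, SUB = "rdf:type", "rdfs:subClassOf"

triples = {
    ("ex:atlas", TYPE, "ex:P2PSystem"),
    ("ex:P2PSystem", SUB, "ex:DistributedSystem"),
    ("ex:DistributedSystem", SUB, "ex:System"),
}

def saturate(kb):
    """Apply the two rules until no new triples can be derived (fixpoint)."""
    kb = set(kb)
    changed = True
    while changed:
        changed = False
        derived = set()
        for (s1, p1, o1) in kb:
            for (s2, p2, o2) in kb:
                if p1 == SUB and p2 == SUB and o1 == s2:
                    derived.add((s1, SUB, o2))   # rdfs11
                if p1 == TYPE and p2 == SUB and o1 == s2:
                    derived.add((s1, TYPE, o2))  # rdfs9
        if not derived <= kb:
            kb |= derived
            changed = True
    return kb

for t in sorted(saturate(triples)):
    print(t)
# Derives, among others, ex:atlas rdf:type ex:DistributedSystem and ex:System.
```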

    Efficient asymmetric inclusion of regular expressions with interleaving and counting for XML type-checking

    The inclusion of Regular Expressions (REs) is the kernel of any type-checking algorithm for XML manipulation languages. XML applications would benefit from the extension of REs with interleaving and counting, but this is not feasible in general, since inclusion is EXPSPACE-complete for such extended REs. In Colazzo et al. (2009) [1] we introduced a notion of "conflict-free REs", which are extended REs with excellent complexity behaviour, including a polynomial inclusion algorithm [1] and linear membership (Ghelli et al., 2008 [2]). Conflict-free REs have interleaving and counting, but the complexity is tamed by the "conflict-free" limitations, which have been found to be satisfied by the vast majority of the content models published on the Web. However, a type-checking algorithm needs to compare machine-generated subtypes against human-defined supertypes. The conflict-free restriction, while quite harmless for the human-defined supertype, is far too restrictive for the subtype. We show here that the PTIME inclusion algorithm can actually be extended to deal with totally unrestricted REs with counting and interleaving in the subtype position, provided that the supertype is conflict-free. This is exactly the expressive power that we need in order to use subtyping inside type-checking algorithms, and the cost of this generalized algorithm is only quadratic, which is as good as the best algorithm we have for the symmetric case (see [1]). The result is extremely surprising, since we had previously found that symmetric inclusion becomes NP-hard as soon as the candidate subtype is enriched with binary intersection, a generalization that looked much more innocent than what we achieve here.
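
To make the flavour of the restriction concrete, the sketch below checks one condition commonly associated with conflict-free content models: every element name occurs at most once in the expression, so that, for instance, a[1..3] & (b | c) passes while (a, b) | (b, a) does not. The concrete syntax and the focus on single occurrence are illustrative simplifications of the definition in [1], not the full definition.

```python
# Sketch: checking the "each symbol occurs at most once" condition, one of the
# ingredients usually associated with conflict-free content models in [1].
# The tokenizer (identifiers are element names; &, |, commas, parentheses and
# [m..n] counters are operators) is an illustrative simplification.

import re
from collections import Counter

def symbols(expression):
    """Extract element names, ignoring operators and [m..n] counters."""
    without_counters = re.sub(r"\[\d+\.\.(\d+|\*)\]", " ", expression)
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_counters)

def single_occurrence(expression):
    counts = Counter(symbols(expression))
    return all(n == 1 for n in counts.values())

print(single_occurrence("a[1..3] & (b | c)"))  # True: each name appears once
print(single_occurrence("(a, b) | (b, a)"))    # False: 'a' and 'b' repeat
```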

    Automated Data Preparation using Semantics of Data Science Artifacts

    Data preparation is critical for improving model accuracy. However, data scientists often work independently, spending most of their time writing code to identify and select relevant features, and to enrich, clean, and transform their datasets to train predictive models for solving a machine learning problem. Working in isolation from each other, they lack support to learn from what other data scientists have done on similar datasets. This thesis addresses these challenges by presenting a novel approach that automates data preparation using the semantics of data science artifacts. To this end, this work proposes KGFarm, a holistic platform for automating data preparation based on machine learning models trained using the semantics of data science artifacts, captured as a knowledge graph (KG). These semantics comprise datasets and pipeline scripts. KGFarm seamlessly integrates with existing data science platforms, effectively enabling scientific communities to automatically discover and learn from each other’s work. KGFarm’s models were trained on top of a KG constructed from the 1000 top-rated Kaggle datasets and the 13,800 pipeline scripts with the highest number of votes. Our comprehensive evaluation uses 130 unseen datasets collected from different AutoML benchmarks to compare KGFarm against state-of-the-art systems in data cleaning, data transformation, feature selection, and feature engineering tasks. Our experiments show that KGFarm consumes significantly less time and memory than the state-of-the-art systems while achieving comparable or better accuracy. Hence, KGFarm effectively handles large-scale datasets and empowers data scientists to automate data preparation pipelines interactively.
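
The kind of "semantics of data science artifacts" described above can be pictured as triples linking columns to the transformations that past pipelines applied to them. The toy sketch below stores a handful of such triples and looks up which transformation was most often applied to columns of a given type, as a stand-in for the recommendations KGFarm's trained models produce; the predicate names and example data are illustrative assumptions, not KGFarm's actual graph schema.

```python
# Toy sketch: pipeline semantics captured as (subject, predicate, object) triples
# and used to suggest a transformation for a new column. The predicates and
# example data are illustrative assumptions, not KGFarm's graph schema.

from collections import Counter

kg = [
    ("titanic.age",   "hasType",          "numeric"),
    ("titanic.age",   "appliedTransform", "impute_median"),
    ("housing.price", "hasType",          "numeric"),
    ("housing.price", "appliedTransform", "log_scale"),
    ("credit.income", "hasType",          "numeric"),
    ("credit.income", "appliedTransform", "impute_median"),
]

def suggest_transform(column_type):
    """Recommend the transformation most often applied to columns of this type
    in previously seen pipelines."""
    columns = {s for s, p, o in kg if p == "hasType" and o == column_type}
    applied = [o for s, p, o in kg if p == "appliedTransform" and s in columns]
    return Counter(applied).most_common(1)[0][0] if applied else None

print(suggest_transform("numeric"))  # impute_median (seen twice vs. once)
```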

    Adapting information retrieval to user needs in an evolving web environment

    [no abstract]

    Scalable diversification for data exploration platforms
