19 research outputs found

    Outerjoins as disjunctions


    A Method for Automatically Generating Join Queries Based on Relations-Attributes Distance Matrix over Data Lakes

    Techniques for identifying joinable or unionable tables in data lakes can yield valuable information for data scientists. However, data scientists spend more than half of their working time familiarizing themselves with the metadata and correlations of datasets. Simplifying the use of information in data lakes is crucial for enhancing their utilization. The existing solution of integrating correlated relations into a single large data table via full disjunction requires the integrated table to be updated whenever either data or metadata changes, complicating data maintenance. This paper proposes a method for automatically generating join queries based on the distance matrix of relations and attributes in data lakes. The distance matrix only requires updating when metadata changes, simplifying data maintenance. Experimental results demonstrate that once the distance matrix is generated, the time required to generate the join queries is negligible. Compared to the existing solution, the time cost for executing join queries over correlated tables is nearly identical to that of selection queries over integrated tables. The results of these two queries are also the same, showcasing the effectiveness and efficiency of our method.
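    The abstract does not spell out how the distance matrix is built, so the following is only a minimal sketch of the general idea under stated assumptions: a hypothetical joinability graph (`edges`, mapping a pair of tables to the columns they join on) is turned into an all-pairs distance matrix with Floyd-Warshall, and join queries are then read off a next-hop table. The function names, the graph encoding, and the table names in the usage lines are illustrative and not taken from the paper.

```python
import math


def build_distance_matrix(tables, edges):
    """All-pairs shortest paths (Floyd-Warshall) over a joinability graph.

    `edges` maps (table_a, table_b) -> (column_a, column_b); this encoding
    is an assumption made for the sketch, not the paper's data structure.
    """
    dist = {a: {b: (0 if a == b else math.inf) for b in tables} for a in tables}
    nxt = {a: {b: None for b in tables} for a in tables}
    for a, b in edges:
        dist[a][b] = dist[b][a] = 1
        nxt[a][b], nxt[b][a] = b, a
    for k in tables:
        for i in tables:
            for j in tables:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def generate_join_query(src, dst, edges, nxt):
    """Walk the next-hop table from src to dst and emit a SQL join chain."""
    if src != dst and nxt[src][dst] is None:
        return None  # the two tables are not connected
    sql = f"SELECT * FROM {src}"
    cur = src
    while cur != dst:
        step = nxt[cur][dst]
        if (cur, step) in edges:
            left, right = edges[(cur, step)]
        else:
            right, left = edges[(step, cur)]
        sql += f" JOIN {step} ON {cur}.{left} = {step}.{right}"
        cur = step
    return sql


# Hypothetical metadata for three tables.
tables = ["orders", "customers", "regions"]
edges = {
    ("orders", "customers"): ("customer_id", "id"),
    ("customers", "regions"): ("region_id", "id"),
}
dist, nxt = build_distance_matrix(tables, edges)
print(generate_join_query("orders", "regions", edges, nxt))
```

    In this sketch the matrix depends only on metadata (which columns are joinable), so it needs rebuilding only when that metadata changes, and generating a query afterwards is a short walk along the next-hop table.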

    Implementing a universal relation interface using access scripts with binding patterns

    We propose the use of the universal relation as a user interface to provide transparent access to a network of distributed, heterogeneous, and autonomous information sources. We implement this interface in two layers. The lower layer consists of access scripts, which encapsulate knowledge about information sources and are capable of answering basic queries. The upper layer uses combinations of these scripts to answer user queries phrased in terms of a universal relation. Access scripts know how to obtain information either directly from sources or from service providers (mediators, traders, and the like). They present this information in relational form, but with an inherent direction, in the sense that whenever values for a fixed subset of attributes of the relation are given, the access script will deliver values for the rest of the attributes in the relation. In this paper, we address the problem of defining the semantics of a user query posed against the universal relation and of finding a sequence of access script invocations that gathers the information requested in the query.
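    To make the notion of a binding pattern concrete, here is a minimal sketch that models each access script as a set of attributes that must be bound (`inputs`) and a set it delivers (`outputs`), and then searches for an invocation sequence by simple forward chaining. This is an illustrative assumption, not the planning algorithm of the paper; the `AccessScript` type, `plan_invocations`, and the script names are hypothetical.

```python
from typing import NamedTuple


class AccessScript(NamedTuple):
    name: str
    inputs: frozenset   # attributes that must be bound before invocation
    outputs: frozenset  # attributes the script delivers once invoked


def plan_invocations(scripts, bound, wanted):
    """Greedy forward chaining: repeatedly invoke any script whose inputs
    are already known and which contributes at least one new attribute.

    Returns a list of script names, or None if `wanted` cannot be reached.
    The plan is executable but not guaranteed to be minimal.
    """
    known = set(bound)
    wanted = set(wanted)
    remaining = list(scripts)
    plan = []
    progress = True
    while not wanted <= known and progress:
        progress = False
        for script in remaining:
            if script.inputs <= known and not script.outputs <= known:
                plan.append(script.name)
                known |= script.outputs
                remaining.remove(script)
                progress = True
                break  # restart the scan with the enlarged set of known attributes
    return plan if wanted <= known else None


scripts = [
    AccessScript("company_info", frozenset({"company"}), frozenset({"address"})),
    AccessScript("employer_dir", frozenset({"name"}), frozenset({"company"})),
    AccessScript("phone_book", frozenset({"name"}), frozenset({"phone"})),
]
print(plan_invocations(scripts, bound={"name"}, wanted={"address"}))
```

    The direction matters: `company_info` cannot run first because `company` is not yet bound, so the planner has to call `employer_dir` before it.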

    Semantic Integration of heterogeneous data sources in the MOMIS Data Transformation System

    In the last twenty years, many data integration systems following a classical wrapper/mediator architecture and providing a Global Virtual Schema (a.k.a. Global Virtual View - GVV) have been proposed by the research community. The main issues faced by these approaches range from system-level heterogeneities, through structural and syntactic heterogeneities, to heterogeneities at the semantic level. Despite the research effort, all the approaches proposed require a lot of user intervention for customizing and managing the data integration and reconciliation tasks. In some cases, the effort and the complexity of the task are huge, since it requires the development of specific program code. Unfortunately, because each problem is so specific, application code and solutions are rarely reusable in other domains. For this reason, the Lowell Report 2005 has provided the guideline for the definition of a public benchmark for the information integration problem. The proposal, called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches), focuses on how data integration systems manage syntactic and semantic heterogeneities, which are the greatest technical challenges in the field. We developed a Data Transformation System (DTS) that supports data transformation functions and produces query translations in order to push execution down to the sources. Our DTS is based on MOMIS, a mediator-based data integration system that our research group has been developing and supporting since 1999. In this paper, we show how the DTS is able to solve all twelve queries of the THALIA benchmark by using a simple combination of declarative translation functions already available in the standard SQL language. We think that this is a remarkable result, mainly for two reasons: first, to the best of our knowledge, no system has provided a complete answer to the benchmark; second, our queries do not require any overhead of new code.

    Querying Semantically Tagged Documents on the World-Wide Web


    An incremental algorithm for computing ranked full disjunctions

    The full disjunction is a variation of the join operator that maximally combines tuples from connected relations, while preserving all information in the relations. The full disjunction can be seen as a natural extension of the binary outerjoin operator to an arbitrary number of relations and is a useful operator for information integration. This paper presents the algorithm IncrementalFD for computing the full disjunction of a set of relations. IncrementalFD improves upon previous algorithms for computing the full disjunction in four ways. First, it has a lower total runtime when computing the full result and a lower runtime when computing only k tuples of the result, for any constant k. Second, for a natural class of ranking functions, IncrementalFD can be adapted to return tuples in ranking order. Third, a variation of IncrementalFD can be used to return approximate full disjunctions (which contain maximal approximately join consistent tuples). Fourth, IncrementalFD can be adapted to have a block-based execution, instead of a tuple-based execution.
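    As a rough illustration of what the operator computes, the sketch below is a brute-force full disjunction, not IncrementalFD. It enumerates every set of tuples with at most one tuple per relation, keeps the sets that are connected and pairwise join consistent, and outputs the maximal ones padded with nulls (`None`). It assumes relations are lists of dicts without nulls in the input, and it is exponential in the number of relations; all names are illustrative.

```python
from itertools import combinations, product


def join_consistent(t1, t2):
    """Two tuples agree on every attribute they have in common."""
    return all(t1[a] == t2[a] for a in t1.keys() & t2.keys())


def connected(tuples):
    """The tuples form a connected graph, linking tuples that share an attribute."""
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(tuples)


def full_disjunction(relations):
    """Naive full disjunction of a list of relations (lists of dicts)."""
    n = len(relations)
    candidates = []
    for size in range(1, n + 1):
        for rel_ids in combinations(range(n), size):
            for picks in product(*(range(len(relations[i])) for i in rel_ids)):
                members = frozenset(zip(rel_ids, picks))
                tuples = [relations[i][k] for i, k in members]
                if connected(tuples) and all(
                    join_consistent(a, b) for a, b in combinations(tuples, 2)
                ):
                    candidates.append(members)
    # Keep only tuple sets that are not strictly contained in another candidate.
    maximal = [c for c in candidates if not any(c < d for d in candidates)]
    attrs = sorted({a for rel in relations for t in rel for a in t})
    rows = set()
    for members in maximal:
        merged = {}
        for i, k in members:
            merged.update(relations[i][k])
        rows.add(tuple(merged.get(a) for a in attrs))  # None pads missing attributes
    return attrs, rows


R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"b": 10, "c": 100}]
T = [{"c": 100, "d": "x"}, {"c": 200, "d": "y"}]
attrs, rows = full_disjunction([R, S, T])
print(attrs)
for row in sorted(rows, key=repr):
    print(row)
```

    On this input the first tuples of `R`, `S`, and `T` combine into one full row, while the second tuple of `R` and the second tuple of `T` survive on their own with nulls in the attributes they lack, which is the information-preserving behavior the abstract describes.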

    Left Bit Right: For SPARQL Join Queries with OPTIONAL Patterns (Left-outer-joins)

    SPARQL basic graph pattern (BGP) (a.k.a. SQL inner-join) query optimization is a well researched area. However, optimization of OPTIONAL pattern queries (a.k.a. SQL left-outer-joins) poses additional challenges, due to the restrictions on the reordering of left-outer-joins. The occurrence of such queries tends to be as high as 50% of the total queries (e.g., DBPedia query logs). In this paper, we present Left Bit Right (LBR), a technique for well-designed nested BGP and OPTIONAL pattern queries. Through LBR, we propose a novel method to represent such queries using a graph of supernodes, which is used to aggressively prune the RDF triples, with the help of compressed indexes. We also propose novel optimization strategies, the first of their kind to the best of our knowledge, that combine the characteristics of acyclicity of queries, minimality, and the nullification and best-match operators. In this paper, we focus on OPTIONAL patterns without UNIONs or FILTERs, but we also show how UNIONs and FILTERs can be handled with our technique using a query rewrite. Our evaluation on RDF graphs of up to and over one billion triples, on a commodity laptop with 8 GB memory, shows that LBR can process well-designed low-selectivity complex queries up to 11 times faster than state-of-the-art RDF column stores such as Virtuoso and MonetDB, and for highly selective queries, LBR is on par with them. Comment: SIGMOD 201
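    The reordering restriction mentioned in this abstract can be seen on a tiny example. The sketch below is not LBR; it only models solution mappings as Python dicts and implements a plain join and a left-outer-join (OPTIONAL) to show that moving an inner join across a left-outer-join can change the result. The helper names and the sample bindings are illustrative.

```python
def compatible(m1, m2):
    """Two solution mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(left, right):
    """Inner join: merge every compatible pair of mappings."""
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def left_join(left, right):
    """Left-outer-join (OPTIONAL): keep a left mapping even if nothing matches."""
    out = []
    for l in left:
        matches = [{**l, **r} for r in right if compatible(l, r)]
        out.extend(matches if matches else [l])
    return out


A = [{"x": 1}]
B = [{"x": 1, "y": 2}]
C = [{"y": 3, "z": 4}]

nested = left_join(A, join(B, C))     # A OPTIONAL (B . C)
reordered = join(left_join(A, B), C)  # (A OPTIONAL B) . C
print(nested)
print(reordered)
```

    Here `nested` keeps the mapping from `A` because the optional side is empty, whereas `reordered` is empty because the inner join with `C` discards it; an optimizer therefore cannot treat OPTIONAL like an ordinary join when choosing an evaluation order.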