
    Using Ontologies for Semantic Data Integration

    While big data analytics is considered one of the most important paths to competitive advantage for today’s enterprises, data scientists spend a comparatively large amount of time in the data preparation and data integration phases of a big data project. This shows that data integration is still a major challenge in IT applications. Over the past two decades, the idea of using semantics for data integration has become increasingly crucial and has received much attention in the AI, database, web, and data mining communities. Here, we focus on a specific paradigm for semantic data integration, called Ontology-Based Data Access (OBDA). The goal of this paper is to provide an overview of OBDA, pointing out both the techniques that are at the basis of the paradigm and the main challenges that remain to be addressed.

    Ontology-Based Data Access and Integration

    An ontology-based data integration (OBDI) system is an information management system consisting of three components: an ontology, a set of data sources, and the mapping between the two. The ontology is a conceptual, formal description of the domain of interest to a given organization (or a community of users), expressed in terms of relevant concepts, attributes of concepts, relationships between concepts, and logical assertions characterizing the domain knowledge. The data sources are the repositories accessible by the organization where data concerning the domain are stored. In the general case, such repositories are numerous and heterogeneous, each one managed and maintained independently of the others. The mapping is a precise specification of the correspondence between the data contained in the data sources and the elements of the ontology. The main purpose of an OBDI system is to allow information consumers to query the data using the elements of the ontology as predicates. In the special case where the organization manages a single data source, the term ontology-based data access (OBDA) system is used.
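
    The following is a minimal sketch of the three OBDI components described above, using plain Python structures in place of a real ontology language, database, and mapping formalism. All concrete names (Employee, worksFor, the hr_rows table) are illustrative assumptions, not taken from the paper.

```python
# 1. Data source: rows as they might come from one autonomous repository.
hr_rows = [
    {"emp_id": 1, "name": "Ada", "dept": "R&D"},
    {"emp_id": 2, "name": "Bob", "dept": "Sales"},
]

# 2. Mapping: a precise specification of how source columns populate
#    ontology predicates. Each entry turns a source row into facts.
def mapping(row):
    yield ("Employee", f"emp:{row['emp_id']}", None)                     # concept assertion
    yield ("worksFor", f"emp:{row['emp_id']}", f"dept:{row['dept']}")    # role assertion

# 3. Ontology-level query: consumers ask in terms of ontology predicates,
#    never in terms of the source schema.
def query(predicate, facts):
    return [f for f in facts if f[0] == predicate]

facts = [f for row in hr_rows for f in mapping(row)]
print(query("worksFor", facts))
# [('worksFor', 'emp:1', 'dept:R&D'), ('worksFor', 'emp:2', 'dept:Sales')]
```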

    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization.
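
    As a rough conceptual sketch (this is not Calcite's Java API), the idea behind such a rule-based optimizer is that plans are trees of relational operators and rewrite rules fire on matching subtrees. The single rule below pushes a Filter beneath a Project, one classic example of the kind of rewrite such optimizers apply; operator names and the toy plan are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Project:
    child: object
    columns: tuple

@dataclass
class Filter:
    child: object
    predicate: str

def push_filter_below_project(node):
    # A rule fires only when its pattern matches: here, Filter over Project.
    # A real rule would also check that the predicate only uses columns
    # that are still available below the Project.
    if isinstance(node, Filter) and isinstance(node.child, Project):
        proj = node.child
        return Project(Filter(proj.child, node.predicate), proj.columns)
    return node

plan = Filter(Project(Scan("orders"), ("id", "amount")), "amount > 100")
print(push_filter_below_project(plan))
# Project(child=Filter(child=Scan(table='orders'), predicate='amount > 100'),
#         columns=('id', 'amount'))
```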

    A unified view of data-intensive flows in business intelligence systems : a survey

    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next-generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still need to be addressed, and show how current solutions can be applied to address them.
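
    A toy sketch of the two kinds of flows the survey contrasts: a batched ETL step that populates a warehouse table, and an operational flow that applies the same transformation to records as they arrive. The schema, transform, and in-memory "warehouse" are illustrative assumptions, not part of the survey.

```python
warehouse = []  # stand-in for a data warehouse table

def transform(record):
    # Shared cleansing/conforming logic used by both flows.
    return {"customer": record["customer"].strip().title(),
            "revenue": round(float(record["revenue"]), 2)}

def batch_etl(source_rows):
    # Traditional batched ETL: extract everything, transform, bulk-load.
    warehouse.extend(transform(r) for r in source_rows)

def on_event(record):
    # Operational / near-real-time flow: integrate one record at a time.
    warehouse.append(transform(record))

batch_etl([{"customer": " acme ", "revenue": "1200.456"}])
on_event({"customer": "globex", "revenue": "99.9"})
print(warehouse)
# [{'customer': 'Acme', 'revenue': 1200.46}, {'customer': 'Globex', 'revenue': 99.9}]
```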

    Logic-Based Decision Support for Strategic Environmental Assessment

    Strategic Environmental Assessment is a procedure aimed at introducing systematic assessment of the environmental effects of plans and programs. The procedure is based on so-called coaxial matrices that define dependencies between plan activities (infrastructures, plants, resource extractions, buildings, etc.) and positive and negative environmental impacts, and dependencies between these impacts and environmental receptors. Up to now, this procedure has been carried out manually by environmental experts to check the environmental effects of a given plan or program, but it is never applied during plan/program construction. A decision support system, based on a clear logical semantics, would be an invaluable tool not only for assessing a single, already defined plan, but also during the planning process, in order to produce an optimized, environmentally assessed plan and to study possible alternative scenarios. We propose two logic-based approaches to the problem, one based on Constraint Logic Programming and one on Probabilistic Logic Programming, which could in the future be conveniently merged to exploit the advantages of both. We test the proposed approaches on a real energy plan and discuss their limitations and advantages.
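
    A minimal sketch of the coaxial-matrix idea: one matrix links plan activities to environmental impacts, a second links impacts to receptors, and the effects of a plan follow by chaining the two. The activities, impacts, and receptors named here are illustrative assumptions, not the paper's actual matrices or its logic-based encodings.

```python
# Activity -> impacts matrix (only the nonzero entries, as sets).
activity_impact = {
    "road":  {"noise", "habitat_loss"},
    "plant": {"emissions", "noise"},
}

# Impact -> receptors matrix.
impact_receptor = {
    "noise":        {"population"},
    "habitat_loss": {"fauna", "flora"},
    "emissions":    {"air_quality", "population"},
}

def assess(plan_activities):
    """Return the impacts and receptors affected by a plan, by chaining the matrices."""
    impacts = set().union(*(activity_impact[a] for a in plan_activities))
    receptors = set().union(*(impact_receptor[i] for i in impacts))
    return impacts, receptors

print(assess({"road", "plant"}))
# ({'noise', 'habitat_loss', 'emissions'},
#  {'population', 'fauna', 'flora', 'air_quality'})
```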

    Magic Sets for Disjunctive Datalog Programs

    In this paper, a new technique for the optimization of (partially) bound queries over disjunctive Datalog programs with stratified negation is presented. The technique exploits the propagation of query bindings and extends the Magic Set (MS) optimization technique. An important feature of disjunctive Datalog is nonmonotonicity, which calls for nondeterministic implementations, such as backtracking search. A distinguishing characteristic of the new method is that the optimization can also be exploited during the nondeterministic phase. In particular, after some assumptions have been made during the computation, parts of the program may become irrelevant to a query under these assumptions. This allows for dynamic pruning of the search space. In contrast, the effect of the previously defined MS methods for disjunctive Datalog is limited to the deterministic portion of the process. In this way, the potential performance gain of the proposed method can be exponential, as is observed empirically. The correctness of MS is established thanks to a strong relationship between MS and unfounded sets that has not been studied in the literature before. This insight allows the method to be extended in a natural way to programs with stratified negation. The proposed method has been implemented in DLV and various experiments have been conducted. Experimental results on synthetic data confirm the utility of MS for disjunctive Datalog, and they highlight the computational gain that may be obtained by the new method with respect to the previously proposed MS methods for disjunctive Datalog programs. Further experiments on real-world data show the benefits of MS within an application scenario that has received considerable attention in recent years: the problem of answering user queries over possibly inconsistent databases originating from the integration of autonomous sources of information.
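
    A rough sketch of the intuition behind binding propagation for a bound query: instead of computing the whole relation bottom-up, first propagate the query binding to collect the relevant ("magic") constants, then evaluate only over them. The toy program below is plain, non-disjunctive reachability and is only meant to illustrate the binding propagation; the paper's method handles disjunction and stratified negation, which this sketch does not.

```python
edges = {("a", "b"), ("b", "c"), ("x", "y")}   # EDB facts (illustrative)

def magic_constants(query_constant):
    """Constants reachable from the bound argument: the 'magic' set."""
    relevant, frontier = {query_constant}, {query_constant}
    while frontier:
        frontier = {t for (s, t) in edges if s in frontier} - relevant
        relevant |= frontier
    return relevant

def reachable_from(query_constant):
    """Answer reachable(c, X) by evaluating only over edges with a relevant source."""
    magic = magic_constants(query_constant)
    facts = {(s, t) for (s, t) in edges if s in magic}
    changed = True
    while changed:
        # Naive bottom-up transitive closure, restricted to the relevant facts.
        new = {(s, u) for (s, t) in facts for (t2, u) in facts if t == t2} - facts
        changed = bool(new)
        facts |= new
    return {t for (s, t) in facts if s == query_constant}

print(reachable_from("a"))   # {'b', 'c'}; the irrelevant edge (x, y) never enters the closure
```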