45 research outputs found

    Optimizing inequality joins in Datalog with approximated constraint propagation

    Datalog systems evaluate joins over arithmetic (in)equalities as a naive generate-and-test of Cartesian products. We exploit aggregates in a source-to-source transformation to reduce the size of these Cartesian products and to improve performance. Our approach approximates the well-known propagation technique from Constraint Programming. Experimental evaluation shows good run-time speed-ups on a range of non-recursive as well as recursive programs. Furthermore, our technique improves upon the constraint magic set transformation approach previously reported in the literature.
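
    As an illustration of the idea rather than the authors' transformation, the following Python sketch contrasts naive generate-and-test evaluation of an inequality join with a version that first computes aggregate bounds, in the spirit of bounds propagation from Constraint Programming; the relations r and s and their contents are invented:

    # Hypothetical relations r(X) and s(Y); we want all pairs (x, y) with x < y.
    r = [3, 8, 15, 22, 41]
    s = [1, 4, 9, 12]

    # Naive generate-and-test: materialise the full Cartesian product, then filter.
    naive = [(x, y) for x in r for y in s if x < y]

    # Aggregate-based pruning (approximated constraint propagation): no x >= max(s)
    # and no y <= min(r) can participate in the join, so filter them out first.
    max_s = max(s)
    min_r = min(r)
    r_pruned = [x for x in r if x < max_s]
    s_pruned = [y for y in s if y > min_r]
    pruned = [(x, y) for x in r_pruned for y in s_pruned if x < y]

    assert sorted(naive) == sorted(pruned)
    print(len(r) * len(s), "candidate pairs naively vs",
          len(r_pruned) * len(s_pruned), "after pruning")

    In a Datalog setting, the min/max bounds would be computed with aggregate rules and the filters added by a source-to-source rewrite of the program.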

    Processing Rank-Aware Queries in Schema-Based P2P Systems

    In recent years, there has been considerable research on query processing in data integration and P2P systems. Conventional data integration systems consist of multiple sources with possibly different schemas, adhere to a hierarchical structure, and have a central component (the mediator) that manages a global schema. Queries are formulated against this global schema, and the mediator processes them by retrieving relevant data from the sources transparently to the user. Building on these systems, Peer Data Management Systems (PDMSs), also called schema-based P2P systems, eventually attracted attention. Peers participating in a PDMS can act both as mediators and as data sources, are autonomous, and might leave or join the network at will. For these reasons, peers often hold incomplete or erroneous data sets and mappings. The possibly huge amount of data available in such a network often results in large query result sets that are hard to manage, so retrieving the complete result set is in most cases difficult or even impossible. Applying rank-aware query operators such as top-N and skyline, possibly in conjunction with approximation techniques, is a remedy to these problems, as these operators select only those result records that are most relevant to the user. Since in most cases only a small fraction of the complete result set is actually output to the user, retrieving the complete set before evaluating such operators is obviously inefficient. Therefore, the questions we want to answer in this dissertation are how to compute such queries in PDMSs and how to do so efficiently. We propose strategies for efficient query processing in PDMSs that exploit the characteristics of rank-aware queries and optionally apply approximation techniques. A peer's relevance is determined on two levels, the schema level and the data level, and according to its relevance a peer is either considered for query processing or excluded from it. Because of the heterogeneity of the peers, queries need to be rewritten, enabling cooperation between peers that use different schemas. As existing query rewriting techniques mostly consider conjunctive queries only, we present an extension that allows for rewriting queries involving rank-aware query operators. As PDMSs are dynamic systems and peers might update their local data, this dissertation addresses not only how routing indexes are used to determine a peer's relevance on the data level within a query processing strategy, but also how they can be kept up to date. Finally, we provide a system-level evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems), a system created in the context of this dissertation that implements all presented techniques.
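
    To make the rank-aware intuition concrete, the following Python sketch (peer contents and ranking function are invented; this is not the dissertation's actual strategy) lets each peer ship only its local top-N under a user-defined ranking function, and a coordinator merge the partial results, so the complete result set is never materialised:

    import heapq

    # Hypothetical peer contents: each peer holds (offer, price) records.
    peers = {
        "peer_a": [("hotel1", 120), ("hotel2", 85), ("hotel3", 300)],
        "peer_b": [("hotel4", 95), ("hotel5", 210)],
        "peer_c": [("hotel6", 60), ("hotel7", 140), ("hotel8", 75)],
    }

    def local_top_n(records, n, score):
        """Each peer evaluates the ranking function locally and returns only its top-N."""
        return heapq.nsmallest(n, records, key=score)

    def top_n(peers, n, score):
        """The coordinator merges the partial results instead of the complete result set."""
        partial = []
        for records in peers.values():
            partial.extend(local_top_n(records, n, score))
        return heapq.nsmallest(n, partial, key=score)

    # Top-3 cheapest offers across all peers, without collecting every record.
    print(top_n(peers, 3, score=lambda rec: rec[1]))

    In the same spirit, a peer whose local data provably cannot reach the final top-N can be excluded from query processing altogether.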

    (I) A Declarative Framework for ERP Systems(II) Reactors: A Data-Driven Programming Model for Distributed Applications

    To those who can be swayed by argument and those who know they do not have all the answers.
    This dissertation is a collection of six adapted research papers pertaining to two areas of research. (I) A Declarative Framework for ERP Systems:
    • POETS: Process-Oriented Event-driven Transaction Systems. The paper describes an ontological analysis of a small segment of the enterprise domain, namely the general ledger and accounts receivable. The result is an event-based approach to designing ERP systems and an abstract-level sketch of the architecture.
    • Compositional Specification of Commercial Contracts. The paper describes the design, multiple semantics, and use of a domain-specific language (DSL) for modeling commercial contracts.
    • SMAWL: A SMAll Workflow Language Based on CCS. The paper shows
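
    Loosely illustrating the event-based view of ERP systems described in the POETS paper above (a minimal sketch with assumed event types, not the POETS design itself), an accounts-receivable balance can be derived by folding over a log of business events rather than kept as mutable state:

    from dataclasses import dataclass
    from typing import List

    # Hypothetical event types for a tiny accounts-receivable slice.
    @dataclass
    class InvoiceIssued:
        customer: str
        amount: float

    @dataclass
    class PaymentReceived:
        customer: str
        amount: float

    def receivable(events: List[object], customer: str) -> float:
        """Derive the outstanding balance purely from the event log."""
        balance = 0.0
        for e in events:
            if isinstance(e, InvoiceIssued) and e.customer == customer:
                balance += e.amount
            elif isinstance(e, PaymentReceived) and e.customer == customer:
                balance -= e.amount
        return balance

    log = [InvoiceIssued("acme", 1000.0), PaymentReceived("acme", 400.0)]
    print(receivable(log, "acme"))  # 600.0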

    Approximate query processing in a data warehouse using random sampling

    Data analysis routinely consumes large volumes of data. With the fast increase in both the volume of data and the complexity of analytic tasks, data processing becomes more complicated and expensive, so cost efficiency is a key factor in the design and deployment of data warehouse systems. Among the methods for making big data processing more efficient, approximate query processing, in which a small sample is used to answer the query, is a well-known approach to handling massive data. For many applications, a small error is justified by the resources saved in answering the query and by the reduced latency. We focus on approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective uses of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling optimized for population aggregate queries. Next, moving to more involved queries, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce sampling over the pipeline model of query processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, unlike in a single query, approximate results are fed into downstream queries; thus we see both aggregate computations over sampled input and aggregate computations over approximate input. We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and calculates confidence intervals for the produced approximate results. The key insight of the algorithm is to enrich the output of queries with additional information. This enables the algorithm to piggyback on the modular structure of the pipeline without having to perform any global rewrites, i.e., no extra query or table is added to the pipeline. Compared to the bootstrap method, the approach described in this paper provides the confidence interval while computing aggregation estimates only once, and avoids the need to maintain intermediary aggregation distributions. Our empirical study on public and private datasets shows that our sampling algorithm can have significantly (1.4 to 50.0 times) smaller variance than the Neyman algorithm for the optimal sample for population aggregate queries. Our experimental results for group-by queries show that our sampling algorithm outperforms the current state of the art on sample quality and estimation accuracy; the optimal sample yields relative errors that are 5x smaller than competing approaches under the same budget. The experiments for approximate pipeline processing show the high accuracy of the computed estimates, with an average error as low as 2% using only a 1% sample, and demonstrate the usefulness of the confidence interval: at a confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI boundaries between 70.49% and 95.15% of the time.
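
    A minimal Python sketch of the basic idea, not of the thesis's algorithms: estimate a population aggregate (here a SUM) from a stratified random sample and attach a normal-approximation confidence interval; the strata, sampling rate, and data are invented:

    import math
    import random
    import statistics

    random.seed(7)

    # Hypothetical warehouse table, pre-partitioned into strata (e.g. by region).
    strata = {
        "north": [random.gauss(100, 10) for _ in range(10_000)],
        "south": [random.gauss(500, 80) for _ in range(2_000)],
        "west":  [random.gauss(50, 5)   for _ in range(50_000)],
    }

    def estimate_sum(strata, rate=0.01, z=1.96):
        """Estimate SUM(value) from a stratified sample, with a ~95% confidence interval."""
        total, variance = 0.0, 0.0
        for rows in strata.values():
            n_h = max(2, int(len(rows) * rate))                   # per-stratum sample size
            sample = random.sample(rows, n_h)
            total += len(rows) * statistics.fmean(sample)         # scale up to the stratum total
            fpc = 1 - n_h / len(rows)                             # finite-population correction
            variance += (len(rows) ** 2) * statistics.variance(sample) / n_h * fpc
        half_width = z * math.sqrt(variance)
        return total, (total - half_width, total + half_width)

    estimate, ci = estimate_sum(strata)
    exact = sum(sum(rows) for rows in strata.values())
    print(f"estimate={estimate:,.0f}  CI={ci[0]:,.0f}..{ci[1]:,.0f}  exact={exact:,.0f}")

    In the pipeline setting described above, the key point is that query outputs are enriched with additional information of this kind so that downstream queries can still produce unbiased estimates and confidence intervals without any global rewrite.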

    Kiel Declarative Programming Days 2013

    This report contains the papers presented at the Kiel Declarative Programming Days 2013, held in Kiel (Germany) during September 11-13, 2013. The Kiel Declarative Programming Days 2013 unified the following events:
    * 20th International Conference on Applications of Declarative Programming and Knowledge Management (INAP 2013)
    * 22nd International Workshop on Functional and (Constraint) Logic Programming (WFLP 2013)
    * 27th Workshop on Logic Programming (WLP 2013)
    All these events are centered around declarative programming, an advanced paradigm for the modeling and solving of complex problems. These specification and implementation methods have attracted increasing attention over the last decades, e.g., in the domains of databases and natural language processing, for modeling and processing combinatorial problems, and for high-level programming of complex, in particular knowledge-based, systems.

    A semantic web rule language for geospatial domains

    Retrieval of geographically-referenced information on the Internet is now a common activity. The web is increasingly being seen as a medium for the storage and exchange of geographic data sets in the form of maps. The geospatial-semantic web (GeoWeb) is being developed to address the need for access to current and accurate geo-information. The potential applications of the GeoWeb are numerous, ranging from specialised application domains for storing and analysing geo-information to more common applications by casual users for querying and visualising geo-data, e.g. finding locations of services, descriptions of routes, etc. Ontologies are at the heart of the W3C's semantic web initiative to provide the necessary machine understanding of the sheer volumes of information contained on the Internet. For the GeoWeb to succeed, the development of ontologies for the geographic domain is crucial. Semantic web technologies to represent ontologies have been developed and standardised. OWL, the Web Ontology Language, is the most expressive of these, enabling a rich form of reasoning thanks to its formal description logic underpinnings. Building geo-ontologies involves a continuous process of updating the originally modelled data to reflect change over time, as well as expanding the ontology by integrating new data sets, possibly from different sources. One of the main challenges in this process is finding means of ensuring the integrity of the geo-ontology and maintaining its consistency as it evolves. Representing and reasoning with geographic ontologies in OWL is limited. Firstly, OWL is not an integrity-checking language, due to its non-unique name and open world assumptions. Secondly, it cannot represent spatial datatypes, cannot compute information using spatial operators, and does not have any form of spatial index. Finally, OWL does not support the complex property composition needed to represent qualitative spatial reasoning over spatial concepts. To address OWL's representational limitations, new ontology languages have been proposed based on the intersection or union of OWL (in particular, the DL family corresponding to OWL) with logic programs (rule languages). In this work, a new Semantic Web Spatial Rule Language (SWSRL) is proposed, based on the syntactic core of the Description Logic Programs (DLP) paradigm and the semantics of a logic program. The language is built to support the expression of geospatial ontological axioms and geospatial integrity and deduction rules. A hybrid framework is proposed to integrate qualitative symbolic information in SWSRL with quantitative, geometric information stored using spatial datatypes in a spatial database. Two notable features of SWSRL are: 1) the language is based on a prioritised default logic that allows the expression of default integrity rules and their exceptions, and 2) the implementation of the language uses an interleaved mode of inference for the on-the-fly computation (either qualitative or quantitative) of spatial relations. SWSRL supports an OGC-compliant spatial syntax and a standardised definition of rule metadata. Both features aid the construction, description, identification and categorisation of designed and implemented rules within large rule sets. The language and the developed engine are evaluated using synthetic as well as real data sets in the context of developing geographic ontologies for geographic information retrieval on the Semantic Web. Empirical experiments are also presented to test the scalability and applicability of the developed framework.
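
    To illustrate the hybrid qualitative/quantitative style described above, the following sketch (not SWSRL's actual syntax or engine; the shapely dependency, the toy regions, and the rules are assumptions) evaluates a deduction rule symbolically while the underlying spatial relation is computed on the fly from geometries:

    from itertools import combinations
    from shapely.geometry import Polygon

    # Quantitative layer: hypothetical named regions with their geometries.
    regions = {
        "parcel_a": Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        "parcel_b": Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),   # shares an edge with parcel_a
        "parcel_c": Polygon([(10, 10), (12, 10), (12, 12), (10, 12)]),
    }

    def touches(a: str, b: str) -> bool:
        """Spatial relation computed on the fly from geometries rather than stored as facts."""
        return regions[a].touches(regions[b])

    # Qualitative layer, deduction rule: neighbouring(X, Y) <- region(X), region(Y), touches(X, Y).
    def derive_neighbouring():
        return {(a, b) for a, b in combinations(regions, 2) if touches(a, b)}

    # Default integrity rule: every parcel should have at least one neighbour;
    # parcels violating it are reported as exceptions.
    neighbouring = derive_neighbouring()
    isolated = [r for r in regions if not any(r in pair for pair in neighbouring)]
    print("neighbouring:", neighbouring)
    print("integrity exceptions (isolated parcels):", isolated)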

    Mapper: an efficient data transformation operator

    Doctoral thesis in Informatics (Computer Engineering), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2008.
    Data transformations are fundamental operations in legacy data migration, data integration, data cleaning, and data warehousing. These operations are often implemented as relational queries that aim at leveraging the optimization capabilities of most DBMSs. However, relational query languages like SQL are not expressive enough to specify one-to-many data transformations, an important class of data transformations that produce several output tuples for a single input tuple. These transformations are required for solving several types of data heterogeneities, such as those that occur when the source data represents aggregations of the target data. This thesis proposes a new relational operator, named data mapper, as an extension to the relational algebra to address one-to-many data transformations, and focuses on its optimization. It also provides algebraic rewriting rules and execution algorithms for logical and physical optimization, respectively. As a result, queries may be expressed as a combination of standard relational operators and mappers. The proposed optimizations have been experimentally validated, and the key factors that influence the obtained performance gains have been identified.
    Keywords: Relational Algebra, Data Transformation, Data Integration, Data Cleaning, Data Warehousing
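
    As a hedged illustration of the one-to-many mapper concept, not the thesis's operator implementation, the following Python sketch produces several output tuples from a single input tuple, here unnesting a source row that aggregates what the target schema stores per shipping box; the schema and box capacity are invented:

    from typing import Iterable, Iterator, Tuple

    # Hypothetical source schema: (product, total_quantity), where one source row
    # aggregates what the target schema stores as one row per shipping box.
    Source = Tuple[str, int]
    Target = Tuple[str, int, int]   # (product, box_number, quantity_in_box)

    def box_mapper(row: Source, box_capacity: int = 400) -> Iterator[Target]:
        """A one-to-many mapping: a single input tuple yields several output tuples."""
        product, remaining = row
        box = 1
        while remaining > 0:
            qty = min(box_capacity, remaining)
            yield (product, box, qty)
            remaining -= qty
            box += 1

    def mapper(relation: Iterable[Source]) -> Iterator[Target]:
        """Apply the mapping to a whole relation, in the style of an extended relational operator."""
        for row in relation:
            yield from box_mapper(row)

    source = [("bolts", 1000), ("nuts", 300)]
    print(list(mapper(source)))
    # [('bolts', 1, 400), ('bolts', 2, 400), ('bolts', 3, 200), ('nuts', 1, 300)]

    A selection on the product attribute could be pushed below such a mapper because the mapper leaves that attribute unchanged, which is the flavour of algebraic rewriting the thesis studies for logical optimization.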