15 research outputs found

    Querying with access patterns and integrity constraints


    InterPoll: Crowd-Sourced Internet Polls

    Crowd-sourcing is increasingly being used to provide answers to online polls and surveys. However, existing systems, while taking care of the mechanics of attracting crowd workers, poll building, and payment, provide little to help the survey-maker or pollster obtain statistically significant results free of even the most obvious selection biases. This paper proposes InterPoll, a platform for programming crowd-sourced polls. Pollsters express polls as embedded LINQ queries, and the runtime reasons correctly about uncertainty in those polls, polling only as many people as required to meet statistical guarantees. To lower the cost of polls, InterPoll performs query optimization as well as bias correction and power analysis. The goal of InterPoll is to provide a system that can be reliably used for research into marketing, social, and political science questions. This paper highlights some of the existing challenges and how InterPoll is designed to address most of them; we summarize the work we have already done and give an outline of future work.
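    The statistical core of such a runtime is the power analysis that decides when enough people have been polled. Below is a minimal sketch of that calculation for a simple proportion poll, using only the Python standard library and the textbook normal-approximation sample size formula; the function name and defaults are illustrative, not part of InterPoll's actual API.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(margin_of_error: float,
                         confidence: float = 0.95,
                         p_hat: float = 0.5) -> int:
    """Smallest n such that a proportion estimate has the given
    margin of error at the given confidence level, via the normal
    approximation n >= z^2 * p(1 - p) / e^2. The default
    p_hat = 0.5 is the worst case and needs the most respondents."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    return ceil(z * z * p_hat * (1 - p_hat) / margin_of_error ** 2)

# Poll only as many workers as the statistical guarantee demands:
print(required_sample_size(0.03))        # +/-3 points at 95% -> 1068 people
print(required_sample_size(0.05, 0.99))  # +/-5 points at 99% -> 664 people
```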

    Web and Semantic Web Query Languages

    A number of techniques have been developed to facilitate powerful data retrieval on the Web and Semantic Web. Three categories of Web query languages can be distinguished according to the format of the data they can retrieve: XML, RDF, and Topic Maps. This article introduces the spectrum of languages falling into these categories and summarises their salient aspects. The languages are introduced using common sample data and query types, and key aspects of the query languages considered are stressed in a conclusion.
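    To make the comparison concrete, the sketch below (not taken from the article) runs one common query type, selecting the titles of all books, over the same sample data in two of the three categories: XPath over XML via Python's standard library, and SPARQL over RDF via the third-party rdflib package.

```python
import xml.etree.ElementTree as ET
import rdflib  # third-party: pip install rdflib

# The same sample data in two Web formats.
XML_DATA = """<library>
  <book><title>Dune</title><author>Herbert</author></book>
  <book><title>Hyperion</title><author>Simmons</author></book>
</library>"""

RDF_DATA = """@prefix dc: <http://purl.org/dc/elements/1.1/> .
<urn:book:1> dc:title "Dune" ;     dc:creator "Herbert" .
<urn:book:2> dc:title "Hyperion" ; dc:creator "Simmons" .
"""

# Query type "selection": the titles of all books.
# 1. XPath over XML:
library = ET.fromstring(XML_DATA)
print([title.text for title in library.findall("./book/title")])

# 2. SPARQL over RDF:
graph = rdflib.Graph().parse(data=RDF_DATA, format="turtle")
rows = graph.query("""
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title WHERE { ?book dc:title ?title }
""")
print([str(row.title) for row in rows])
```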

    Data Integration on the (Semantic) Web with Rules and Rich Unification

    For the last decade, a multitude of new data formats for the World Wide Web have been developed, and a huge amount of heterogeneous semi-structured data is flourishing online. With the ever-increasing number of documents on the Web, rules have been identified as the means of choice for reasoning about this data, transforming it, and integrating it. Query languages such as SPARQL and rule languages such as Xcerpt use compound queries that are matched or unified with semi-structured data. This notion of unification differs from the one known from logic programming engines in that (i) it provides constructs that allow queries to be incomplete in several ways, (ii) variables may have different types, (iii) it results in sets of substitutions for the variables in the query instead of a single substitution, and (iv) subsumption between queries is much harder to decide than in logic programming. This thesis abstracts from Xcerpt query term simulation, SPARQL graph pattern matching, and XPath XML document matching, and shows that all of them can be considered a form of rich unification. Given a set of mappings between substitution sets of different languages, this abstraction opens up the possibility of format-versatile querying, i.e. the combination of queries in different formats, or the transformation of one format into another within a single rule. To show the superiority of this approach, this thesis introduces an extension of Xcerpt called Xcrdf and describes use cases for the combined querying and integration of RDF and XML data. With XML being the predominant Web format and RDF the predominant Semantic Web format, Xcrdf extends Xcerpt by a set of RDF query terms and construct terms, including query primitives for RDF containers, collections, and reifications. Moreover, Xcrdf includes an RDF path query language called RPL that is more expressive than previously proposed polynomial-time RDF path query languages, but can still be evaluated in polynomial time combined complexity. Besides introducing this framework for data integration based on rich unification, this thesis extends the theoretical knowledge about Xcerpt in several ways: we show that Xcerpt simulation unification is decidable and give complexity bounds for subsumption in several fragments of Xcerpt query terms. The proof is based on a set of subsumption-monotone query term transformations and is only feasible because of the injectivity requirement on subterms of Xcerpt queries; it gives rise to an algorithm for deciding Xcerpt query term simulation. Moreover, we give a semantics to locally and weakly stratified Xcerpt programs; this semantics is applicable not only to Xcerpt but to any rule language with rich unification, including multi-rule SPARQL programs. Finally, we show how Xcerpt grouping stratification can be reduced to Xcerpt negation stratification, thereby also introducing the notions of local grouping stratification and weak grouping stratification.
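    A toy illustration of two of these differences, assuming nothing about Xcerpt's actual algorithm: the matcher below lets a query term be incomplete (point i), maps query children injectively to data children as the injectivity requirement above demands, and returns a set of substitutions rather than a single most general unifier (point iii).

```python
from itertools import permutations

# Terms are (label, [children]) pairs; strings starting with "?" are variables.

def simulate(query, data):
    """Yield every substitution under which `query` simulates into `data`.
    Query children map injectively to data children, but the query may be
    incomplete: extra data children simply stay unmatched."""
    if isinstance(query, str) and query.startswith("?"):
        yield {query: data}                    # a variable binds the subterm
        return
    if isinstance(query, str) or isinstance(data, str):
        if query == data:                      # plain labels must be equal
            yield {}
        return
    (qlabel, qkids), (dlabel, dkids) = query, data
    if qlabel != dlabel or len(qkids) > len(dkids):
        return
    # Try every injective, order-insensitive assignment of query children
    # to data children; each compatible assignment contributes an answer.
    for targets in permutations(dkids, len(qkids)):
        yield from merge_all(qkids, targets)

def merge_all(qkids, targets):
    """Merge the substitutions of child matches, discarding clashes."""
    if not qkids:
        yield {}
        return
    for head in simulate(qkids[0], targets[0]):
        for rest in merge_all(qkids[1:], targets[1:]):
            merged = dict(head)
            if all(merged.setdefault(var, t) == t for var, t in rest.items()):
                yield merged

data = ("authors", [("name", ["Herbert"]), ("name", ["Simmons"])])
query = ("authors", [("name", ["?who"])])   # incomplete: one child of two
print(list(simulate(query, data)))
# [{'?who': 'Herbert'}, {'?who': 'Simmons'}] -- a set of substitutions
```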

    Efficient Source Selection For SPARQL Endpoint Query Federation

    The Web of Data has grown enormously over the last years. Currently, it comprises a large compendium of linked and distributed datasets from multiple domains. Due to the decentralised architecture of the Web of Data, several of these datasets contain complementary data. Running complex queries on this compendium thus often requires accessing data from different data sources within one query. The abundance of datasets and the need to run complex queries have thus motivated a considerable body of work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. This thesis addresses two key areas of federated SPARQL query processing: (1) efficient source selection, and (2) comprehensive SPARQL benchmarks to test and rank federated SPARQL engines as well as triple stores. Efficient Source Selection: Efficient source selection is one of the most important optimization steps in federated SPARQL query processing. An overestimation of query-relevant data sources increases network traffic, results in irrelevant intermediate results, and can significantly affect the overall query processing time. Previous works have focused on generating optimized query execution plans for fast result retrieval. However, devising source selection approaches that go beyond triple-pattern-wise source selection has not received much attention. Similarly, only little attention has been paid to the effect of duplicated data on federated querying. This thesis presents HiBISCuS and TBSS, novel hypergraph-based source selection approaches, and DAW, a duplicate-aware source selection approach for federated querying over the Web of Data. Each of these approaches can be combined directly with existing SPARQL query federation engines to achieve the same recall while querying fewer data sources. We combined the three source selection approaches (HiBISCuS, DAW, and TBSS) with query rewriting to form a complete SPARQL query federation engine named Quetsal. Furthermore, we present TopFed, a federated query processing engine tailored to The Cancer Genome Atlas (TCGA) that exploits the data distribution to perform intelligent source selection while querying over large TCGA SPARQL endpoints. Finally, we address the issue of rights management and privacy while accessing sensitive resources. To this end, we present SAFE: a global source selection approach that enables decentralised, policy-aware access to sensitive clinical information represented as distributed RDF Data Cubes. Comprehensive SPARQL Benchmarks: Benchmarking is indispensable when aiming to assess technologies with respect to their suitability for given tasks. While several benchmarks and benchmark generation frameworks have been developed to evaluate federated SPARQL engines and triple stores, they mostly provide a one-size-fits-all solution to the benchmarking problem. This approach to benchmarking is, however, unsuitable for evaluating the performance of a triple store for a given application with particular requirements. The fitness of current SPARQL query federation approaches for real applications is difficult to evaluate with current benchmarks, as these are either synthetic or too small in size and complexity. Furthermore, state-of-the-art federated SPARQL benchmarks have mostly focused on a single performance criterion, i.e., the overall query runtime, and thus cannot provide a fine-grained evaluation of the systems.
We address these drawbacks by presenting FEASIBLE, an automatic approach for generating benchmarks out of the query history of applications, i.e., query logs, and LargeRDFBench, a billion-triple benchmark for SPARQL query federation which encompasses real data as well as real queries pertaining to real biomedical use cases. Our evaluation results show that HiBISCuS, TBSS, TopFed, DAW, and SAFE can all significantly reduce the total number of sources selected and thus improve overall query performance. In particular, TBSS is the first source selection approach to keep the overall overestimation of relevant sources below 5%. Quetsal reduces the number of sources selected (without losing recall), the source selection time, and the overall query runtime compared to state-of-the-art federation engines. The LargeRDFBench evaluation results suggest that the performance of current SPARQL query federation systems on simple queries does not reflect their performance on more complex queries. Moreover, current federation systems seem unable to deal with many of the challenges that await them in the age of Big Data. Finally, the FEASIBLE evaluation results show that it generates better sample queries than the state of the art. In addition, the better query selection and the larger set of query types used lead to triple store rankings which partly differ from the rankings generated by previous works.
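    For context, the baseline these approaches improve on is triple-pattern-wise source selection: the federation engine probes every endpoint with a SPARQL ASK query for every triple pattern, which preserves recall but tends to overestimate the set of relevant sources. A hedged sketch of that baseline, with illustrative endpoint URLs and the third-party SPARQLWrapper package:

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

# Illustrative endpoints; any SPARQL endpoints in a federation would do.
ENDPOINTS = [
    "https://dbpedia.org/sparql",
    "https://query.wikidata.org/sparql",
]

def relevant_sources(triple_pattern: str) -> list[str]:
    """Triple-pattern-wise source selection: keep every endpoint whose
    ASK probe answers true. Approaches such as HiBISCuS and TBSS prune
    this set further by reasoning over joins between patterns instead
    of looking at one pattern at a time."""
    sources = []
    for endpoint in ENDPOINTS:
        probe = SPARQLWrapper(endpoint)
        probe.setQuery(f"ASK {{ {triple_pattern} }}")
        probe.setReturnFormat(JSON)
        if probe.query().convert().get("boolean"):
            sources.append(endpoint)
    return sources

# Every triple pattern of a federated query is probed the same way,
# which is why overestimation of relevant sources is common.
print(relevant_sources(
    "?s <http://www.w3.org/2000/01/rdf-schema#label> ?label"))
```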

    FICCS: A Fact Integrity Constraint Checking System


    An infrastructure for the development of Semantic Desktop applications

    The extent to which our daily lives are digitized is continuously growing. Many of our everyday activities manifest themselves in digital form: either explicitly, when we actively use digital information for work or leisure, or implicitly, when information is indirectly created or manipulated as a consequence of our actions. A large fraction of these data volumes can be considered personal information, that is, information that has a certain kind of relationship to us as human beings.
The storage and processing capacity of the devices that we use to interact with these data has increased enormously over the last years, and we can expect this development to continue in the future. However, while the power of physical data storage is permanently increasing, the logical data organization power of personal devices has been stagnating since the invention of the first personal computers. Hierarchical file systems are still the de-facto standard for data organization on personal devices. File systems represent information as a set of discrete data units (files) that are arranged as leaves on a tree of labeled nodes (directories). On the one hand, this structure can be easily understood by humans, since the separation into small information units supports the manual manageability of the personal data space, in comparison to systems that employ continuous data structures. On the other hand, hierarchical structures suffer from a number of deficiencies that negatively affect the quality of personal information management, and they lack expressive mechanisms that would help to improve information retrieval according to user needs. Significant research effort has been invested in improving the mechanisms for personal information management. The resulting works represent potential alternatives or supplements for the systems in place, but sometimes run the risk of over-formalizing information management, a problem that is especially apparent in situations where a non-expert end user is the direct consumer of such services. The contribution of this thesis is an alternative organizational model for the management of personal data that strikes a balance between the unstructured nature of file systems and the highly formal characteristics of logic-based systems. After a comparative analysis of the current situation and recent research efforts in this direction, it describes this organizational metaphor on three levels. First, on a conceptual level, it discusses an abstract data model, a corresponding query algebra, and a set of abstract operations on this data model. This formal framework is suitable for representing common data structures and usage patterns found in personal information management, but at the same time does not enforce a complete paradigm shift away from established systems. Second, on a representation level, it discusses how this model can be efficiently processed, stored, and exchanged between different systems. Third, on an implementation level, it describes how concrete realizations of this data model can be built and used in various application scenarios.
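    A toy sketch of the kind of middle ground the thesis argues for, with all names invented for illustration: items carry loose attributes and non-hierarchical tags instead of a single position in a tree, and the query algebra is a pair of composable set operations.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    name: str
    attrs: dict = field(default_factory=dict)  # loose key/value annotations
    tags: set = field(default_factory=set)     # non-hierarchical grouping

def select(items, **conditions):
    """Selection operator of the toy algebra: keep items whose
    attributes satisfy every key/value condition."""
    return [i for i in items
            if all(i.attrs.get(k) == v for k, v in conditions.items())]

def tagged(items, tag):
    """Tag operator: because tags are sets, one item may live in many
    'places' at once, which a tree-shaped file system cannot express."""
    return [i for i in items if tag in i.tags]

desk = [
    Item("thesis.pdf", {"author": "me", "year": 2010}, {"work", "writing"}),
    Item("trip.jpg", {"year": 2010}, {"private", "photos"}),
    Item("slides.ppt", {"author": "me"}, {"work"}),
]

# The operators compose, unlike a fixed directory path:
print([i.name for i in tagged(select(desk, author="me"), "work")])
# ['thesis.pdf', 'slides.ppt']
```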

    Studies related to the process of program development

    The submitted work consists of a collection of publications arising from research carried out at Rhodes University (1970-1980) and at Heriot-Watt University (1980-1992). The theme of this research is the process of program development, i.e. the process of creating a computer program to solve some particular problem. The papers presented cover a number of different topics relating to this process, viz. (a) programming methodology, in particular aspects of structured programming; (b) properties of programming languages; (c) formal specification of programming languages; (d) compiler techniques; (e) declarative programming languages; (f) program development aids; (g) automatic program generation; (h) databases; (i) algorithms and applications.