
    Partout: A Distributed Engine for Efficient RDF Processing

    The increasing interest in Semantic Web technologies has led not only to a rapid growth of semantic data on the Web but also to an increasing number of backend applications with already more than a trillion triples in some cases. Confronted with such huge amounts of data and the future growth, existing state-of-the-art systems for storing RDF and processing SPARQL queries are no longer sufficient. In this paper, we introduce Partout, a distributed engine for efficient RDF processing in a cluster of machines. We propose an effective approach for fragmenting RDF data sets based on a query log, allocating the fragments to nodes in a cluster, and finding the optimal configuration. Partout can efficiently handle updates and its query optimizer produces efficient query execution plans for ad-hoc SPARQL queries. Our experiments show the superiority of our approach to state-of-the-art approaches for partitioning and distributed SPARQL query processing

    P-LUPOSDATE: Using Precomputed Bloom Filters to Speed Up SPARQL Processing in the Cloud

    Increasingly, data on the Web is stored in the form of Semantic Web data. Because of today's information overload, it becomes very important to store and query these big datasets in a scalable way and hence in a distributed fashion. Cloud Computing offers such a distributed environment with dynamic reallocation of computing and storage resources based on needs. In this work we introduce a scalable distributed Semantic Web database in the Cloud. In order to reduce the number of (unnecessary) intermediate results early, we apply Bloom filters. Instead of computing Bloom filters during query processing, a time-consuming task as traditionally done, we precompute the Bloom filters as much as possible and store them in the indices alongside the data. The experimental results with data sets of up to 1 billion triples show that our approach speeds up query processing significantly and sometimes even reduces the processing time to less than half.
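The Bloom-filter idea in the abstract above can be illustrated with a minimal sketch. It is not the P-LUPOSDATE implementation: the `BloomFilter` class, the helper names `build_filter` and `prefilter`, the filter size, and the sample triples are all assumptions made for illustration. The filter is built once, ahead of query time, over the join values of one triple pattern and is then used to discard triples of another pattern that cannot possibly join.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_filter(triples, position):
    """Precompute a filter over one component (0=subject, 1=predicate, 2=object)."""
    bloom = BloomFilter()
    for triple in triples:
        bloom.add(triple[position])
    return bloom


def prefilter(triples, position, bloom):
    """Keep only triples whose join value may appear on the other side of the join."""
    return [t for t in triples if bloom.might_contain(t[position])]


# Hypothetical data: join works_at.object with located_in.subject.
works_at = [("alice", "worksAt", "acme"), ("bob", "worksAt", "initech")]
located_in = [("acme", "locatedIn", "berlin"), ("globex", "locatedIn", "paris")]

subject_filter = build_filter(located_in, position=0)  # done ahead of query time
candidates = prefilter(works_at, position=2, bloom=subject_filter)
```

A Bloom filter never produces false negatives, so every triple that really joins survives the prefilter. Triples that do not join are usually dropped, but an occasional false positive can slip through and is removed later by the actual join.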

    Data Aggregation through Web Service Composition in Smart Camera Networks

    Distributed Smart Camera (DSC) networks are power-constrained, real-time distributed embedded systems that perform computer vision using multiple cameras. Providing data aggregation techniques, which are critical for running complex image processing algorithms on DSCs, is a challenging task due to the complexity of video and image data. Providing highly desirable SQL APIs for sophisticated query processing in DSC networks is also challenging for similar reasons. Research on DSCs to date has not addressed these two problems. In this thesis, we develop a novel SOA-based middleware framework on a DSC network that uses Distributed OSGi to expose DSC network services as web services. We also develop a novel web service composition scheme that aids in data aggregation and a SQL query interface for DSC networks that allows sophisticated query processing. We validate our service orchestration concept for data aggregation by providing a query primitive for face detection in a smart camera network.

    Efficient Subgraph Matching on Billion Node Graphs

    The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data. Comment: VLDB201

    Query-driven indexing in large-scale distributed systems

    Efficient and effective search in large-scale data repositories requires complex indexing solutions deployed on a large number of servers. Web search engines such as Google and Yahoo! already rely upon complex systems to be able to return relevant query results and keep processing times within the comfortable sub-second limit. Nevertheless, the exponential growth of the amount of content on the Web poses serious challenges with respect to scalability. Coping with these challenges requires novel indexing solutions that not only remain scalable but also preserve the search accuracy. In this thesis we introduce and explore the concept of query-driven indexing – an index construction strategy that uses caching techniques to adapt to the querying patterns expressed by users. We suggest to abandon the strict difference between indexing and caching, and to build a distributed indexing structure, or a distributed cache, such that it is optimized for the current query load. Our experimental and theoretical analysis shows that employing query-driven indexing is especially beneficial when the content is (geographically) distributed in a Peer-to-Peer network. In such a setting extensive bandwidth consumption has been identified as one of the major obstacles for efficient large-scale search. Our indexing mechanisms combat this problem by maintaining the query popularity statistics and by indexing (caching) intermediate query results that are requested frequently. We present several indexing strategies for processing multi-keyword and XPath queries over distributed collections of textual and XML documents respectively. Experimental evaluations show significant overall traffic reduction compared to the state-of-the-art approaches. We also study possible query-driven optimizations for Web search engine architectures. Contrary to the Peer-to-Peer setting, Web search engines use centralized caching of query results to reduce the processing load on the main index. 
We analyze real search engine query logs and show that the changes in query traffic that such a results cache induces fundamentally affect indexing performance. In particular, we study its impact on index pruning efficiency. We show that the combination of both techniques enables an efficient reduction of query processing costs and is thus practical to use in Web search engines.
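The core idea of query-driven indexing described above, keeping query popularity statistics and indexing (caching) results that are requested frequently, can be sketched as follows. This is a simplified single-process illustration under stated assumptions, not the thesis's distributed Peer-to-Peer mechanism: the class name `QueryDrivenIndex`, the `threshold` parameter, and the toy `evaluate` function are all hypothetical.

```python
from collections import Counter


class QueryDrivenIndex:
    """Index a query's result only once the query has proven popular (illustrative only)."""

    def __init__(self, evaluate, threshold=2):
        self.evaluate = evaluate      # expensive fallback, e.g. a network-wide search
        self.threshold = threshold    # requests needed before a result is indexed
        self.popularity = Counter()   # query popularity statistics
        self.index = {}               # indexed (cached) results of popular queries

    def query(self, keywords):
        key = frozenset(keywords)     # keyword order does not matter
        self.popularity[key] += 1
        if key in self.index:
            return self.index[key]
        result = self.evaluate(key)
        if self.popularity[key] >= self.threshold:
            self.index[key] = result
        return result


# Hypothetical document collection and a counter of expensive evaluations.
documents = {
    "d1": {"rdf", "sparql", "cluster"},
    "d2": {"rdf", "bloom", "filter"},
    "d3": {"graph", "matching"},
}
expensive_calls = []


def evaluate(keywords):
    expensive_calls.append(keywords)
    return sorted(doc for doc, terms in documents.items() if keywords <= terms)


qdi = QueryDrivenIndex(evaluate, threshold=2)
first = qdi.query(["rdf", "sparql"])    # evaluated, not yet popular
second = qdi.query(["sparql", "rdf"])   # evaluated again, now indexed
third = qdi.query(["rdf", "sparql"])    # served from the index
```

With a threshold of 2, the third request is answered from the index without calling `evaluate` again, which is the kind of saving the abstract attributes to indexing popular intermediate results.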

    Towards Making Distributed RDF processing FLINker

    In the last decade, the Resource Description Framework (RDF) has become the de facto standard for publishing semantic data on the Web. This steady adoption has led to a significant increase in the number and volume of available RDF datasets, exceeding the capabilities of traditional RDF stores. This scenario has introduced severe big semantic data challenges when it comes to managing and querying RDF data at Web scale. Despite the existence of various off-the-shelf Big Data platforms, processing RDF in a distributed environment remains a significant challenge. In this position paper, based on an in-depth analysis of the state of the art, we propose to manage large RDF datasets in Flink, a well-known scalable distributed Big Data processing framework. Our approach, which we refer to as FLINKer, extends the native graph abstraction of Flink, called Gelly, with RDF graph and SPARQL query processing capabilities.

    Web platform for learning distributed databases’ queries processing

    A distributed database is a collection of data stored in different locations of a distributed system. The processing of queries in distributed databases is quite complex but of great importance for information management. Students who have to learn this process have serious difficulty understanding it. In this work we present a web platform to help students learn the processing and optimization of queries in distributed databases. The novelty of this platform is that, as far as we know, there is no similar graphical tool. It allows users to visualize, step by step, the different phases of distributed query processing and how each phase is formed, making it easier for students to understand these concepts. Moreover, having this web platform available always and everywhere indirectly has an impact on other competences, such as encouraging students' autonomous work and self-learning, adapting the teaching to students' needs at a given time, and reinforcing the advantages of applying information technologies in teaching. The results of the tests carried out to validate the platform's functionality and student satisfaction were very positive. This work has been developed thanks to the funding of the project PID46-201617 of the Universidad de Jaén.
