8,173 research outputs found

    Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1

    Get PDF
    We propose an efficient and scalable architecture for processing generalized graph-pattern queries as they are specified by the current W3C recommendation of the SPARQL 1.1 "Query Language" component. Specifically, the class of queries we consider consists of sets of SPARQL triple patterns with labeled property paths. From a relational perspective, this class resolves to conjunctive queries of relational joins with additional graph-reachability predicates. For the scalable, i.e., distributed, processing of this kind of queries over very large RDF collections, we develop a suitable partitioning and indexing scheme, which allows us to shard the RDF triples over an entire cluster of compute nodes and to process an incoming SPARQL query over all of the relevant graph partitions (and thus compute nodes) in parallel. Unlike most prior works in this field, we specifically aim at the unified optimization and distributed processing of queries consisting of both relational joins and graph-reachability predicates. All communication among the compute nodes is established via a proprietary, asynchronous communication protocol based on the Message Passing Interface

    Graph Summarization

    Full text link
    The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. The focus of our chapter is to pinpoint the main graph summarization methods, but especially to focus on the most recent approaches and novel research trends on this topic, not yet covered by previous surveys.Comment: To appear in the Encyclopedia of Big Data Technologie

    Finding paths in sparse random graphs requires many queries

    Full text link
    We discuss a new algorithmic type of problem in random graphs studying the minimum number of queries one has to ask about adjacency between pairs of vertices of a random graph GG(n,p)G\sim {\mathcal G}(n,p) in order to find a subgraph which possesses some target property with high probability. In this paper we focus on finding long paths in GG(n,p)G\sim \mathcal G(n,p) when p=1+εnp=\frac{1+\varepsilon}{n} for some fixed constant ε>0\varepsilon>0. This random graph is known to have typically linearly long paths. To have \ell edges with high probability in GG(n,p)G\sim \mathcal G(n,p) one clearly needs to query at least Ω(p)\Omega\left(\frac{\ell}{p}\right) pairs of vertices. Can we find a path of length \ell economically, i.e., by querying roughly that many pairs? We argue that this is not possible and one needs to query significantly more pairs. We prove that any randomised algorithm which finds a path of length =Ω(log(1ε)ε)\ell=\Omega\left(\frac{\log\left(\frac{1}{\varepsilon}\right)}{\varepsilon}\right) with at least constant probability in GG(n,p)G\sim \mathcal G(n,p) with p=1+εnp=\frac{1+\varepsilon}{n} must query at least Ω(pεlog(1ε))\Omega\left(\frac{\ell}{p\varepsilon \log\left(\frac{1}{\varepsilon}\right)}\right) pairs of vertices. This is tight up to the log(1ε)\log\left(\frac{1}{\varepsilon}\right) factor.Comment: 14 page

    Performance Guarantees for Distributed Reachability Queries

    Get PDF
    In the real world a graph is often fragmented and distributed across different sites. This highlights the need for evaluating queries on distributed graphs. This paper proposes distributed evaluation algorithms for three classes of queries: reachability for determining whether one node can reach another, bounded reachability for deciding whether there exists a path of a bounded length between a pair of nodes, and regular reachability for checking whether there exists a path connecting two nodes such that the node labels on the path form a string in a given regular expression. We develop these algorithms based on partial evaluation, to explore parallel computation. When evaluating a query Q on a distributed graph G, we show that these algorithms possess the following performance guarantees, no matter how G is fragmented and distributed: (1) each site is visited only once; (2) the total network traffic is determined by the size of Q and the fragmentation of G, independent of the size of G; and (3) the response time is decided by the largest fragment of G rather than the entire G. In addition, we show that these algorithms can be readily implemented in the MapReduce framework. Using synthetic and real-life data, we experimentally verify that these algorithms are scalable on large graphs, regardless of how the graphs are distributed.Comment: VLDB201

    Reasoning & Querying – State of the Art

    Get PDF
    Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet where keyword search is used in many applications, e.g. search engines, has familiarized casual users with using keyword queries to retrieve information on the internet. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming at enabling simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF
    corecore