Space-Query Tradeoffs in Range Subgraph Counting and Listing
This paper initiates the study of range subgraph counting and range subgraph listing, both of which are motivated by the significant demands in practice to perform graph analytics on subgraphs pertinent to only selected, as opposed to all, vertices. In the first problem, there is an undirected graph G where each vertex carries a real-valued attribute. Given an interval q and a pattern Q, a query counts the number of occurrences of Q in the subgraph of G induced by the vertices whose attributes fall in q. The second problem has the same setup except that a query needs to enumerate (rather than count) those occurrences with a small delay. In both problems, our goal is to understand the tradeoff between space usage and query cost, or more specifically: (i) given a target on query efficiency, how much pre-computed information about G must we store? (ii) Or conversely, given a budget on space usage, what is the best query time we can hope for? We establish a suite of upper- and lower-bound results on such tradeoffs for various query patterns.
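To make the query definition concrete, here is a minimal sketch of a range subgraph counting query for the special case where the pattern Q is a triangle. It is a naive baseline that stores nothing beyond the graph and scans it at query time, not one of the paper's data structures. The function name `range_triangle_count` and the dictionary-based inputs are assumptions made for illustration.

```python
from itertools import combinations


def range_triangle_count(edges, attr, lo, hi):
    """Count triangles in the subgraph induced by vertices v with lo <= attr[v] <= hi.

    Hypothetical baseline: no pre-computed index, the whole graph is scanned per query.
    Vertex labels are assumed to be mutually comparable (e.g. integers).
    """
    # Vertices whose attribute falls inside the query interval q = [lo, hi].
    keep = {v for v, a in attr.items() if lo <= a <= hi}

    # Adjacency sets of the induced subgraph.
    adj = {v: set() for v in keep}
    for u, v in edges:
        if u in keep and v in keep and u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Count each triangle once, at its smallest vertex.
    count = 0
    for u in adj:
        larger = sorted(x for x in adj[u] if x > u)
        for v, w in combinations(larger, 2):
            if w in adj[v]:
                count += 1
    return count


edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
attr = {1: 0.1, 2: 0.5, 3: 0.9, 4: 1.5}
print(range_triangle_count(edges, attr, 0.0, 1.0))
```

This baseline sits at one extreme of the tradeoff the abstract describes: minimal space, with query time that depends on the size of the induced subgraph.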
GRAPHiQL: A graph intuitive query language for relational databases
Graph analytics is becoming increasingly popular, driving many important business applications from social network analysis to machine learning. Since most graph data is collected in a relational database, it seems natural to attempt to perform graph analytics within the relational environment. However, SQL, the query language for relational databases, makes it difficult to express graph analytics operations. This is because SQL requires programmers to think in terms of tables and joins, rather than the more natural representation of graphs as collections of nodes and edges. As a result, even relatively simple graph operations can require very complex SQL queries. In this paper, we present GRAPHiQL, an intuitive query language for graph analytics, which allows developers to reason in terms of nodes and edges. GRAPHiQL provides key graph constructs such as looping, recursion, and neighborhood operations. At runtime, GRAPHiQL compiles graph programs into efficient SQL queries that can run on any relational database. We demonstrate the applicability of GRAPHiQL on several applications and compare the performance of GRAPHiQL queries with those of Apache Giraph (a popular 'vertex-centric' graph programming language).
Efficient External-Memory Algorithms for Graph Mining
The explosion of big data in areas like the web and social networks has posed big challenges to research activities, including data mining, information retrieval, security, etc. This dissertation focuses on a particular area, graph mining, and specifically proposes several novel algorithms to solve the problems of triangle listing and computation of the neighborhood function in large-scale graphs.
We first study the classic problem of triangle listing. We generalize the existing in-memory algorithms into a single framework of 18 triangle-search techniques. We then develop a novel external-memory approach, which we call Pruned Companion Files (PCF), that supports disk operation of all 18 algorithms. When compared to state-of-the-art available implementations MGT and PDTL, PCF runs 5-10 times faster and exhibits orders of magnitude less I/O.
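As a point of reference for the in-memory setting, the following is a minimal sketch of one standard triangle-listing technique: orient every edge from the lower-ranked to the higher-ranked endpoint (ranked here by degree) and intersect out-neighbor sets. It is not PCF and is not claimed to be one of the 18 techniques in the dissertation's framework. The function name `list_triangles` is an assumption.

```python
def list_triangles(edges):
    """List each triangle of an undirected graph exactly once.

    Illustrative in-memory sketch using a degree-based vertex order;
    vertex labels are assumed to be mutually comparable for tie-breaking.
    """
    # Build undirected adjacency sets, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Rank vertices by (degree, label) so that edges can be oriented low -> high.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    # A triangle with ranks a < b < c is reported only from the edge (a, b).
    triangles = []
    for u in adj:
        for v in out[u]:
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return triangles


print(sorted(list_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)])))
```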
We next focus on the I/O complexity of triangle listing. Recent work by Pagh et al. provides an appealing theoretical I/O complexity for triangle listing via graph partitioning by random coloring of nodes. Since no implementation of Pagh is available and little is known about the comparison between Pagh and PCF, we carefully implement Pagh, undertake an investigation into the properties of these algorithms, model their I/O cost, understand their shortcomings, and shed light on the conditions under which each method defeats the other. This insight leads us to develop a novel framework we call Trigon that surpasses the I/O performance of both techniques in all graphs and under all RAM conditions.
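The idea of partitioning by random coloring can be sketched as follows: color each node at random, bucket edges by the color pair of their endpoints, and solve one subproblem per color triple using only the buckets that triple needs. This is a simplified in-memory illustration of the partitioning idea only. It does not model I/O, and it is not the algorithm of Pagh et al. or Trigon as implemented in the dissertation. The function name `triangles_by_coloring` and its parameters are assumptions.

```python
import random
from itertools import combinations_with_replacement


def triangles_by_coloring(edges, num_colors, seed=0):
    """List triangles by splitting the edge set into color-pair buckets.

    Illustrative sketch: each subproblem touches at most three edge buckets,
    which is what would let it fit in memory in an external-memory setting.
    """
    rng = random.Random(seed)
    vertices = sorted({v for e in edges for v in e})
    color = {v: rng.randrange(num_colors) for v in vertices}

    # Bucket each edge by the sorted pair of its endpoint colors.
    buckets = {}
    for u, v in edges:
        if u == v:
            continue
        key = tuple(sorted((color[u], color[v])))
        buckets.setdefault(key, set()).add((min(u, v), max(u, v)))

    found = set()
    for triple in combinations_with_replacement(range(num_colors), 3):
        a, b, c = triple
        # Only these buckets can contain edges of a triangle colored (a, b, c).
        sub = set()
        for key in {(a, b), (a, c), (b, c)}:
            sub |= buckets.get(key, set())

        adj = {}
        for u, v in sub:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

        for u, v in sub:
            for w in adj[u] & adj[v]:
                tri = tuple(sorted((u, v, w)))
                # Keep a triangle only in the subproblem matching its colors.
                if tuple(sorted(color[x] for x in tri)) == triple:
                    found.add(tri)
    return sorted(found)


print(triangles_by_coloring([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)], num_colors=3))
```

The set of triangles returned does not depend on the random coloring; only the sizes of the subproblems do.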
We finally turn our attention to the neighborhood function. Exact computation of the neighborhood function is expensive in terms of CPU and I/O cost. Previous work mostly focuses on approximations. We show that our novel techniques developed for triangle listing can also be applied to this problem. We next study an application of the neighborhood function to ranking of Internet hosts. Our method computes neighborhood functions for each host as an indication of its reputation. The evaluation shows that our method is robust to ranking manipulation and brings less spam to its top ranking list compared to PageRank and TrustRank.
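To illustrate why exact computation is expensive, here is a minimal sketch that computes a neighborhood function by running one breadth-first search per node. The abstract does not give a definition, so the sketch assumes the common one: N(h) is the number of ordered pairs (u, v), including u = v, with v reachable from u in at most h hops. The function name `neighborhood_function` and the adjacency-dictionary input are also assumptions, and this is not the dissertation's method.

```python
from collections import deque


def neighborhood_function(adj, max_h):
    """Return [N(0), N(1), ..., N(max_h)] under the assumed definition above.

    adj maps every vertex to an iterable of its neighbors.
    One truncated BFS per vertex, so the cost grows with nodes times edges.
    """
    exact = [0] * (max_h + 1)  # exact[d] = number of ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_h:
                continue  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            exact[d] += 1

    # Prefix sums turn "exactly d hops" into "at most h hops".
    result, total = [], 0
    for c in exact:
        total += c
        result.append(total)
    return result


path = {1: [2], 2: [1, 3], 3: [2]}
print(neighborhood_function(path, 2))
```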
Unbalanced Triangle Detection and Enumeration Hardness for Unions of Conjunctive Queries
We study the enumeration of answers to Unions of Conjunctive Queries (UCQs)
with optimal time guarantees. More precisely, we wish to identify the queries
that can be solved with linear preprocessing time and constant delay. Despite
the basic nature of this problem, it was shown only recently that UCQs can be
solved within these time bounds if they admit free-connex union extensions,
even if all individual CQs in the union are intractable with respect to the
same complexity measure. Our goal is to understand whether there exist
additional tractable UCQs, not covered by the currently known algorithms. As a
first step, we show that some previously unclassified UCQs are hard using the
classic 3SUM hypothesis, via a known reduction from 3SUM to triangle listing in
graphs. As a second step, we identify a question about a variant of this graph
task which is unavoidable if we want to classify all self-join-free UCQs: is it
possible to decide the existence of a triangle in a vertex-unbalanced
tripartite graph in linear time? We prove that this task is equivalent in
hardness to some family of UCQs. Finally, we show a dichotomy for unions of two
self-join-free CQs if we assume the answer to this question is negative. Our
conclusion is that, to reason about a class of enumeration problems defined by
UCQs, it is enough to study the single decision problem of detecting triangles
in unbalanced graphs. Without a breakthrough for triangle detection, we have no
hope to find an efficient algorithm for additional unions of two self-join-free
CQs. On the other hand, if we one day have such a triangle detection
algorithm, we will immediately obtain an efficient algorithm for a family of
UCQs that are currently not known to be tractable.
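For concreteness, the decision problem at the center of this abstract can be sketched as follows: given a tripartite graph with parts A, B and C, decide whether some a, b, c are pairwise adjacent. The code below is a straightforward baseline that checks every A-B edge against the C-neighborhoods of its endpoints. It is not linear-time in general and says nothing about the open question the abstract raises. The function name `has_tripartite_triangle` and the three-edge-list input format are assumptions.

```python
def has_tripartite_triangle(ab_edges, bc_edges, ac_edges):
    """Decide whether a tripartite graph contains a triangle (a, b, c).

    Baseline sketch: edges are given as three lists of pairs,
    (a, b) in ab_edges, (b, c) in bc_edges, (a, c) in ac_edges.
    """
    # C-neighbors of each vertex in A and in B.
    c_of_a = {}
    for a, c in ac_edges:
        c_of_a.setdefault(a, set()).add(c)
    c_of_b = {}
    for b, c in bc_edges:
        c_of_b.setdefault(b, set()).add(c)

    # An A-B edge closes a triangle iff its endpoints share a C-neighbor.
    for a, b in ab_edges:
        if not c_of_a.get(a, set()).isdisjoint(c_of_b.get(b, set())):
            return True
    return False


print(has_tripartite_triangle([("a1", "b1")], [("b1", "c1")], [("a1", "c1")]))
```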