TopCom: Index for Shortest Distance Query in Directed Graph
Finding shortest distance between two vertices in a graph is an important
problem due to its numerous applications in diverse domains, including
geo-spatial databases, social network analysis, and information retrieval.
Classical algorithms (such as Dijkstra's) solve this problem in polynomial time,
but these algorithms cannot provide real-time response for a large number of
bursty queries on a large graph. So, indexing-based solutions that pre-process
the graph to efficiently answer (exactly or approximately) a large number
of distance queries in real time are becoming increasingly popular. Existing
solutions have varying performance in terms of index size, index building time,
query time, and accuracy. In this work, we propose TopCom, a novel
indexing-based solution for exactly answering distance queries. Our experiments
with two of the existing state-of-the-art methods (IS-Label and TreeMap) show
the superiority of TopCom over these two methods considering scalability and
query time. Besides, indexing of TopCom exploits the DAG (directed acyclic
graph) structure in the graph, which makes it significantly faster than the
existing methods if the SCCs (strongly connected components) of the input graph
are relatively small.
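The DAG structure mentioned in the abstract above can be illustrated by condensing a directed graph into its strongly connected components. The following is only a minimal sketch of that generic preprocessing step (Kosaraju's two-pass algorithm), not TopCom's index; the function name `condensation` and the adjacency-dict input format are assumptions made for illustration.

```python
def condensation(adj):
    """Return (comp, dag) for a directed graph.

    adj:  dict mapping every vertex to a list of its out-neighbours
          (hypothetical input format, chosen for this sketch).
    comp: dict mapping each vertex to a representative of its SCC.
    dag:  dict mapping each SCC representative to the set of SCCs it points to.
    """
    # Pass 1: iterative DFS, recording vertices in order of completion.
    order, seen = [], set()
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                # All out-neighbours explored: v is finished.
                order.append(v)
                stack.pop()

    # Build the reversed graph.
    radj = {v: [] for v in adj}
    for v, ws in adj.items():
        for w in ws:
            radj[w].append(v)

    # Pass 2: DFS on the reversed graph in reverse finishing order;
    # each search tree is exactly one SCC.
    comp = {}
    for s in reversed(order):
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if w not in comp:
                    comp[w] = s
                    stack.append(w)

    # Edges between different SCCs form a directed acyclic graph.
    dag = {c: set() for c in set(comp.values())}
    for v, ws in adj.items():
        for w in ws:
            if comp[v] != comp[w]:
                dag[comp[v]].add(comp[w])
    return comp, dag
```

When the components are small, most of the remaining work happens on the acyclic condensed graph, which is the situation the abstract describes as favourable.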
A framework for identifying genotypic information from clinical records: exploiting integrated ontology structures to transfer annotations between ICD codes and Gene Ontologies
Although some methods have been proposed for automatic ontology generation, none of them addresses the integration of large-scale heterogeneous biomedical ontologies. We propose a novel approach for integrating various types of ontologies efficiently and apply it to integrate the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9CM) and Gene Ontologies (GO). This approach is one of the early attempts to quantify the associations among clinical terms (e.g. ICD9 codes) based on their corresponding genomic relationships. We reconstructed a merged tree for a partial set of GO and ICD9 codes and measured the performance of this tree in terms of the relevance of its associations by comparing them with two well-known disease-gene datasets (i.e. MalaCards and Disease Ontology). Furthermore, we compared the genomic-based ICD9 associations to temporal relationships between them from electronic health records. Our analysis shows promising associations supported by both comparisons, suggesting high reliability. We also manually analyzed several significant associations and found promising support in the literature.
Efficient Computation of Distance Labeling for Decremental Updates in Large Dynamic Graphs
Since today's real-world graphs, such as social network graphs, are evolving all the time, it is of great importance to perform graph computations and analysis on these dynamic graphs. Because many applications, such as social network link analysis in the presence of inactive users, need to handle failed links or nodes, decremental computation and maintenance for graphs is considered a challenging problem. Shortest path computation is one of the most fundamental operations for managing and analyzing large graphs. A number of indexing methods have been proposed to answer distance queries in static graphs. Unfortunately, there is little work on answering such queries for dynamic graphs. In this paper, we focus on the problem of computing the shortest path distance in dynamic graphs, particularly under decremental updates (i.e., edge deletions). We propose maintenance algorithms based on distance labeling, which can handle decremental updates efficiently. By exploiting properties of distance labeling in original graphs, we are able to efficiently maintain distance labeling for new graphs. We experimentally evaluate our algorithms using eleven real-world large graphs and confirm the effectiveness and efficiency of our approach. More specifically, our method can speed up index re-computation by up to an order of magnitude compared with the state-of-the-art method, Pruned Landmark Labeling (PLL).
Fast Exact Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling
We propose a new exact method for shortest-path distance queries on
large-scale networks. Our method precomputes distance labels for vertices by
performing a breadth-first search from every vertex. Seemingly too obvious and
too inefficient at first glance, the key ingredient introduced here is pruning
during breadth-first searches. While we can still answer the correct distance
for any pair of vertices from the labels, it surprisingly reduces the search
space and sizes of labels. Moreover, we show that we can perform 32 or 64
breadth-first searches simultaneously exploiting bitwise operations. We
experimentally demonstrate that the combination of these two techniques is
efficient and robust on various kinds of large-scale real-world networks. In
particular, our method can handle social networks and web graphs with hundreds
of millions of edges, which are two orders of magnitude larger than the limits
of previous exact methods, with comparable query time to those of previous
methods. Comment: To appear in SIGMOD 201
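The pruning idea described in the abstract above can be sketched as follows. This is a minimal illustration for unweighted, undirected graphs, assuming a degree-based vertex ordering and dictionary-based labels; the function names `build_labels` and `query` are hypothetical, and the bit-parallel BFS optimization mentioned in the abstract is not included.

```python
from collections import deque


def query(labels, s, t):
    """Distance from s to t via the best common hub, or infinity if none."""
    a, b = labels[s], labels[t]
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller label
    return min((d + b[h] for h, d in a.items() if h in b),
               default=float("inf"))


def build_labels(adj):
    """Build 2-hop distance labels with pruned breadth-first searches.

    adj: dict mapping every vertex to a list of neighbours
         (undirected, unweighted; an assumed input format).
    """
    # Assumption: process high-degree vertices first, a common heuristic.
    order = sorted(adj, key=lambda v: -len(adj[v]))
    labels = {v: {} for v in adj}  # labels[v][hub] = dist(hub, v)
    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: earlier hubs already certify a distance <= d,
            # so neither u nor anything beyond it needs this root.
            if query(labels, root, u) <= d:
                continue
            labels[u][root] = d
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels
```

A call such as `query(build_labels(adj), s, t)` then answers a distance query by scanning only the two labels, without touching the graph.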
K-Reach: Who is in Your Small World
We study the problem of answering k-hop reachability queries in a directed
graph, i.e., whether there exists a directed path of length k, from a source
query vertex to a target query vertex in the input graph. The k-hop
reachability problem generalizes classic reachability (the case where
k=infinity). Existing indexes for processing classic reachability queries, as
well as for processing shortest path queries, are not applicable or not
efficient for processing k-hop reachability queries. We propose an index for
processing k-hop reachability queries, which is simple in design and efficient
to construct. Our experimental results on a wide range of real datasets show
that our index is more efficient than the state-of-the-art indexes even for
processing classic reachability queries, for which these indexes are primarily
designed. We also show that our index is efficient in answering k-hop
reachability queries. Comment: VLDB201
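To make the query semantics concrete, the following sketch answers a k-hop reachability query with a depth-bounded breadth-first search. It is an index-free baseline, not the K-Reach index itself, and it assumes the query asks for a path of at most k edges; the function name `k_hop_reachable` and the adjacency-dict format are hypothetical.

```python
from collections import deque


def k_hop_reachable(adj, s, t, k):
    """Return True if t can be reached from s using at most k edges.

    adj: dict mapping a vertex to a list of its out-neighbours
         (assumed input format for this sketch).
    k:   hop bound; float("inf") gives classic reachability.
    """
    if s == t:
        return True
    depth = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if depth[u] == k:
            continue  # hop budget exhausted along this branch
        for w in adj.get(u, ()):
            if w not in depth:
                if w == t:
                    return True
                depth[w] = depth[u] + 1
                queue.append(w)
    return False
```

Each query costs a traversal of the k-hop neighbourhood of the source, which is the per-query cost an index is meant to avoid.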
Joint search by social and spatial proximity
Ministry of Education, Singapore under its Academic Research Funding Tier
On the Evaluation of Pattern Match Queries in Large Graph Databases
Recently, graph databases have received much attention in the research community due to their extensive practical applications, such as social networks, biological networks, and the World Wide Web, which give rise to many challenging data management problems, including subgraph search, shortest-path queries, reachability verification, and pattern matching. Among them, graph pattern matching is to find all matches in a data graph for a given pattern graph, and it is more general and flexible than the other problems mentioned above. In this thesis, we address a kind of graph matching, the so-called pattern matching with δ, by which an edge in the pattern graph is allowed to match a path of length ≤ δ in the data graph. In order to reduce the search space when exploring to find matches, we propose a novel pruning algorithm to eliminate all unqualified vertices. We also propose a strategy to speed up the distance-based join over two lists of vertices. Extensive experiments have been conducted, which show that our approach makes great improvements in running time compared to existing ones. Master of Science in Applied Computer Scienc
The Graph Pattern Matching Problem through Parameterized Matching
We propose a new approach to solve graph isomorphism using parameterized matching. Parameterized matching is a string matching problem where two strings parameterized-match if there exists a bijective function, on the symbols of the alphabet, that maps one of the strings into the other. Given that parameterized matching is defined for linear structures, we define the concept of graph linearization to represent the topology of a graph as a walk on it. Then, our approach to determine whether two graphs are isomorphic consists of determining whether there exists a walk in one of the graphs that parameterized-matches a linearization of the other graph. Our solution has two main steps: linearization and matching. We develop an efficient linearization algorithm, that generates short linearizations with an approximation guarantee, and develop a graph matching algorithm. We show that this solution also works for subgraph isomorphism, which is the problem of determining whether an input graph H is isomorphic to a subgraph of another input graph G. We evaluate our approach experimentally on graphs of different types and sizes, and compare to the performance of VF2, which is a prominent algorithm for graph isomorphism. Our empirical measurements show that graph linearization finds a matching graph faster than VF2 in many cases, especially in Miyazaki-constructed graphs which are known to be one of the hardest cases for graph isomorphism algorithms. We extend this approach to query attributed graphs. An attributed graph is a graph data structure, in which nodes and edges may have identifiers, types and other attributes. Attributed graphs are used in many application domains, for example to model social networks in which nodes represent people, photos, and postings and edges represent friendship, person-tagged-in-photo and mentioned-in-post relationships. Queries are used to extract information from such graphs. 
Several graph queries are expressed as graph pattern matching, which is the problem of finding all instances of pattern match query P in a larger attributed graph G. A pattern match query may specify both a graph structure and predicates on the attributes of the graph elements. Clearly, this problem is related to subgraph isomorphism. Furthermore, we define a more general class of graph queries called generalized pattern queries on attributed multigraphs. The goal of this class is to find paths and subgraphs that satisfy query reachability and predicates. The query language is expressive: It allows (i) using regular expression operators (e.g., Kleene star and union); (ii) specifying structural predicates on graph nodes and edges; and (iii) using attribute predicates on nodes and edges. Pattern match queries, reachability queries, their combination, and even more queries can be expressed through generalized pattern queries. We use our approach to solve this new type of query. The proposed technique has two phases. First, the query is linearized, i.e., represented as a graph walk that covers all nodes and edges. There are several linearizations for a given query; we derive heuristics to produce a good linearization that is short and places selective predicates early in the linearization. Second, we search for a bijective function that maps each element of the query to an element of the attributed multigraph that satisfies the reachability requirements and the predicates. Specifically, we develop an algorithm that matches the linearization by traversing the attributed graph in a manner similar to a breadth first traversal constrained by the linearization. We evaluate our solution experimentally using a real graph (the DBLP citation network) to assess its practicality and efficiency. 
Our results show that our techniques and optimizations are effective in querying attributed graphs, offering several factors of reduction in query response time when graph statistics are utilized. Summary. This thesis proposes a new approach to solving the graph isomorphism problem using parameterized matching. Parameterized matching is a string matching problem in which two strings match if there exists a bijection that maps the symbols of one string onto the symbols of the other. Since parameterized matching is defined for linear structures, the concept of graph linearization is defined to represent the topology of a graph as a walk on it. The solution for determining whether two graphs are isomorphic then consists of determining whether there exists a walk in one of the graphs that parameterized-matches the linearization of the other graph. The proposed solution has two steps: linearization and matching. An efficient algorithm that generates linearizations of approximately optimal length is presented, along with a matching algorithm. It is shown that this solution also solves the subgraph isomorphism problem, which determines whether a graph H is isomorphic to a subgraph of another graph G. The solution was evaluated experimentally on graphs of different types and sizes, and its performance was compared with that of VF2, a competitive graph isomorphism algorithm. The experimental results show that the proposed solution is more efficient than VF2 in several cases, especially on Miyazaki graphs, which are known to be among the hardest cases for graph isomorphism algorithms. This approach is extended to answer queries on semantic graphs. A semantic graph is a graph in which nodes and edges may have identifiers, types, and other attributes. 
These graphs have important applications in various areas, for example modeling social networks in which nodes represent people, photos, and posts, and edges represent friendship, tagging, and mention relationships. Queries are used to extract information from these graphs. Several of these queries are expressed as pattern matching, which consists of finding the matches of a pattern graph P in a semantic graph G. The pattern graph specifies both the structure of the matches to be found and the predicates on attributes that their nodes and edges must satisfy. Clearly, this problem is related to subgraph isomorphism. In addition, a more general type of query on semantic graphs is defined. These new patterns are called generalized pattern graphs. Their goal is to find paths and subgraphs that satisfy certain semantic, structural, and reachability requirements. These patterns are expressive, since they allow (i) using regular expression operators (e.g., Kleene star and union); (ii) specifying structural predicates on nodes and edges; and (iii) evaluating predicates on the attributes of nodes and edges. Traditional graph patterns, reachability queries, their combination, and more can be represented through generalized pattern graphs. The proposed approach is used to solve generalized pattern graphs. The solution has two phases. First, the pattern is linearized, i.e., represented as a walk that includes all of its nodes and edges. There are many linearizations for a given pattern; heuristics are proposed to produce short linearizations that place selective predicates at the beginning. Second, a bijective function is sought that maps each node in the pattern to a node in the semantic graph that satisfies the reachability requirements and the predicates. 
Specifically, a matching algorithm is proposed that traverses the semantic graph following a breadth-first search constrained by the linearization. The solution was evaluated experimentally using a real semantic graph (the DBLP citation network) to assess its practicality and efficiency. The experimental results show that the proposed techniques and optimizations are effective in querying semantic graphs, offering a high reduction factor in execution time when statistics of the semantic graph are used. Doctorad
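The definition of parameterized matching given in the abstract above (two strings match if a bijection on symbols maps one into the other) can be illustrated for the plain string case. This is only a sketch of the definition, not the thesis's graph linearization or matching algorithm; the function name `parameterized_match` is hypothetical.

```python
def parameterized_match(x, y):
    """Return True if some bijection on symbols maps string x onto string y.

    A consistent one-to-one mapping on the symbols that actually occur
    can always be extended to a bijection on a finite alphabet, so it is
    enough to check consistency in both directions.
    """
    if len(x) != len(y):
        return False
    forward, backward = {}, {}
    for a, b in zip(x, y):
        # setdefault records the first pairing and returns the stored one;
        # any later disagreement means no bijection exists.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

For example, `parameterized_match("abab", "cdcd")` holds under the mapping a to c and b to d, while `parameterized_match("ab", "cc")` fails because two symbols would map to the same one.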
Systematic Query Processing on Large-Scale Graphs by Pruned Labeling Methods (枝刈りラベリング法による大規模グラフ上の体系的なクエリ処理)
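Degree type: Doctorate by coursework. Examination committee members: (chief examiner) Professor Naoki Kobayashi (小林 直樹), University of Tokyo; Professor Masami Hagiya (萩谷 昌己), University of Tokyo; Professor Reiji Suda (須田 礼仁), University of Tokyo; Associate Professor Tetsuo Shibuya (渋谷 哲朗), University of Tokyo; Professor Kunihiko Sadakane (定兼 邦彦), University of Tokyo; Professor Satoru Iwata (岩田 覚), University of Tokyo. University of Tokyo(東京大学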