12 research outputs found

    TopCom: Index for Shortest Distance Query in Directed Graph

    Get PDF
    Finding shortest distance between two vertices in a graph is an important problem due to its numerous applications in diverse domains, including geo-spatial databases, social network analysis, and information retrieval. Classical algorithms (such as, Dijkstra) solve this problem in polynomial time, but these algorithms cannot provide real-time response for a large number of bursty queries on a large graph. So, indexing based solutions that pre-process the graph for efficiently answering (exactly or approximately) a large number of distance queries in real-time is becoming increasingly popular. Existing solutions have varying performance in terms of index size, index building time, query time, and accuracy. In this work, we propose T OP C OM , a novel indexing-based solution for exactly answering distance queries. Our experiments with two of the existing state-of-the-art methods (IS-Label and TreeMap) show the superiority of T OP C OM over these two methods considering scalability and query time. Besides, indexing of T OP C OM exploits the DAG (directed acyclic graph) structure in the graph, which makes it significantly faster than the existing methods if the SCCs (strongly connected component) of the input graph are relatively small

    A framework for identifying genotypic information from clinical records: exploiting integrated ontology structures to transfer annotations between ICD codes and Gene Ontologies

    Get PDF
    Although some methods are proposed for automatic ontology generation, none of them address the issue of integrating large-scale heterogeneous biomedical ontologies. We propose a novel approach for integrating various types of ontologies efficiently and apply it to integrate International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9CM) and Gene Ontologies (GO). This approach is one of the early attempts to quantify the associations among clinical terms (e.g. ICD9 codes) based on their corresponding genomic relationships. We reconstructed a merged tree for a partial set of GO and ICD9 codes and measured the performance of this tree in terms of associations’ relevance by comparing them with two well-known disease-gene datasets (i.e. MalaCards and Disease Ontology). Furthermore, we compared the genomic-based ICD9 associations to temporal relationships between them from electronic health records. Our analysis shows promising associations supported by both comparisons suggesting a high reliability. We also manually analyzed several significant associations and found promising support from literature

    Efficient Computation of Distance Labeling for Decremental Updates in Large Dynamic Graphs

    Get PDF
    Since today's real-world graphs, such as social network graphs, are evolving all the time, it is of great importance to perform graph computations and analysis in these dynamic graphs. Due to the fact that many applications such as social network link analysis with the existence of inactive users need to handle failed links or nodes, decremental computation and maintenance for graphs is considered a challenging problem. Shortest path computation is one of the most fundamental operations for managing and analyzing large graphs. A number of indexing methods have been proposed to answer distance queries in static graphs. Unfortunately, there is little work on answering such queries for dynamic graphs. In this paper, we focus on the problem of computing the shortest path distance in dynamic graphs, particularly on decremental updates (i.e., edge deletions). We propose maintenance algorithms based on distance labeling, which can handle decremental updates efficiently. By exploiting properties of distance labeling in original graphs, we are able to efficiently maintain distance labeling for new graphs. We experimentally evaluate our algorithms using eleven real-world large graphs and confirm the effectiveness and efficiency of our approach. More specifically, our method can speed up index re-computation by up to an order of magnitude compared with the state-of-the-art method, Pruned Landmark Labeling (PLL)

    Fast Exact Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling

    Full text link
    We propose a new exact method for shortest-path distance queries on large-scale networks. Our method precomputes distance labels for vertices by performing a breadth-first search from every vertex. Seemingly too obvious and too inefficient at first glance, the key ingredient introduced here is pruning during breadth-first searches. While we can still answer the correct distance for any pair of vertices from the labels, it surprisingly reduces the search space and sizes of labels. Moreover, we show that we can perform 32 or 64 breadth-first searches simultaneously exploiting bitwise operations. We experimentally demonstrate that the combination of these two techniques is efficient and robust on various kinds of large-scale real-world networks. In particular, our method can handle social networks and web graphs with hundreds of millions of edges, which are two orders of magnitude larger than the limits of previous exact methods, with comparable query time to those of previous methods.Comment: To appear in SIGMOD 201

    K-Reach: Who is in Your Small World

    Full text link
    We study the problem of answering k-hop reachability queries in a directed graph, i.e., whether there exists a directed path of length k, from a source query vertex to a target query vertex in the input graph. The problem of k-hop reachability is a general problem of the classic reachability (where k=infinity). Existing indexes for processing classic reachability queries, as well as for processing shortest path queries, are not applicable or not efficient for processing k-hop reachability queries. We propose an index for processing k-hop reachability queries, which is simple in design and efficient to construct. Our experimental results on a wide range of real datasets show that our index is more efficient than the state-of-the-art indexes even for processing classic reachability queries, for which these indexes are primarily designed. We also show that our index is efficient in answering k-hop reachability queries.Comment: VLDB201

    Joint search by social and spatial proximity

    Get PDF
    Ministry of Education, Singapore under its Academic Research Funding Tier

    On the Evaluation of Pattern Match Queries in Large Graph Databases

    Get PDF
    Recently, graph databases have been received much attention in the research community due to their extensive applications in practice, such as social networks, biological networks and World Wide Web, which bring forth a lot of challenging data management problems including subgraph search, shortest-path query, reachability verification, pattern matching, and so on. Among them, the graph pattern matching is to find all matches in a data graph for a given pattern graph and is more general and flexible than other problems mentioned above. In this thesis, we address a kind of graph matching, the so-called pattern matching with δ, by which an edge in is allowed to match a path of length ≤ δ in . In order to reduce the search space when exploring to find matches, we propose a novel pruning algorithm to eliminate all unqualified vertices. We also propose a strategy to speed up the distance-based join over two lists of vertices. Extensive experiments have been conducted, which show that our approach makes great improvements in running time compared to existing ones.Master of Science in Applied Computer Scienc

    The Graph Pattern Matching Problem through Parameterized Matching

    Get PDF
    We propose a new approach to solve graph isomorphism using parameterized matching. Parameterized matching is a string matching problem where two strings parameterized-match if there exists a bijective function, on the symbols of the alphabet, that maps one of the strings into the other. Given that parameterized matching is defined for linear structures, we define the concept of graph linearization to represent the topology of a graph as a walk on it. Then, our approach to determine whether two graphs are isomorphic consists of determining whether there exists a walk in one of the graphs that parameterized-matches a linearization of the other graph. Our solution has two main steps: linearization and matching. We develop an efficient linearization algorithm, that generates short linearizations with an approximation guarantee, and develop a graph matching algorithm. We show that this solution also works for subgraph isomorphism, which is the problem of determining whether an input graph H is isomorphic to a subgraph of another input graph G. We evaluate our approach experimentally on graphs of different types and sizes, and compare to the performance of VF2, which is a prominent algorithm for graph isomorphism. Our empirical measurements show that graph linearization finds a matching graph faster than VF2 in many cases, especially in Miyazaki-constructed graphs which are known to be one of the hardest cases for graph isomorphism algorithms. We extend this approach to query attributed graphs. An attributed graph is a graph data structure, in which nodes and edges may have identifiers, types and other attributes. Attributed graphs are used in many application domains, for example to model social networks in which nodes represent people, photos, and postings and edges represent friendship, person-tagged-in-photo and mentioned-in-post relationships. Queries are used to extract information from such graphs. Several graph queries are expressed as graph pattern matching, which is the problem of finding all instances of pattern match query P in a larger attributed graph G. A pattern match query may specify both a graph structure and predicates on the attributes of the graph elements. Clearly, this problem is associated to subgraph isomorphism. Furthermore, we define a more general class of graph queries called generalized pattern queries on attributed multigraphs. The goal of this class is to find paths and subgraphs that satisfy query reachability and predicates. The query language is expressive: It allows (i) using regular expression operators (e.g., Kleene star and union); (ii) specifying structural predicates on graph nodes and edges; and (iii) using attribute predicates on nodes and edges. Pattern match queries, reachability queries, their combination, and even more queries can be expressed through generalized pattern queries. We use our approach to solve this new type of queries. The proposed technique has two phases. First, the query is linearized, i.e., represented as a graph walk that covers all nodes and edges. There are several linearizations for a given query; we derive heuristics to produce a good linearization that is short and places selective predicates early in the linearization. Second, we search for a bijective function that maps each element of the query to an element of the attributed multigraph that satisfies the reachability requirements and the predicates. Specifically, we develop an algorithm that matches the linearization by traversing the attributed graph in a manner similar to a breadth first traversal constrained by the linearization. We evaluate our solution experimentally using a real graph (the DBLP citation network) to assess its practicality and efficiency. Our results show that our techniques and optimizations are effective in querying attributed graphs, offering several factors of reduction in query response time when graph statistics are utilized.Resumen. En esta tesis se propone un nuevo enfoque de solución para resolver el problema de isomorfismo de grafos usando búsqueda parametrizada. La búsqueda parametrizada es un problema de búsqueda de cadenas de texto en el cual dos cadenas coinciden si existe una biyección que mapee los símbolos de una cadena en los símbolos de la otra. Dado que la búsqueda parametrizada está definida para estructuras lineales, se define el concepto de linearización de grafos para representar la topología de un grafo como un camino sobre este. Entonces, la solución para determinar si dos grafos son isomorfos consiste en determinar si existe un camino en uno de los grafos que haga coincidencia parametrizada con la linearización del otro grafo. La solución propuesta tiene dos pasos: linearización y búsqueda. Se presenta un algoritmo eficiente que genera linearizaciones aproximadamente óptimas en longitud, y un algoritmo de búsqueda. Se demuestra que esta solución también resuelve el problema de isomorfismo de subgrafos, en el cual se determina si un grafo H es isomorfo a un subgrafo de otro grafo G. Se evaluó experimentalmente la solución con grafos de diferentes tipos y tamaños. Se comparó su desempeño con el de VF2, el cual es un algoritmo competitivo de isomorfismo de grafos. Los resultados experimentales muestran que la solución propuesta es más eficiente que VF2 en varios casos, en especial en grafos Miyazaki, los cuales se caracterizan por constituir uno de los casos más difíciles para los algoritmos de isomorfismo de grafos. Este enfoque de solución se extiende para resolver consultas sobre grafos semánticos. Un grafo semántico es un grafo en el cual los nodes y arcos pueden tener identificadores, tipos y otros atributos. Estos grafos tienen aplicaciones importantes en diversas áreas, como por ejemplo para modelar redes sociales en las que los nodos representan personas, fotos y publicaciones y los arcos representan relaciones de amistad, etiquetado y mención. Se usan consultas para extraer información de estos grafos. Varias de estas consultas se expresan como búsqueda de patrones, la cual consiste en encontrar las coincidencias del grafo patrón P en un grafo semántico G. El grafo patrón especifica tanto la estructura de las coincidencias a encontrar, como los predicados sobre los atributos que deben cumplir los nodos y los arcos de las mismas. Claramente, este problema está asociado al isomorfismo de subgrafos. Además, se define un tipo de consultas más general sobre grafos semánticos. Estos nuevos patrones se denominan grafos patrón generalizados. El objetivo de estos es encontrar caminos y subgrafos que satisfagan ciertos requisitos semánticos, de estructura y de alcance. Estos patrones son expresivos, pues permiten (i) usar operadores de expresiones regulares (e.g., la estrella de Kleene y la unión); (ii) especificar predicados estructurales en los nodos y arcos; y (iii) evaluar predicados sobre los atributos de los nodos y arcos. Los patrones grafo tradicionales, las consultas de alcance, la combinación de estos y más se pueden representar a través de grafos patrón generalizados. Se usa el enfoque de solución propuesto para resolver los grafos patrón generalizados. La solución tiene dos fases. Primero, el patrón es linearizado, i.e., representado como un camino que incluye todos sus nodos y arcos. Hay muchas linearizaciones para un patrón dado; se proponen heurísticas para producir linearizaciones cortas que ubican los predicados selectivos al comienzo. Segundo, se busca una función biyectiva que mapee cada nodo en el patrón a un nodo en el grafo semántico que satisfaga los requisitos de alcance y los predicados. Específicamente, se propone un algoritmo de búsqueda que recorre el grafo semántico siguiendo una búsqueda en amplitud restringida por la linearización. La solución se evaluó experimentalmente usando un grafo semántico real (la red de citaciones DBLP) para evaluar su practicidad y eficiencia. Los resultados experimentales muestran que las técnicas y optimizaciones propuestas son efectivas en consultar grafos semánticos, ofreciendo un alto factor de reducción en el tiempo de ejecución cuando se utilizan las estadísticas del grafo semántico.Doctorad

    枝刈りラベリング法による大規模グラフ上の体系的なクエリ処理

    Get PDF
    学位の種別: 課程博士審査委員会委員 : (主査)東京大学教授 小林 直樹, 東京大学教授 萩谷 昌己, 東京大学教授 須田 礼仁, 東京大学准教授 渋谷 哲朗, 東京大学教授 定兼 邦彦, 東京大学教授 岩田 覚University of Tokyo(東京大学
    corecore