1,353 research outputs found

    Efficient Subgraph Matching on Billion Node Graphs

    The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data. Comment: VLDB201
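A minimal single-machine sketch of index-free subgraph matching by graph exploration, added for illustration. It is not the paper's distributed algorithm: the function names, the dictionary-of-sets graph representation, and the assumption of a connected, undirected, node-labeled query are all hypothetical choices made here.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]


def _bfs_order(adj: Graph) -> List[Hashable]:
    """Visit query nodes so each one (after the first) touches a matched node."""
    start = next(iter(adj))
    order, seen = [start], {start}
    for q in order:  # the list grows while we iterate, giving a BFS
        for n in adj[q]:
            if n not in seen:
                seen.add(n)
                order.append(n)
    return order


def match_subgraph(data_adj: Graph, data_label: dict,
                   query_adj: Graph, query_label: dict) -> List[dict]:
    """Enumerate label-preserving embeddings of the query in the data graph.

    Assumes an undirected graph (symmetric adjacency) and a connected query.
    No index is built; candidates come from exploring matched neighbors.
    """
    order = _bfs_order(query_adj)
    results: List[dict] = []

    def extend(assign: dict) -> None:
        if len(assign) == len(order):
            results.append(dict(assign))
            return
        q = order[len(assign)]
        matched = [n for n in query_adj[q] if n in assign]
        # Explore from an already matched neighbor when one exists.
        candidates = data_adj[assign[matched[0]]] if matched else data_adj.keys()
        for v in candidates:
            if v in assign.values() or data_label[v] != query_label[q]:
                continue
            # Every matched query neighbor must also be a data neighbor of v.
            if all(assign[n] in data_adj[v] for n in matched):
                assign[q] = v
                extend(assign)
                del assign[q]

    extend({})
    return results
```

For example, matching a single `A`-`B` edge against a small graph in which node `a` (label `A`) has two `B`-labeled neighbors yields two embeddings, one per neighbor.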

    Fast Search for Dynamic Multi-Relational Graphs

    Acting on time-critical events by processing ever-growing social media or news streams is a major technical challenge. Many of these data sources can be modeled as multi-relational graphs. Continuous queries or techniques to search for rare events that typically arise in monitoring applications have been studied extensively for relational databases. This work is dedicated to answering the question that emerges naturally: how can we efficiently execute a continuous query on a dynamic graph? This paper presents an exact subgraph search algorithm that exploits the temporal characteristics of representative queries for online news or social media monitoring. The algorithm is based on a novel data structure called the Subgraph Join Tree (SJ-Tree) that leverages the structural and semantic characteristics of the underlying multi-relational graph. The paper concludes with extensive experimentation on several real-world datasets that demonstrates the validity of this approach. Comment: SIGMOD Workshop on Dynamic Networks Management and Mining (DyNetMM), 201

    Answering Spatial Multiple-Set Intersection Queries Using 2-3 Cuckoo Hash-Filters

    We show how to answer spatial multiple-set intersection queries in O(n(log w)/w + kt) expected time, where n is the total size of the t sets involved in the query, w is the number of bits in a memory word, k is the output size, and c is any fixed constant. This improves the asymptotic performance over previous solutions and is based on an interesting data structure, known as 2-3 cuckoo hash-filters. Our results apply in the word-RAM model (or practical RAM model), which allows for constant-time bit-parallel operations, such as bitwise AND, OR, NOT, and MSB (most-significant 1-bit), as exist in modern CPUs and GPUs. Our solutions apply to any multiple-set intersection queries in spatial data sets that can be reduced to one-dimensional range queries, such as spatial join queries for one-dimensional points or sets of points stored along space-filling curves, which are used in GIS applications. Comment: Full version of paper from 2017 ACM SIGSPATIAL International Conference on Advances in Geographic Information System
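The bit-parallel operations mentioned above (bitwise AND and MSB) can be illustrated with a plain bitset intersection. This is only a sketch of the word-level primitives, not the 2-3 cuckoo hash-filter structure from the paper, and it does not achieve the stated time bound. The function names are hypothetical, and elements are assumed to be small non-negative integers.

```python
from typing import Iterable, List, Sequence


def to_bitset(items: Iterable[int]) -> int:
    """Pack non-negative integers into one Python int used as a bitset."""
    bits = 0
    for x in items:
        bits |= 1 << x
    return bits


def intersect_bitsets(sets: Sequence[Iterable[int]]) -> List[int]:
    """Intersect t sets with bitwise AND, then read members back via MSB.

    Assumes at least one input set.
    """
    acc = to_bitset(sets[0])
    for s in sets[1:]:
        acc &= to_bitset(s)  # one AND handles many elements at once
    out: List[int] = []
    while acc:
        msb = acc.bit_length() - 1  # position of the most-significant 1-bit
        out.append(msb)
        acc ^= 1 << msb  # clear that bit
    return out[::-1]  # ascending order
```

Python integers are arbitrary precision, so each `&` here spans many machine words internally; in a word-RAM setting the same idea is applied one w-bit word at a time.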

    AsterixDB: A Scalable, Open Source BDMS

    AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements

    Reasoning & Querying – State of the Art

    Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet where keyword search is used in many applications, e.g. search engines, has familiarized casual users with using keyword queries to retrieve information on the internet. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming at enabling simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF

    Declarative Cleaning, Analysis, and Querying of Graph-structured Data

    Much of today's data including social, biological, sensor, computer, and transportation network data is naturally modeled and represented by graphs. Typically, data describing these networks is observational, and thus noisy and incomplete. Therefore, methods for efficiently managing graph-structured data of this nature are needed, especially with the abundance and increasing sizes of such data. In my dissertation, I develop declarative methods to perform cleaning, analysis and querying of graph-structured data efficiently. For declarative cleaning of graph-structured data, I identify a set of primitives to support the extraction and inference of the underlying true network from observational data, and describe a framework that enables a network analyst to easily implement and combine new extraction and cleaning techniques. The task specification language is based on Datalog with a set of extensions designed to enable different graph cleaning primitives. For declarative analysis, I introduce 'ego-centric pattern census queries', a new type of graph analysis query that supports searching for structural patterns in every node's neighborhood and reporting their counts for further analysis. I define an SQL-based declarative language to support this class of queries, and develop a series of efficient query evaluation algorithms for it. Finally, I present an approach for querying large uncertain graphs that supports reasoning about uncertainty of node attributes, uncertainty of edge existence, and a new type of uncertainty, called identity linkage uncertainty, where a group of nodes can potentially refer to the same real-world entity. I define a probabilistic graph model to capture all these types of uncertainties, and to resolve identity linkage merges. I propose 'context-aware path indexing' and 'join-candidate reduction' methods to efficiently enable subgraph matching queries over large uncertain graphs of this type
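To make the idea of an ego-centric pattern census concrete, the sketch below counts one structural pattern (triangles) around every node of an undirected graph. It is an illustration only: the dissertation defines an SQL-based declarative language and its own evaluation algorithms, which are not reproduced here, and the function name and graph representation are assumptions.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def triangle_census(adj: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, int]:
    """For every node, count triangles it participates in.

    A triangle through `node` is a pair of its neighbors that are
    themselves adjacent. Assumes symmetric adjacency sets and no self-loops.
    """
    counts: Dict[Hashable, int] = {}
    for node, nbrs in adj.items():
        counts[node] = sum(
            1 for u, v in combinations(nbrs, 2) if v in adj[u]
        )
    return counts
```

The per-node counts can then feed further analysis, for example ranking nodes by how densely connected their neighborhoods are.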

    Ontology-based Search Algorithms over Large-Scale Unstructured Peer-to-Peer Networks

    Peer-to-Peer (P2P) systems have emerged as a promising paradigm for structuring large-scale distributed systems. They provide a robust, scalable and decentralized way to share and publish data. Unstructured P2P systems have gained much popularity in recent years for their wide applicability and simplicity. However, efficient resource discovery remains a fundamental challenge for unstructured P2P networks due to the lack of a network structure. To effectively harness the power of unstructured P2P systems, the challenges in distributed knowledge management and information search need to be overcome. Current attempts to solve the problems pertaining to knowledge management and search have focused on simple term-based routing indices and keyword search queries. Many P2P resource discovery applications will require more complex query functionality, as users will publish semantically rich data and need efficient content location algorithms that find target content at moderate cost. Therefore, effective knowledge and data management techniques and search tools for information retrieval are imperative and of lasting importance. In my dissertation, I present a suite of protocols that assist in efficient content location and knowledge management in unstructured Peer-to-Peer overlays. The basis of these schemes is their ability to learn from past peer interactions and increase their performance over time. My work aims to provide effective and bandwidth-efficient searching and data sharing in unstructured P2P environments. A suite of algorithms is presented that provides peers in unstructured P2P overlays with the state necessary to efficiently locate, disseminate and replicate objects. Also, existing approaches to federated search are adapted and new methods are developed for semantic knowledge representation, resource selection, and knowledge evolution for efficient search in dynamic and distributed P2P network environments. Furthermore, autonomous and decentralized algorithms that reorganize an unstructured network topology into one with desired search-enhancing properties are proposed in a network evolution model to facilitate effective and efficient semantic search in dynamic environments.

    Sequence queries on temporal graphs

    Graphs that evolve over time are called temporal graphs. They can be used to describe and represent real-world networks, including transportation networks, social networks, and communication networks, with higher fidelity and accuracy. However, research is still limited on how to manage large scale temporal graphs and execute queries over these graphs efficiently and effectively. This thesis investigates the problems of temporal graph data management related to node and edge sequence queries. In temporal graphs, nodes and edges can evolve over time. Therefore, sequence queries on nodes and edges can be key components in managing temporal graphs. In this thesis, the node sequence query decomposes into two parts: graph node similarity and subsequence matching. For node similarity, this thesis proposes a modified tree edit distance that is metric and polynomially computable and has a natural, intuitive interpretation. Note that the proposed node similarity works even for inter-graph nodes and therefore can be used for graph de-anonymization, network transfer learning, and cross-network mining, among other tasks. The subsequence matching query proposed in this thesis is a framework that can be adopted to index generic sequence and time-series data, including trajectory data and even DNA sequences for subsequence retrieval. For edge sequence queries, this thesis proposes an efficient storage and optimized indexing technique that allows for efficient retrieval of temporal subgraphs that satisfy certain temporal predicates. For this problem, this thesis develops a lightweight data management engine prototype that can support time-sensitive temporal graph analytics efficiently even on a single PC
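As a rough illustration of retrieving edges that satisfy a temporal predicate, the sketch below keeps edges sorted by start time and answers "active at some point in a window" queries. It is not the thesis's storage or indexing technique; the class name, the `(src, dst, start, end)` tuple layout, and the closed-interval semantics are assumptions made for this example.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, int, int]  # (src, dst, start, end), with start <= end


class TemporalEdgeIndex:
    """Edges sorted by start time, so a window query can skip late starters."""

    def __init__(self, edges: Iterable[Edge]) -> None:
        self._edges: List[Edge] = sorted(edges, key=lambda e: e[2])
        self._starts: List[int] = [e[2] for e in self._edges]

    def active_between(self, t0: int, t1: int) -> List[Edge]:
        """Return edges whose [start, end] interval overlaps [t0, t1]."""
        # Edges starting after t1 cannot overlap; binary search cuts them off.
        hi = bisect_right(self._starts, t1)
        # Of the remaining edges, keep those that have not ended before t0.
        return [e for e in self._edges[:hi] if e[3] >= t0]
```

The returned edges form the temporal subgraph for the window. The linear filter on end times is the simple part of this sketch; an interval tree would be one way to tighten it.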