
    Declarative Cleaning, Analysis, and Querying of Graph-structured Data

    Much of today's data, including social, biological, sensor, computer, and transportation network data, is naturally modeled and represented by graphs. Typically, the data describing these networks is observational, and thus noisy and incomplete. Therefore, methods for efficiently managing graph-structured data of this nature are needed, especially given the abundance and increasing size of such data. In my dissertation, I develop declarative methods to perform cleaning, analysis, and querying of graph-structured data efficiently. For declarative cleaning of graph-structured data, I identify a set of primitives to support the extraction and inference of the underlying true network from observational data, and describe a framework that enables a network analyst to easily implement and combine new extraction and cleaning techniques. The task specification language is based on Datalog, with a set of extensions designed to enable different graph cleaning primitives. For declarative analysis, I introduce 'ego-centric pattern census queries', a new type of graph analysis query that supports searching for structural patterns in every node's neighborhood and reporting their counts for further analysis. I define an SQL-based declarative language to support this class of queries, and develop a series of efficient query evaluation algorithms for it. Finally, I present an approach for querying large uncertain graphs that supports reasoning about uncertainty of node attributes, uncertainty of edge existence, and a new type of uncertainty, called identity linkage uncertainty, where a group of nodes can potentially refer to the same real-world entity. I define a probabilistic graph model to capture all these types of uncertainty and to resolve identity linkage merges. I propose 'context-aware path indexing' and 'join-candidate reduction' methods to efficiently enable subgraph matching queries over large uncertain graphs of this type.
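
    As a concrete illustration of an ego-centric pattern census, the following is a minimal sketch, assuming Python with the networkx library and using triangles as the structural pattern; the naive per-node loop conveys only the query semantics, not the dissertation's SQL-based language or its optimized evaluation algorithms.

        # Illustrative sketch: count triangles inside every node's 1-hop
        # neighborhood (its "ego network") -- one instance of an
        # ego-centric pattern census.
        import networkx as nx

        def ego_pattern_census(g: nx.Graph) -> dict:
            census = {}
            for v in g.nodes:
                ego = nx.ego_graph(g, v)  # v together with its 1-hop neighborhood
                # nx.triangles counts, per node, the triangles it belongs to;
                # dividing the sum by 3 counts each triangle exactly once.
                census[v] = sum(nx.triangles(ego).values()) // 3
            return census

        print(ego_pattern_census(nx.karate_club_graph()))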

    Graph Processing in Main-Memory Column Stores

    More and more, both novel and traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. Worse, graph algorithms exhibit a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than through custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries. Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims to store, manipulate, and query graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data and relational data to be processed seamlessly in the same system. We propose a columnar storage representation for graph data to leverage the existing, mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE, we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updatable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime.
We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, making it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
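
The core storage and traversal idea can be sketched outside an RDBMS in a few lines of Python. The CSR-style columnar layout (two flat arrays) and the level-synchronous traversal below are illustrative assumptions; GRAPHITE's actual operators additionally involve topology-aware operator selection, secondary indexes, and LLVM code generation.

    # Columnar adjacency: the neighbors of vertex v occupy
    # targets[offsets[v]:offsets[v + 1]].
    offsets = [0, 2, 4, 5, 6]     # 4 vertices (0..3), 6 directed edges
    targets = [1, 2, 0, 3, 3, 1]  # neighbor column, grouped by source vertex

    def traverse(start: int) -> set:
        """Breadth-first traversal expressed purely as set operations."""
        visited, frontier = {start}, {start}
        while frontier:
            # Expand the frontier through the columnar arrays, then use a
            # set difference to drop vertices that were already visited.
            expanded = {w for v in frontier
                          for w in targets[offsets[v]:offsets[v + 1]]}
            frontier = expanded - visited
            visited = visited | frontier
        return visited

    print(traverse(0))  # {0, 1, 2, 3}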

    What Else Can Voronoi Diagrams Do for Diameter in Planar Graphs?


    Improving Anycast with Measurements

    Since the first Distributed Denial-of-Service (DDoS) attacks were launched, the strength of such attacks has been steadily increasing, from a few megabits per second to well into the terabits-per-second range. The damage that these attacks cause, mostly in terms of financial cost, has prompted researchers and operators alike to investigate and implement mitigation strategies. Examples of such strategies include local filtering appliances, Border Gateway Protocol (BGP)-based blackholing, and outsourced mitigation in the form of cloud-based DDoS protection providers. Some of these strategies are better suited to high-bandwidth DDoS attacks than others. For example, using a local filtering appliance means that all the attack traffic still passes through the owner's network, which inherently limits the maximum capacity of such a device to the available bandwidth. BGP blackholing does not have such limitations, but can, as a side effect, cause service disruptions to end users. A different strategy, one that has not attracted much attention in academia, is based on anycast. Anycast is a technique that allows operators to replicate their service across different physical locations while keeping that service addressable with just a single IP address. It relies on BGP to load-balance users. In practice, it is combined with other mitigation strategies to allow those to scale up: operators can use anycast to scale their mitigation capacity horizontally. Because anycast relies on BGP, and therefore in essence on the Internet itself, it can be difficult for network engineers to fine-tune this balancing behavior. In this thesis, we show that this is indeed the case through two case studies. In the first, we focus on an anycast service during normal operations, namely the Google Public DNS, and show that the routing within this service is far from optimal, for example in terms of distance between the client and the server. In the second case study, we observe the root DNS while it is under attack, and show that even though, in aggregate, the bandwidth available to this service exceeded that of the attack we observed, clients still experienced service degradation. This degradation occurred because some sites of the anycast service received a much higher share of the traffic than others. For operators to improve their anycast networks, and to optimize them in terms of resilience against DDoS attacks, a method to assess the actual state of such a network is required. Existing methodologies typically rely on external vantage points, such as those provided by RIPE Atlas, and are therefore limited in scale and inherently biased in terms of distribution. We propose a new measurement methodology, named Verfploeter, to assess the characteristics of anycast networks in terms of client-to-Point-of-Presence (PoP) mapping, i.e., the anycast catchment. This method does not rely on external vantage points, is free of bias, and offers a much higher resolution than any previous method. We validated this methodology by deploying it on a locally developed testbed as well as on the B root DNS. We showed that the increased resolution of this methodology improved our ability to assess the impact of changes in the network configuration, compared to previous methodologies. As a final validation, we implemented Verfploeter on Cloudflare's global-scale anycast Content Delivery Network (CDN), which has almost 200 Points-of-Presence worldwide and an aggregate bandwidth of 30 Tbit/s.
Through three real-world use cases, we demonstrate the benefits of our methodology. Firstly, we show that the changes that occur when withdrawing routes from certain PoPs can be accurately mapped, and that in certain cases the effect of taking down a combination of PoPs can be calculated from individual measurements. Secondly, we show that Verfploeter largely reinstates the ping to its former glory, demonstrating how it can be used to troubleshoot network connectivity issues in an anycast context. Thirdly, we demonstrate how accurate anycast catchment maps offer operators a new and highly accurate tool to identify and filter spoofed traffic. Where possible, we make the datasets collected over the course of the research in this thesis available as open access data. The two best (open) dataset awards that these datasets received confirm that they are a valued contribution. In summary, we have investigated two large anycast services and have shown that their deployments are not optimal. We developed a novel measurement methodology that is free of bias and able to obtain highly accurate anycast catchment mappings. By implementing this methodology and deploying it on a global-scale anycast network, we show that our method adds significant value to the fast-growing anycast CDN industry and enables new ways of detecting, filtering, and mitigating DDoS attacks.
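
The mapping step of a Verfploeter-style measurement reduces to simple aggregation once the replies are collected. The sketch below assumes Python and an invented record format of (probed /24, anycast site that captured the ICMP echo reply); the real system probes millions of /24s from the anycast prefix itself, but the catchment map and the per-site share fall out of the same aggregation.

    # Hypothetical reply log: which anycast site (PoP) captured the echo
    # reply for each probed /24. Prefixes and site names are invented.
    from collections import Counter

    replies = [
        ("192.0.2.0/24", "AMS"),
        ("198.51.100.0/24", "AMS"),
        ("203.0.113.0/24", "IAD"),
    ]

    # The catchment map: how many prefixes land at each site, and each
    # site's share -- a heavily skewed share is the kind of imbalance
    # observed during the root DNS attack described above.
    catchment = Counter(site for _prefix, site in replies)
    total = sum(catchment.values())
    for site, n in catchment.most_common():
        print(f"{site}: {n} prefixes ({n / total:.0%} of catchment)")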

    Spatial Keyword Querying: Ranking Evaluation and Efficient Query Processing


    High Performance Density-Based Clustering on Massive Data


    A State-Based Proactive Approach To Network Isolation Verification In Clouds

    The multi-tenant nature of public clouds usually leads to cloud tenants' concerns over network isolation around their virtual resources. Verifying network isolation in clouds faces unique challenges. The sheer size of virtual infrastructures, paired with the self-serviced nature of clouds, means the verification will likely have a high complexity, and yet its results may become obsolete in seconds. Moreover, the fine-grained and distributed network access control (e.g., per-VM security group rules) typical of virtual cloud infrastructures means the verification must examine not only the events but also the current state of the infrastructure. In this thesis, we propose VMGuard, a state-based proactive approach for efficiently verifying large-scale virtual infrastructures against network isolation policies. Informally, our key idea is to proactively trigger the verification based on predicted events and their simulated impact upon the current state, so that we can have the best of both worlds, i.e., the efficiency of a proactive approach and the effectiveness of state-based verification. We implement and evaluate VMGuard on OpenStack, and our experiments with both real and synthetic data demonstrate its performance and efficiency.
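
    The proactive, state-based idea can be shown with a toy model. The following is a minimal sketch, assuming Python and an invented state layout (each VM's tenant plus the set of permitted VM-to-VM pairs); VMGuard itself operates on real OpenStack data models, but the pattern is the same: apply a predicted event to a copy of the current state and verify the isolation policy before the event actually occurs.

        # Toy cloud state: tenant ownership per VM and the VM-to-VM pairs
        # currently permitted by security-group rules. All names invented.
        import copy

        state = {
            "tenant": {"vm_a": "alice", "vm_b": "bob"},
            "allowed": set(),
        }

        def isolated(s) -> bool:
            """Isolation policy: no rule may connect VMs of different tenants."""
            return all(s["tenant"][src] == s["tenant"][dst]
                       for src, dst in s["allowed"])

        def verify_predicted(s, event) -> bool:
            """Simulate a predicted rule addition on a copy of the state."""
            future = copy.deepcopy(s)
            future["allowed"].add(event)
            return isolated(future)

        # The verdict is ready before the rule is actually added.
        print(verify_predicted(state, ("vm_a", "vm_b")))  # False: cross-tenant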

    Mining and Managing Large-Scale Temporal Graphs

    Large-scale temporal graphs are everywhere in our daily life. From online social networks, mobile networks, and brain networks to computer systems, entities in these large complex systems communicate with each other, and their interactions evolve over time. Unlike traditional graphs, temporal graphs are dynamic: both topologies and attributes on nodes/edges may change over time. On the one hand, the dynamics have inspired new applications that rely on mining and managing temporal graphs. On the other hand, the dynamics also raise new technical challenges. First, it is difficult to discover or retrieve knowledge from complex temporal graph data. Second, because of the extra time dimension, we also face new scalability problems. To address these new challenges, we need to develop new methods that model temporal information in graphs so that we can deliver useful knowledge, new queries with temporal and structural constraints through which users can obtain the desired knowledge, and new algorithms that are cost-effective for both mining and management tasks.

    In this dissertation, we discuss our recent work on mining and managing large-scale temporal graphs. First, we investigate two mining problems: node ranking and link prediction. In these works, temporal graphs are applied to model the data generated from computer systems and online social networks. We formulate data mining tasks that extract knowledge from temporal graphs. The discovered knowledge can help domain experts identify critical alerts in system monitoring applications and recover the complete traces of information propagation in online social networks. To address computational efficiency problems, we leverage the unique properties of temporal graphs to simplify the mining processes. The resulting mining algorithms scale well to large temporal graphs with millions of nodes and billions of edges. Through experimental studies over real-life and synthetic data, we confirm the effectiveness and efficiency of our algorithms.

    Second, we focus on temporal graph management problems. In these studies, temporal graphs are used to model datacenter networks, mobile networks, and subscription relationships between stream queries and data sources. We formulate graph queries to retrieve knowledge that supports applications in cloud service placement, information routing in mobile networks, and query assignment in stream processing systems. We investigate three types of queries: subgraph matching, temporal reachability, and graph partitioning. By utilizing the relatively stable components in these temporal graphs, we develop flexible data management techniques that enable fast query processing and handle graph dynamics. We evaluate the soundness of the proposed techniques on both real and synthetic data.

    Through these studies, we have learned valuable lessons. For temporal graph mining, the temporal dimension does not necessarily increase computational complexity; instead, it may reduce it if temporal information is wisely utilized. For temporal graph management, temporal graphs in real applications often include relatively stable components, which help us develop flexible data management techniques that enable fast query processing and handle dynamic changes in temporal graphs.
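
    Of the three query types, temporal reachability shows most directly how the time dimension changes the problem: a node is reachable only along paths whose edge timestamps are non-decreasing. The following is a minimal sketch, assuming Python and a plain (source, destination, timestamp) edge list with distinct timestamps; the single scan in time order computes earliest arrival times, whereas the dissertation's techniques additionally exploit the stable components of real temporal graphs to scale to billions of edges.

        def temporal_reachable(edges, source):
            """edges: (u, v, t) triples; returns {node: earliest arrival time}
            over time-respecting paths that start at `source` at time 0."""
            arrival = {source: 0}
            # One pass over edges in time order suffices (with distinct
            # timestamps): when edge (u, v, t) is scanned, the earliest
            # arrival at u is already known.
            for u, v, t in sorted(edges, key=lambda e: e[2]):
                if u in arrival and arrival[u] <= t:
                    arrival[v] = min(arrival.get(v, float("inf")), t)
            return arrival

        edges = [("a", "b", 1), ("b", "c", 3), ("c", "d", 2)]
        # 'd' is unreachable: the edge into it (time 2) precedes arrival at 'c'.
        print(temporal_reachable(edges, "a"))  # {'a': 0, 'b': 1, 'c': 3}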

    Acta Cybernetica: Volume 21, Number 3.


    Multimodal Code Search
