6 research outputs found

    On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

    Full text link
    Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative and complementary distribution approaches. For achieving fair experimental results, we are using Apache Spark as a common parallel computing framework by rewriting the concerned algorithms using the Spark API. Spark provides guarantees in terms of fault tolerance, high availability and scalability which are essential in such systems. Our different implementations aim to highlight the fundamental implementation-independent characteristics of each approach in terms of data preparation, load balancing, data replication and to some extent to query answering cost and performance. The presented measures are obtained by testing each system on one synthetic and one real-world data set over query workloads with differing characteristics and different partitioning constraints.Comment: 16 pages, 3 figure

    Efficient Generation of Rectangulations via Permutation Languages

    Get PDF
    A generic rectangulation is a partition of a rectangle into finitely many interior-disjoint rectangles, such that no four rectangles meet in a point. In this work we present a versatile algorithmic framework for exhaustively generating a large variety of different classes of generic rectangulations. Our algorithms work under very mild assumptions, and apply to a large number of rectangulation classes known from the literature, such as generic rectangulations, diagonal rectangulations, 1-sided/area-universal, block-aligned rectangulations, and their guillotine variants. They also apply to classes of rectangulations that are characterized by avoiding certain patterns, and in this work we initiate a systematic investigation of pattern avoidance in rectangulations. Our generation algorithms are efficient, in some cases even loopless or constant amortized time, i.e., each new rectangulation is generated in constant time in the worst case or on average, respectively. Moreover, the Gray codes we obtain are cyclic, and sometimes provably optimal, in the sense that they correspond to a Hamilton cycle on the skeleton of an underlying polytope. These results are obtained by encoding rectangulations as permutations, and by applying our recently developed permutation language framework

    Efficient generation of rectangulations via permutation languages

    Get PDF
    A generic rectangulation is a partition of a rectangle into finitely many interior-disjoint rectangles, such that no four rectangles meet in a point. In this work we present a versatile algorithmic framework for exhaustively generating a large variety of different classes of generic rectangulations. Our algorithms work under very mild assumptions, and apply to a large number of rectangulation classes known from the literature, such as generic rectangulations, diagonal rectangulations, 1-sided/area-universal, block-aligned rectangulations, and their guillotine variants. They also apply to classes of rectangulations that are characterized by avoiding certain patterns, and in this work we initiate a systematic investigation of pattern avoidance in rectangulations. Our generation algorithms are efficient, in some cases even loopless or constant amortized time, i.e., each new rectangulation is generated in constant time in the worst case or on average, respectively. Moreover, the Gray codes we obtain are cyclic, and sometimes provably optimal, in the sense that they correspond to a Hamilton cycle on the skeleton of an underlying polytope. These results are obtained by encoding rectangulations as permutations, and by applying our recently developed permutation language framework

    Graph database management systems: storage, management and query processing

    Get PDF
    The proliferation of graph data, generated from diverse sources, have given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS, and (iii) efficient query processing in GDBMS. In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we pro- pose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system, and investigate how these systems can support capturing the evolution of a codebase overtime. We study a code comprehension tool that extracts software dependencies and stores them in a graph database. On a versioned graph built using a very large codebase, we demonstrate how existing code comprehension queries can be efficiently processed and also show the benefit of running queries across multiple versions. Another important aspect of this thesis is the study of storage aspects of graph systems. Throughput of many graph queries can be significantly affected by disk I/O performance; therefore graph database systems need to focus on effective graph storage for optimising disk operations. We observe that the locality of edges plays an important role and we address the edge-labeling problem which aims to label both incoming and outgoing edges of a graph maximizing the ‘edge-consecutiveness’ metric. By achieving a better layout and locality of edges on disk, we show that our proposed algorithms result in significantly improved disk I/O performance leading to faster execution of neighbourhood queries. Some applications require the integrated processing of queries from graph and the textual domains within a graph database system. Aggregation of these dimensions facilitates gaining key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, in this thesis we study the top-k social-textual ranking query that essentially requires efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches are able to achieve significantly better results compared to standard baselines and demonstrating robust behaviour under changing parameters

    Parallel and External High Quality Graph Partitioning

    Get PDF
    Partitioning graphs into k blocks of roughly equal size such that few edges run between the blocks is a key tool for processing and analyzing large complex real-world networks. The graph partitioning problem has multiple practical applications in parallel and distributed computations, data storage, image processing, VLSI physical design and many more. Furthermore, recently, size, variety, and structural complexity of real-world networks has grown dramatically. Therefore, there is a demand for efficient graph partitioning algorithms that fully utilize computational power and memory capacity of modern machines. A popular and successful heuristic to compute a high-quality partitions of large networks in reasonable time is multi-level graph partitioning\textit{multi-level graph partitioning} approach which contracts the graph preserving its structure and then partitions it using a complex graph partitioning algorithm. Specifically, the multi-level graph partitioning approach consists of three main phases: coarsening, initial partitioning, and uncoarsening. During the coarsening phase, the graph is recursively contracted preserving its structure and properties until it is small enough to compute its initial partition during the initial partitioning phase. Afterwards, during the uncoarsening phase the partition of the contracted graph is projected onto the original graph and refined using, for example, local search. Most of the research on heuristical graph partitioning focuses on sequential algorithms or parallel algorithms in the distributed memory model. Unfortunately, previous approaches to graph partitioning are not able to process large networks and rarely take in into account several aspects of modern computational machines. Specifically, the amount of cores per chip grows each year as well as the price of RAM reduces slower than the real-world graphs grow. Since HDDs and SSDs are 50 – 400 times cheaper than RAM, external memory makes it possible to process large real-world graphs for a reasonable price. Therefore, in order to better utilize contemporary computational machines, we develop efficient multi-level graph partitioning\textit{multi-level graph partitioning} algorithms for the shared-memory and the external memory models. First, we present an approach to shared-memory parallel multi-level graph partitioning that guarantees balanced solutions, shows high speed-ups for a variety of large graphs and yields very good quality independently of the number of cores used. Important ingredients include parallel label propagation for both coarsening and uncoarsening, parallel initial partitioning, a simple yet effective approach to parallel localized local search, and fast locality preserving hash tables that effectively utilizes caches. The main idea of the parallel localized local search is that each processors refines only a small area around a random vertex reducing interactions between processors. For example, on 79 cores, our algorithms partitions a graph with more than 3 billions of edges into 16 blocks cutting 4.5% less edges than the closest competitor and being more than two times faster. Furthermore, another competitors is not able to partition this graph. We then present an approach to external memory graph partitioning that is able to partition large graphs that do not fit into RAM. Specifically, we consider the semi-external and the external memory model. In both models a data structure of size proportional to the number of edges does not fit into the RAM. The difference is that the former model assumes that a data structure of size proportional to the number of vertices fits into the RAM whereas the latter assumes the opposite. We address the graph partitioning problem in both models by adapting the size-constrained label propagation technique for the semi-external model and by developing a size-constrained clustering algorithm based on graph coloring in the external memory. Our semi-external size-constrained label propagation algorithm (or external memory clustering algorithm) can be used to compute graph clusterings and is a prerequisite for the (semi-)external graph partitioning algorithm. The algorithms are then used for both the coarsening and the uncoarsening phase of a multi-level algorithm to compute graph partitions. Our (semi-)external algorithm is able to partition and cluster huge complex networks with billions of edges on cheap commodity machines. Experiments demonstrate that the semi-external graph partitioning algorithm is scalable and can compute high quality partitions in time that is comparable to the running time of an efficient internal memory implementation. A parallelization of the algorithm in the semi-external model further reduces running times. Additionally, we develop a speed-up technique for the hypergraph partitioning algorithms. Hypergraphs are an extension of graphs that allow a single edge to connect more than two vertices. Therefore, they describe models and processes more accurately additionally allowing more possibilities for improvement. Most multi-level hypergraph partitioning algorithms perform some computations on vertices and their set of neighbors. Since these computations can be super-linear, they have a significant impact on the overall running time on large hypergraphs. Therefore, to further reduce the size of hyperedges, we develop a pin-sparsifier based on the min-hash technique that clusters vertices with similar neighborhood. Further, vertices that belong to the same cluster are substituted by one vertex, which is connected to their neighbors, therefore, reducing the size of the hypergraph. Our algorithm sparsifies a hypergraph such that the resulting graph can be partitioned significantly faster without loss in quality (or with insignificant loss). On average, KaHyPar with sparsifier performs partitioning about 1.5 times faster while preserving solution quality if hyperedges are large. All aforementioned frameworks are publicly available
    corecore