29 research outputs found

    Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach

    Get PDF
    While it is well-known and acknowledged that the performance of graph algorithms is heavily dependent on the input data, there has been surprisingly little research to quantify and predict the impact the graph structure has on performance. Parallel graph algorithms, running on many-core systems such as GPUs, are no exception: most research has focused on how to efficiently implement and tune different graph operations on a specific GPU. However, the performance impact of the input graph has only been taken into account indirectly as a result of the graphs used to benchmark the system. In this work, we present a case study investigating how to use the properties of the input graph to improve the performance of the breadth-first search (BFS) graph traversal. To do so, we first study the performance variation of 15 different BFS implementations across 248 graphs. Using this performance data, we show that significant speed-up can be achieved by combining the best implementation for each level of the traversal. To make use of this data-dependent optimization, we must correctly predict the relative performance of algorithms per graph level, and enable dynamic switching to the optimal algorithm for each level at runtime. We use the collected performance data to train a binary decision tree, to enable high-accuracy predictions and fast switching. We demonstrate empirically that our decision tree is both fast enough to allow dynamic switching between implementations, without noticeable overhead, and accurate enough in its prediction to enable significant BFS speedup. We conclude that our model-driven approach (1) enables BFS to outperform state of the art GPU algorithms, and (2) can be adapted for other BFS variants, other algorithms, or more specific datasets

    Aristotle: A performance impact indicator for the OpenCL kernels using local memory

    Get PDF
    Due to the increasing complexity of multi/many-core architectures (with their mix of caches and scratch-pad memories) and applications (with different memory access patterns), the performance of many workloads becomes increasingly variable. In this work, we address one of the main causes for this performance variability: the efficiency of the memory system. Specifically, based on an empirical evaluation driven by memory access patterns, we qualify and partially quantify the performance impact of using local memory in multi/many-core processors. To do so, we systematically describe memory access patterns (MAPs) in an application-agnostic manner. Next, for each identified MAP, we use OpenCL (for portability reasons) to generate two microbenchmarks: a "naive" version (without local memory) and "an optimized" version (using local memory). We then evaluate both of them on typically used multi-core and many-core platforms, and we log their performance. What we eventually obtain is a local memory performance database, indexed by various MAPs and platforms. Further, we propose a set of composing rules for multiple MAPs. Thus, we can get an indicator of whether using local memory is beneficial in the presence of multiple memory access patterns. This indication can be used to either avoid the hassle of implementing optimizations with too little gain or, alternatively, give a rough prediction of the performance gain

    Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms

    Full text link
    The layout of multi-dimensional data can have a significant impact on the efficacy of hardware caches and, by extension, the performance of applications. Common multi-dimensional layouts include the canonical row-major and column-major layouts as well as the Morton curve layout. In this paper, we describe how the Morton layout can be generalized to a very large family of multi-dimensional data layouts with widely varying performance characteristics. We posit that this design space can be efficiently explored using a combinatorial evolutionary methodology based on genetic algorithms. To this end, we propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts using cache simulation. We show that our fitness function correlates to kernel running time in real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties in four out of the eight real-world applications under consideration in a small number of generations. Finally, we demonstrate that the array layouts found using our evolutionary method perform well not only in simulated environments but that they can effect significant performance gains -- up to a factor ten in extreme cases -- in real hardware

    EXTRA: Towards an efficient open platform for reconfigurable High Performance Computing

    Get PDF
    To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with a high degree of specialization. Ideally, dynamic reconfiguration will be an intrinsic feature, so that specific HPC application features can be optimally accelerated, even if they regularly change over time. In the EXTRA project, we create a new and flexible exploration platform for developing reconfigurable architectures, design tools and HPC applications with run-time reconfiguration built-in as a core fundamental feature instead of an add-on. EXTRA covers the entire stack from architecture up to the application, focusing on the fundamental building blocks for run-time reconfigurable exascale HPC systems: new chip architectures with very low reconfiguration overhead, new tools that truly take reconfiguration as a central design concept, and applications that are tuned to maximally benefit from the proposed run-time reconfiguration techniques. Ultimately, this open platform will improve Europe's competitive advantage and leadership in the field

    The Future is Big Graphs! A Community View on Graph Processing Systems

    Get PDF
    Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?Comment: 12 pages, 3 figures, collaboration between the large-scale systems and data management communities, work started at the Dagstuhl Seminar 19491 on Big Graph Processing Systems, to be published in the Communications of the AC

    An effective strategy for porting c++ applications on cell

    No full text
    In this paper we present a solution for efficient porting of sequential C++ applications on the Cell B.E. processor. We present our step-by-step approach, focusing on its generality, we provide a set of code templates and optimization guidelines to support the porting, and we include a set of equations to estimate the performance gain of the new application. As a case-study, we show the use of our solution on a multimedia content analysis application, named MAR-VEL. The results of our experiments with MARVEL prove the significant performance increase in favor of the application running on Cell when compared with the reference implementation
    corecore