
    Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture

    Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of terascale integration. Among emerging killer applications, parallel graph processing has been a critical technique for analyzing connected data. In this paper, we empirically evaluate various computing platforms, including an Intel Xeon E5 CPU, an Nvidia GeForce GTX1070 GPU, and a Xeon Phi 7210 processor codenamed Knights Landing (KNL), in the domain of parallel graph processing. We show that the KNL achieves encouraging performance when processing graphs, making it a promising solution for accelerating multi-threaded graph applications. We further characterize the impact of KNL architectural enhancements on the performance of a state-of-the-art graph framework. We have four key observations: (1) Different graph applications require different numbers of threads to reach peak performance. Even for the same application, different datasets need different numbers of threads to achieve the best performance. (2) Only a few graph applications benefit from the high-bandwidth MCDRAM, while others favor the low-latency DDR4 DRAM. (3) Vector processing units executing AVX512 SIMD instructions on KNLs are underutilized when running the state-of-the-art graph framework. (4) The sub-NUMA cache clustering mode, which offers the lowest local memory access latency, hurts the performance of graph benchmarks that lack NUMA awareness. Finally, we suggest future work, including system auto-tuning tools and graph framework optimizations, to fully exploit the potential of KNL for parallel graph processing. Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance Characterization of Multi-threaded Graph Processing Applications on Many-Integrated-Core Architecture," 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Belfast, United Kingdom, 2018, pp. 199-20

    AlSub: Fully Parallel and Modular Subdivision

    In recent years, mesh subdivision (the process of forging smooth free-form surfaces from coarse polygonal meshes) has become an indispensable production instrument. Although subdivision performance is crucial during simulation, animation and rendering, state-of-the-art approaches still rely on serial implementations for complex parts of the subdivision process. Therefore, they often fail to harness the power of modern parallel devices, like the graphics processing unit (GPU), for large parts of the algorithm and must resort to time-consuming serial preprocessing. In this paper, we show that a complete parallelization of the subdivision process for modern architectures is possible. Building on sparse matrix linear algebra, we show how to structure the complete subdivision process into a sequence of algebra operations. By restructuring and grouping these operations, we adapt the process for different use cases, such as regular subdivision of dynamic meshes, uniform subdivision for immutable topology, and feature-adaptive subdivision for efficient rendering of animated models. As the same machinery is used for all use cases, identical subdivision results are achieved in all parts of the production pipeline. As a second contribution, we show how these linear algebra formulations can effectively be translated into efficient GPU kernels. Applying our strategies to √3, Loop and Catmull-Clark subdivision shows significant speedups of our approach compared to state-of-the-art solutions, while we completely avoid serial preprocessing. Comment: Changed structure; added content; improved description
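
As a rough illustration of what "subdivision as sparse linear algebra" can look like, the sketch below builds one sparse row per mesh edge and applies it to the vertex positions as a matrix-vector product. It is not the paper's formulation or its GPU kernels: the function names, the dict-based row storage, and the two-triangle test mesh are assumptions made for this example, and only the well-known Loop edge-point rule (weights 3/8, 3/8, 1/8, 1/8 for interior edges, plain midpoints on the boundary) is covered.

```python
from collections import defaultdict

def loop_edge_point_rows(faces):
    """Build one sparse row per edge; each row is a dict {vertex_index: weight}.

    Hypothetical helper for illustration only.
    """
    opposite = defaultdict(list)  # undirected edge -> vertices opposite to it
    for a, b, c in faces:
        for u, v, w in ((a, b, c), (b, c, a), (c, a, b)):
            opposite[(min(u, v), max(u, v))].append(w)

    rows = {}
    for (u, v), opp in sorted(opposite.items()):
        if len(opp) == 2:
            # Interior edge: Loop weights 3/8 for the endpoints, 1/8 for the opposite vertices.
            rows[(u, v)] = {u: 0.375, v: 0.375, opp[0]: 0.125, opp[1]: 0.125}
        else:
            # Boundary edge: plain midpoint.
            rows[(u, v)] = {u: 0.5, v: 0.5}
    return rows

def apply_rows(rows, positions):
    """Sparse matrix-vector product: each new point is a weighted sum of old positions."""
    return {
        edge: tuple(sum(w * positions[i][d] for i, w in row.items()) for d in range(3))
        for edge, row in rows.items()
    }

# Two triangles sharing the edge (1, 2); positions form a unit square in the z = 0 plane.
faces = [(0, 1, 2), (2, 1, 3)]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
edge_points = apply_rows(loop_edge_point_rows(faces), positions)
```

Because the rows depend only on connectivity, they could be built once and reused while the positions change, which is the kind of restructuring the abstract alludes to for meshes with immutable topology.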

    Single-Strip Triangulation of Manifolds with Arbitrary Topology

    Triangle strips have been widely used for efficient rendering. It is NP-complete to test whether a given triangulated model can be represented as a single triangle strip, so many heuristics have been proposed to partition models into a few long strips. In this paper, we present a new algorithm for creating a single triangle loop or strip from a triangulated model. Our method applies a dual graph matching algorithm to partition the mesh into cycles, and then merges pairs of cycles by splitting adjacent triangles when necessary. New vertices are introduced at midpoints of edges and the new triangles thus formed are coplanar with their parent triangles; hence, the visual fidelity of the geometry is not changed. We prove that the increase in the number of triangles due to this splitting is 50% in the worst case; however, for all models we tested, the increase was less than 2%. We also prove tight bounds on the number of triangles needed for a single-strip representation of a model with holes on its boundary. Our strips can be used not only for efficient rendering, but also for other applications, including the generation of space-filling curves on a manifold of arbitrary topology. Comment: 12 pages, 10 figures. To appear at Eurographics 200
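
The cycle-partition step can be illustrated with a small sketch. In a closed manifold triangle mesh every triangle has exactly three neighbors, so the dual graph is cubic; deleting the edges of a perfect matching leaves every triangle with two neighbors, which means the remaining edges form disjoint cycles. The code below assumes the matching is already given (computing it is not shown) and that the dual graph has no repeated edges; the function name and the tetrahedron example are illustrative and not taken from the paper.

```python
def cycles_after_matching(dual, matching):
    """Return the cycles left in a cubic dual graph once matched edges are removed.

    dual:     dict mapping each triangle id to the set of its 3 adjacent triangles
    matching: iterable of (triangle, triangle) pairs forming a perfect matching
    """
    partner = {}
    for a, b in matching:
        partner[a] = b
        partner[b] = a

    # After dropping its matched edge, every triangle keeps exactly two neighbors.
    rest = {t: sorted(n for n in nbrs if n != partner[t]) for t, nbrs in dual.items()}

    seen, cycles = set(), []
    for start in sorted(rest):
        if start in seen:
            continue
        cycle, prev, cur = [], None, start
        while cur not in seen:
            seen.add(cur)
            cycle.append(cur)
            a, b = rest[cur]
            nxt = a if a != prev else b  # keep walking away from where we came from
            prev, cur = cur, nxt
        cycles.append(cycle)
    return cycles

# Dual graph of a tetrahedron: four triangles, each adjacent to the other three.
tetra_dual = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
cycles = cycles_after_matching(tetra_dual, [(0, 1), (2, 3)])
```

For the tetrahedron, this matching leaves a single cycle through all four triangles, so no merging is needed; when several cycles remain, the abstract's merge step (splitting adjacent triangles) would join them.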

    Random-Accessible Compressed Triangle Meshes

    Memory sharing for interactive ray tracing on clusters

    We present recent results in the application of distributed shared memory to image-parallel ray tracing on clusters. Image-parallel rendering is traditionally limited to scenes that are small enough to be replicated in the memory of each node, because any processor may require access to any piece of the scene. We solve this problem by making all of a cluster's memory available through software distributed shared memory layers. With gigabit Ethernet connections, this mechanism is sufficiently fast for interactive rendering of multi-gigabyte datasets. Object- and page-based distributed shared memories are compared, and optimizations for efficient memory use are discussed.

    Iterative stripification of a triangle mesh: focus on data structures

    In this paper, we describe the data structure and some implementation details of the tunneling algorithm for generating a set of triangle strips from a mesh of triangles. The algorithm uses a simple topological operation on the dual graph of the mesh to generate an initial stripification and then iteratively rearranges the strips and decreases their number. Our method is a major improvement on a previously proposed one, originally devised for both static and continuous level-of-detail (CLOD) meshes, and retains this feature. The use of a dynamic identification strategy for the strips allows us to drastically reduce the length of the search paths in the graph needed for the rearrangement and to produce loop-free triangle strips without any further checks or post-processing, at the cost of a more sophisticated implementation to manage the search and undo operations.

    An improved adjacency data structure for fast triangle stripping

    To speed up the rendering of polygonal meshes, triangle strips are commonly used to reduce the number of vertices sent to the graphics subsystem by exploiting the fact that adjacent triangles share an edge. In this paper, we present an improved adjacency data structure for fast triangle stripping algorithms. There are three major contributions: first, the data structure can be created quickly and robustly from any indexed face set; second, its cache-friendly layout is specifically designed to efficiently answer common stripping queries, such as neighbor finding and least-degree triangle finding, in constant time; third, the stripping algorithm operates in-place, since strips are created by simply relinking pointers. An implementation of a stripping algorithm shows a significant speed-up compared to other implementations. Our implementation is publicly available as part of OpenSG [9].
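
To make the idea of triangle adjacency built from an indexed face set concrete, here is a minimal sketch. It is not the data structure from the paper or from OpenSG: the function names and the list-of-lists layout are assumptions for illustration, and it does not reproduce the cache-friendly layout or the constant-time least-degree query the abstract describes.

```python
from collections import defaultdict

def build_triangle_adjacency(faces):
    """For each triangle, return the neighbor across each of its 3 edges (or None).

    faces: list of (a, b, c) vertex-index triples (an indexed face set).
    Edge k of a triangle runs from its k-th vertex to its (k+1)-th vertex.
    """
    edge_to_tris = defaultdict(list)
    for t, (a, b, c) in enumerate(faces):
        for k, (u, v) in enumerate(((a, b), (b, c), (c, a))):
            edge_to_tris[(min(u, v), max(u, v))].append((t, k))

    neighbors = [[None, None, None] for _ in faces]
    for tris in edge_to_tris.values():
        if len(tris) == 2:  # edge shared by exactly two triangles
            (t0, k0), (t1, k1) = tris
            neighbors[t0][k0] = t1
            neighbors[t1][k1] = t0
    return neighbors

def degree(neighbors, t):
    """Number of triangles adjacent to triangle t."""
    return sum(n is not None for n in neighbors[t])

# A fan of three triangles: 0 and 1 share edge (1, 2); 1 and 2 share edge (2, 3).
faces = [(0, 1, 2), (2, 1, 3), (2, 3, 4)]
adj = build_triangle_adjacency(faces)
```

Once the table is built, looking up a neighbor is a single index operation; a stripping algorithm would typically start from a low-degree triangle such as triangle 0 or 2 in this example.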

    Graph analytics on modern massively parallel systems

    Graphs provide a very flexible abstraction for understanding and modeling complex systems in many fields such as physics, biology, neuroscience, engineering, and social science. Only in the last two decades, with the advent of the Big Data era, have supercomputers equipped with accelerators (i.e., Graphics Processing Units (GPUs)), advanced networking, and highly parallel file systems been used to analyze graph properties such as reachability, diameter, connected components, centrality, and clustering coefficient. Today, graphs of interest may be composed of millions, sometimes billions, of nodes and edges and exhibit a highly irregular structure. As a consequence, the design of efficient and scalable graph algorithms is an extraordinary challenge due to irregular communication and memory access patterns, high synchronization costs, and lack of data locality. In the present dissertation, we start off with a brief and gentle introduction to graph analytics and massively parallel systems. In particular, we present the intersection between graph analytics and parallel architectures in the current state of the art and discuss the challenges encountered when solving such problems on large-scale graphs on these architectures (Chapter 1). In Chapter 2, some preliminary definitions and graph-theoretical notions are provided, together with a description of the synthetic graphs used in the literature to model real-world networks. In Chapters 3-5, we present and tackle three different relevant problems in graph analysis: reachability (Chapter 3), Betweenness Centrality (Chapter 4), and clustering coefficient (Chapter 5).
    In detail, Chapter 3 tackles reachability problems by providing two scalable algorithms and implementations which efficiently solve st-connectivity problems on very large-scale graphs. Chapter 4 considers the problem of identifying the most relevant nodes in a network, which plays a crucial role in several applications, including transportation and communication networks, social network analysis, and biological networks. In particular, we focus on a well-known centrality metric, namely Betweenness Centrality (BC), and present two different distributed algorithms for the BC computation on unweighted and weighted graphs. For unweighted graphs, we present a new communication-efficient algorithm based on the combination of bi-dimensional (2D) decomposition and multi-level parallelism. Furthermore, new algorithms which exploit the underlying graph topology to reduce the time and space usage of betweenness centrality computations are described as well. Concerning weighted graphs, we provide a scalable algorithm based on an algebraic formulation of the problem. Finally, through comprehensive experimental results on synthetic and real-world large-scale graphs, we show that the proposed techniques are effective in practice and achieve significant speedups against state-of-the-art solutions. Chapter 5 considers the clustering coefficient problem. Like Betweenness Centrality, the clustering coefficient is a fundamental tool in network analysis, as it specifically measures how nodes tend to cluster together in a network. In the chapter, we first extend caching techniques to Remote Memory Access (RMA) operations on distributed-memory systems. The caching layer is mainly designed to avoid inter-node communications in order to achieve similar benefits for irregular applications as communication-avoiding algorithms. We also show how cached RMA is able to improve the performance of a new distributed asynchronous algorithm for the computation of local clustering coefficients.
    Finally, Chapter 6 contains a brief summary of the key contributions described in the dissertation and presents potential future directions of the work.
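
For readers unfamiliar with the local clustering coefficient mentioned for Chapter 5, the following single-machine sketch computes it from its standard definition: for a node with k neighbors, it is the number of edges among those neighbors divided by k(k-1)/2. This is only a sequential reference under that definition, not the distributed asynchronous algorithm or the cached RMA layer from the dissertation; the function name and the toy graph are assumptions for illustration.

```python
def local_clustering(adj):
    """Local clustering coefficient of every node.

    adj: dict mapping node -> set of neighbors (undirected, no self-loops).
    Nodes with fewer than two neighbors get 0.0 by convention.
    """
    coeffs = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[v] = 0.0
            continue
        # Count each edge between two neighbors of v once (u < w).
        links = sum(1 for u in nbrs for w in adj[u] if w in nbrs and u < w)
        coeffs[v] = 2.0 * links / (k * (k - 1))
    return coeffs

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cc = local_clustering(graph)
```

The inner loop reads the neighbor lists of a node's neighbors, which in a distributed setting may live on other machines; that access pattern is what makes remote-access caching relevant to this computation.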