47 research outputs found

    Graph analytics on modern massively parallel systems

    Get PDF
    Graphs provide a very flexible abstraction for understanding and modeling complex systems in many fields such as physics, biology, neuroscience, engineering, and social science. Only in the last two decades, with the advent of Big Data era, supercomputers equipped by accelerators –i.e., Graphics Processing Unit (GPUs)–, advanced networking, and highly parallel file systems have been used to analyze graph properties such as reachability, diameter, connected components, centrality, and clustering coefficient. Today graphs of interest may be composed by millions, sometimes billions, of nodes and edges and exhibit a highly irregular structure. As a consequence, the design of efficient and scalable graph algorithms is an extraordinary challenge due to irregular communication and memory access patterns, high synchronization costs, and lack of data locality. In the present dissertation, we start off with a brief and gentle introduction for the reader to graph analytics and massively parallel systems. In particular, we present the intersection between graph analytics and parallel architectures in the current state-of-the-art and discuss the challenges encountered when solving such problems on large-scale graphs on these architectures (Chapter 1). In Chapter 2, some preliminary definitions and graph-theoretical notions are provided together with a description of the synthetic graphs used in the literature to model real-world networks. In Chapters 3-5, we present and tackle three different relevant problems in graph analysis: reachability (Chapter 3), Betweenness Centrality (Chapter 4), and clustering coefficient (Chapter 5). In detail, Chapter 3 tackles reachability problems by providing two scalable algorithms and implementations which efficiently solve st-connectivity problems on very large-scale graphs Chapter 4 considers the problem of identifying most relevant nodes in a network which plays a crucial role in several applications, including transportation and communication networks, social network analysis, and biological networks. In particular, we focus on a well-known centrality metrics, namely Betweenness Centrality (BC), and present two different distributed algorithms for the BC computation on unweighted and weighted graphs. For unweighted graphs, we present a new communication-efficient algorithm based on the combination of bi-dimensional (2D) decomposition and multi-level parallelism. Furthermore, new algorithms which exploit the underlying graph topology to reduce the time and space usage of betweenness centrality computations are described as well. Concerning weighted graphs, we provide a scalable algorithm based on an algebraic formulation of the problem. Finally, thorough comprehensive experimental results on synthetic and real- world large-scale graphs, we show that the proposed techniques are effective in practice and achieve significant speedups against state-of-the-art solutions. Chapter 5 considers clustering coefficients problem. Similarly to Betweenness Centrality, it is a fundamental tool in network analysis, as it specifically measures how nodes tend to cluster together in a network. In the chapter, we first extend caching techniques to Remote Memory Access (RMA) operations on distributed-memory system. The caching layer is mainly designed to avoid inter-node communications in order to achieve similar benefits for irregular applications as communication-avoiding algorithms. We also show how cached RMA is able to improve the performance of a new distributed asynchronous algorithm for the computation of local clustering coefficients. Finally, Chapter 6 contains a brief summary of the key contributions described in the dissertation and presents potential future directions of the work

    A Computational Model for Tensor Core Units

    Full text link
    To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model

    Parallel execution of the ASP computation - An investigation on GPUs

    Get PDF
    This paper illustrates the design and implementation of a conflict-driven ASP solver that is capable of exploiting the Single-Instruction Multiple-Thread parallelism offered by General Purpose Graphical Processing Units (GPUs). Modern GPUs are multi-core platforms, providing access to large number of cores at a very low cost, but at the price of a complex architecture with non-trivial synchronization and communication costs. The search strategy of the ASP solver follows the notion of ASP computation, that avoids the generation of unfounded sets. Conflict analysis and learning are also implemented to help the search. The CPU is used only to pre-process the program and to output the results. All the solving components, i.e., nogoods management, search strategy, (non-chronological) backjumping, heuristics, conflict analysis and learning, and unit propagation, are performed on the GPU by exploiting SIMT parallelism. The preliminary experimental results confirm the feasibility and scalability of the approach, and the potential to enhance performance of ASP solvers

    Multi-GPU aggregation-based AMG preconditioner for iterative linear solvers

    Full text link
    We present and release in open source format a sparse linear solver which efficiently exploits heterogeneous parallel computers. The solver can be easily integrated into scientific applications that need to solve large and sparse linear systems on modern parallel computers made of hybrid nodes hosting NVIDIA Graphics Processing Unit (GPU) accelerators. The work extends our previous efforts in the exploitation of a single GPU accelerator and proposes an implementation, based on the hybrid MPI-CUDA software environment, of a Krylov-type linear solver relying on an efficient Algebraic MultiGrid (AMG) preconditioner already available in the BootCMatchG library. Our design for the hybrid implementation has been driven by the best practices for minimizing data communication overhead when multiple GPUs are employed, yet preserving the efficiency of the single GPU kernels. Strong and weak scalability results on well-known benchmark test cases of the new version of the library are discussed. Comparisons with the Nvidia AmgX solution show an improvement of up to 2.0x in the solve phase

    Towards a learning-based performance modeling for accelerating Deep Neural Networks

    Full text link
    Emerging applications such as Deep Learning are often data-driven, thus traditional approaches based on auto-tuners are not performance effective across the wide range of inputs used in practice. In the present paper, we start an investigation of predictive models based on machine learning techniques in order to optimize Convolution Neural Networks (CNNs). As a use-case, we focus on the ARM Compute Library which provides three different implementations of the convolution operator at different numeric precision. Starting from a collation of benchmarks, we build and validate models learned by Decision Tree and naive Bayesian classifier. Preliminary experiments on Midgard-based ARM Mali GPU show that our predictive model outperforms all the convolution operators manually selected by the library

    A one-pot approach to novel pyridazine c-nucleosides

    Get PDF
    The synthesis of glycosides and modified nucleosides represents a wide research field in organic chemistry. The classical methodology is based on coupling reactions between a glycosyl donor and an acceptor. An alternative strategy for new C-nucleosides is used in this approach, which consists of modifying a pre-existent furyl aglycone. This approach is applied to obtain novel pyridazine C-nucleosides starting with 2-and 3-(ribofuranosyl)furans. It is based on singlet oxygen [4+2] cycloaddition followed by reduction and hydrazine cyclization under neutral conditions. The mild three-step one-pot procedure leads stereoselectively to novel pyridazine C-nucleosides of pharmacological interest. The use of acetyls as protecting groups provides an elegant direct route to a deprotected new pyridazine C-nucleoside

    Scaling Expected Force: Efficient Identification of Key Nodes in Network-based Epidemic Models

    Full text link
    Centrality measures are fundamental tools of network analysis as they highlight the key actors within the network. This study focuses on a newly proposed centrality measure, Expected Force (EF), and its use in identifying spreaders in network-based epidemic models. We found that EF effectively predicts the spreading power of nodes and identifies key nodes and immunization targets. However, its high computational cost presents a challenge for its use in large networks. To overcome this limitation, we propose two parallel scalable algorithms for computing EF scores: the first algorithm is based on the original formulation, while the second one focuses on a cluster-centric approach to improve efficiency and scalability. Our implementations significantly reduce computation time, allowing for the detection of key nodes at large scales. Performance analysis on synthetic and real-world networks demonstrates that the GPU implementation of our algorithm can efficiently scale to networks with up to 44 million edges by exploiting modern parallel architectures, achieving speed-ups of up to 300x, and 50x on average, compared to the simple parallel solution
    corecore