X-MAP A Performance Prediction Tool for Porting Algorithms and Applications to Accelerators

Author: Shetty Ashrit
Publication venue: Clemson University Libraries
Publication date: 01/08/2017

Most modern high-performance computing systems comprise one or more accelerators with varying architectures in addition to traditional multicore Central Processing Units (CPUs). Examples of these accelerators include Graphics Processing Units (GPUs) and Intel’s Many Integrated Core architecture, called Xeon Phi (PHI). These architectures provide massively parallel computation capabilities, which yield substantial performance benefits over traditional CPUs for a variety of scientific applications. Accelerators are not all alike, because each has its own unique architecture. This difference in the underlying architecture plays a crucial role in determining whether a given accelerator will provide a significant speedup over its competition. In addition to the architecture itself, another differentiating factor for these accelerators is the programming language used to program them. For example, Nvidia GPUs can be programmed using Compute Unified Device Architecture (CUDA) and OpenCL, while Intel Xeon PHIs can be programmed using OpenMP and OpenCL. The choice of programming language also plays a critical role in the speedup obtained, depending on how close the language is to the hardware and on the level of optimization. It is therefore very difficult for an application developer to choose the ideal accelerator to achieve the best possible speedup. In light of this, we present X-MAP, an easy-to-use Graphical User Interface (GUI) performance prediction tool for porting algorithms and applications to accelerators. X-MAP encompasses a Machine Learning based inference model that predicts the performance of an application on a number of well-known accelerators and, at the same time, predicts the best architecture and programming language for the application.
We do this by collecting hardware counters from a given application and predicting run time by providing this data as inputs to a Neural Network Regressor based inference model. We predict the architecture and associated programming language by providing the hardware counters as inputs to an inference model based on a Random Forest Classification Model. Finally, a mean absolute prediction error of 8.52, together with features such as syntax highlighting for multiple programming languages, a function-wise breakdown of the entire application to understand bottlenecks, and the ability for end users to submit their own prediction models to further improve the system, makes X-MAP a unique tool that has a significant edge over existing performance prediction solutions.

Clemson University: TigerPrints

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

Author: A Arnold
A Faradjian
B Hess
C Schütte
G Wilson
JA Anderson
JC Phillips
KJ Bowers
L Verlet
M Eleftheriou
M Shirts
MJ Abraham
P Eastman
R Yokota
S Pronk
S Páll
U Essmann
W Humphrey
WM Brown
Y Andoh
Y Sugita
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity. Comment: EASC 2014 conference proceedin

arXiv.org e-Print Archive

Publikationer från KTH

Crossref

Digitala Vetenskapliga Arkivet - Academic Archive On-line

MPG.PuRe

GPU-resident sparse direct linear solvers for alternating current optimal power flow analysis

Author: Abhyankar Shrirang
Anzt Hartwig
Göbel Fritz
Koukpaizan Nicholson
Peleš Slaven
Ribizel Tobias
Świrydowicz Kasia
Publication venue: Elsevier
Publication date: 21/11/2023

Integrating renewable resources within the transmission grid at a wide scale poses significant challenges for economic dispatch as it requires analysis with more optimization parameters, constraints, and sources of uncertainty. This motivates the investigation of more efficient computational methods, especially those for solving the underlying linear systems, which typically take more than half of the overall computation time. In this paper, we present our work on sparse linear solvers that take advantage of hardware accelerators, such as graphics processing units (GPUs), and improve the overall performance when used within economic dispatch computations. We treat the problems as sparse, which allows for faster execution but also makes the implementation of numerical methods more challenging. We present the first GPU-native sparse direct solver that can execute on both AMD and NVIDIA GPUs. We demonstrate significant performance improvements when using high-performance linear solvers within alternating current optimal power flow (ACOPF) analysis. Furthermore, we demonstrate the feasibility of getting significant performance improvements by executing the entire computation on GPU-based hardware. Finally, we identify outstanding research issues and opportunities for even better utilization of heterogeneous systems, including those equipped with GPUs.

KITopen

High Performance Implementation of Support Vector Machines Using OpenCL

Author: Peters Ethan
Publication venue: RIT Scholar Works
Publication date: 01/05/2014

Support Vector Machines are a machine learning approach that is well studied, thoroughly vetted and effective in a large number of applications. The objective of this thesis is to accelerate an implementation of Support Vector Machines (SVM) using a heterogeneous computing system programmed using OpenCL in C/C++. LIBSVM, a widely available, popular and open source implementation of SVM, is chosen, allowing the presented work to be integrated seamlessly into existing systems. The proposed framework is evaluated in terms of speed and accuracy when performing training and classification on a number of standard data sets. Testing was based on two workstation GPUs, the NVIDIA GTX 480 and Tesla K20, and a modern workstation CPU (Intel i5 Quad Core, 3 GHz). We find that, for large data sets, training is accelerated by a factor ranging from 9 to 22. In general, speedup increases with the total number of training samples in the data set until the GPU device is fully utilized. While these gains in speedup are significant, they do not match the ideal parallel speedup, that is, the total number of cores in the parallel system. Our findings indicate that performance is hampered by the portions of the SVM training algorithm that are sequential. In addition, we find that the classification phase of the SVM system is accelerated by a factor of up to 12. During classification only a relatively small number of samples are classified compared to the typical number of training samples, and the computational complexity of classification grows only linearly with the number of samples processed, as opposed to the training phase, where it grows quadratically. The contributions of this thesis include the use of OpenCL for accelerating SVM training and testing on heterogeneous systems, and the performance analysis of the acceleration of SVM.
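
The gap between the measured speedups and the ideal parallel speedup can be illustrated with Amdahl's law. This is a textbook model brought in here for illustration, not the analysis used in the thesis. The sketch below assumes a fixed sequential fraction of the run time; the worker count of 512 is a hypothetical value, not the core count of either GPU named above, and the function names are illustrative.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the run time is sequential."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Sequential part is unchanged; parallel part is divided among the workers.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def implied_serial_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's law: the sequential fraction that would yield `speedup`."""
    if workers < 2:
        raise ValueError("workers must be at least 2 to invert the model")
    if not 1.0 <= speedup <= workers:
        raise ValueError("speedup must lie in [1, workers]")
    return (1.0 / speedup - 1.0 / workers) / (1.0 - 1.0 / workers)


if __name__ == "__main__":
    workers = 512  # hypothetical number of parallel workers (assumption)
    for observed in (9.0, 22.0):  # range of training speedups quoted in the abstract
        s = implied_serial_fraction(observed, workers)
        print(f"speedup {observed:>4}: implied sequential fraction = {s:.4f}")
```

Under this model, even a sequential share of a few percent caps the speedup far below the worker count, which is consistent with the abstract's observation that the sequential portions of training limit the gains.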

RIT Scholar Works

GPU-Resident Sparse Direct Linear Solvers for Alternating Current Optimal Power Flow Analysis

Author: Abhyankar Shrirang
Anzt Hartwig
Göbel Fritz
Koukpaizan Nicholson
Peleš Slaven
Ribizel Tobias
Świrydowicz Kasia
Publication date: 15/08/2023

arXiv.org e-Print Archive

Analyzing the Hardware-Software Implications of Multi-modal DNN Workloads using MMBench

Author: Cheng Kwang-Ting
Hou Xiaofeng
Li Chao
Liu Jiacheng
Sun Linyu
Tang Xuehan
Xu Cheng
Publication date: 08/12/2022

The explosive growth of various types of big data and advances in AI technologies have catalyzed a new type of application called multi-modal DNNs. Multi-modal DNNs are capable of interpreting and reasoning about information from multiple modalities, making them more applicable to real-world AI scenarios. In recent research, multi-modal DNNs have outperformed the best uni-modal DNN in a wide range of applications from traditional multimedia to emerging autonomous systems. However, despite their importance and superiority, very limited research attention has been devoted to understanding the characteristics of multi-modal DNNs and their implications for current computing software/hardware platforms. To facilitate research and advance the understanding of these multi-modal DNN workloads, we first present MMBench, an open-source benchmark suite consisting of a set of real-world multi-modal DNN workloads with relevant performance metrics for evaluation. Then we use MMBench to conduct an in-depth analysis of the characteristics of multi-modal DNNs. We study their implications for application and programming framework, operating and scheduling system, as well as execution hardware. Finally, we conduct a case study and extend our benchmark to edge devices. We hope that our work can provide guidance for future software/hardware design and optimization to underpin multi-modal DNNs on both cloud and edge computing platforms.

arXiv.org e-Print Archive

Exploiting approximation, caching and specialization to accelerate vision sensing applications

Author: HUYNH Nguyen Loc
Publication venue: Singapore Management University
Publication date: 01/09/2019

Institutional Knowledge at Singapore Management University

The fast multipole method at exascale

Author: Chandramowlishwaran Aparna
Publication venue: Georgia Institute of Technology
Publication date: 13/01/2014

This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of FMM, we present two applications, namely direct simulation of blood, which is a multi-scale multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a peta-flop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints.
The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match and in some cases even exceed competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically to implement FMM in the new distributed CnC, distCnC, to express fine-grained parallelism, which would require significant effort in alternative models. Ph.D

Scholarly Materials And Research @ Georgia Tech