Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures
This paper evaluates the efficacy of recent commercial processing-in-memory
(PIM) solutions to accelerate fast Fourier transform (FFT), an important
primitive across several domains. Specifically, we observe that efficient
implementations of FFT on modern GPUs are memory bandwidth bound. As such, the
memory bandwidth boost availed by commercial PIM solutions makes a case for PIM
to accelerate FFT. To this end, we first deduce a mapping of FFT computation to
a strawman PIM architecture representative of recent commercial designs. We
observe that even with careful data mapping, PIM is not effective in
accelerating FFT. To address this, we make a case for collaborative
acceleration of FFT with PIM and GPU. Further, we propose software and hardware
innovations that reduce the number of PIM operations necessary for a given FFT. Overall, our
optimized PIM FFT mapping, termed Pimacolaba, delivers performance and data
movement savings of up to 1.38× and 2.76×, respectively, over a
range of FFT sizes.
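For readers unfamiliar with the primitive, the FFT in question can be sketched as a minimal radix-2 Cooley-Tukey recursion (a pure-Python illustration of the algorithm only; the function name and structure are ours, not the paper's mapping):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    Every stage combines element pairs ("butterflies") across the whole
    array, so large transforms stream the entire working set per stage,
    which is why efficient GPU FFTs end up memory-bandwidth bound.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # twiddle factor times the odd half, then one butterfly
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

For example, `fft([1, 2, 3, 4])` matches the direct DFT of that sequence.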
Multi-GPU support on the marrow algorithmic skeleton framework
Dissertation submitted for the degree of Master in Computer Engineering (Engenharia Informática).

With the proliferation of general-purpose GPUs, workload parallelization and data-transfer optimization have become increasing concerns. The natural evolution from using a single GPU is multiplying the number of available processors, which presents new challenges, such as tuning workload decomposition and load balancing when dealing with heterogeneous systems.
Higher-level programming is an important asset in a multi-GPU environment, given the complexity of the currently used GPGPU APIs (OpenCL and CUDA), which are low-level and impose considerable code overhead. It can be achieved by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestration,
such as transparent load-balancing mechanisms and reduced explicit code overhead.
Algorithmic skeletons, previously used in cluster environments, have recently been
adapted to the GPGPU context. Skeletons abstract most sources of code overhead by
defining the computation patterns of commonly used algorithms. The Marrow algorithmic
skeleton library is one of these, taking advantage of such abstractions to automate the
orchestration needed for efficient GPU execution.
This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons
in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine.
We were able to achieve a good balance between the simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs with an efficient load distribution, although at the price of some overhead when using a single GPU.

Projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200
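The skeleton idea described above can be sketched as a map pattern that hides partitioning and result merging from the caller (a deliberately simplified Python sketch; `map_skeleton` and `Device` are hypothetical names, not Marrow's actual C++/OpenCL API):

```python
class Device:
    """Stand-in for one GPU: here it simply applies the kernel on the host."""
    def run(self, fn, part):
        return [fn(x) for x in part]

def map_skeleton(fn, data, devices):
    """Multi-device map skeleton: the runtime splits the input across
    devices and concatenates the partial results, so the caller never
    writes explicit partitioning, transfer, or load-balancing code."""
    n = len(devices)
    chunk = (len(data) + n - 1) // n  # ceiling division: even-ish slices
    results = []
    for i, dev in enumerate(devices):
        part = data[i * chunk:(i + 1) * chunk]
        results.extend(dev.run(fn, part))  # each device runs its slice
    return results
```

A real implementation would dispatch the slices concurrently and weight them by device speed; the sketch only shows the abstraction boundary the thesis argues for.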
A Build System for Benchmarking and Comparison of Competing System Implementations
When developing a hardware or software system, the problem at hand may lend itself to multiple solutions. During the implementation process for such systems, it can be helpful to prototype multiple versions that use distinct paradigms, and determine the efficiency of each according to some metric, such as execution time. This paper presents a portable, lightweight build system designed for easy benchmarking and verification of competing implementations of an algorithm. Also presented is a sample project that uses this system to compare the performance and correctness of CPU, GPU, and FPGA implementations of a signal recovery algorithm.
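The compare-and-verify loop such a build system automates might look as follows (an illustrative sketch with made-up names, not the paper's actual tooling): every competing implementation is run on the same inputs, timed, and checked against a reference.

```python
import time

def benchmark(implementations, inputs, reference):
    """Time each competing implementation on identical inputs and verify
    its outputs against a trusted reference implementation.

    Returns {name: (elapsed_seconds, all_outputs_correct)}.
    """
    report = {}
    for name, fn in implementations.items():
        start = time.perf_counter()
        outputs = [fn(x) for x in inputs]
        elapsed = time.perf_counter() - start
        correct = all(out == reference(x) for out, x in zip(outputs, inputs))
        report[name] = (elapsed, correct)
    return report
```

For example, two ways of squaring integers can be raced and cross-checked with `benchmark({"mul": lambda x: x * x, "pow": lambda x: x ** 2}, range(1000), lambda x: x * x)`.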
Breaking Computational Barriers to Perform Time Series Pattern Mining at Scale and at the Edge
Uncovering repeated behavior in time series is an important problem in many domains, such as medicine, geophysics, and meteorology. With the continuing surge of smart/embedded devices generating time series data, there is an ever-growing need to perform analysis on datasets of increasing size. Additionally, there is an increasing need for analysis on low-power edge devices, due to latency limits inherent to the speed of light and the sheer amount of data being recorded. The matrix profile has proven to be a tool highly suitable for pattern mining in time series; however, a naive approach to computing the matrix profile makes it impossible to use effectively, both in the cloud and at the edge. This dissertation shows how, through the use of GPUs and machine learning, the matrix profile can be computed feasibly, both at cloud scale and at sensor scale. In addition, it illustrates why both of these types of computation are important and what new insights they can provide to practitioners working with time series data.
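The naive computation whose cost motivates the acceleration work can be sketched directly (illustrative brute-force code, not the dissertation's implementation): for each length-m subsequence, find the z-normalized Euclidean distance to its nearest non-overlapping neighbour.

```python
import math

def matrix_profile(ts, m):
    """Brute-force matrix profile, O(n^2 * m): for every length-m window,
    the distance to its closest non-trivial match. Repeated patterns show
    up as near-zero profile values. This quadratic cost is exactly what
    makes naive computation infeasible at cloud or sensor scale."""
    def znorm(s):
        mu = sum(s) / m
        sd = math.sqrt(sum((v - mu) ** 2 for v in s) / m) or 1.0
        return [(v - mu) / sd for v in s]

    subs = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    profile = []
    for i, a in enumerate(subs):
        best = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) >= m:  # exclude trivial matches overlapping window i
                d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
                best = min(best, d)
        profile.append(best)
    return profile
```

On a series containing an exact repeat, such as `[0, 1, 2, 0, 1, 2, 0, 1, 2]` with `m = 3`, the minimum profile value is (numerically) zero, flagging the repeated motif.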
Accelerated Encrypted Execution of General-Purpose Applications
Fully Homomorphic Encryption (FHE) is a cryptographic method that guarantees the privacy and security of user data during computation. FHE algorithms can perform unlimited arithmetic computations directly on encrypted data without decrypting it. Thus, even when processed by untrusted systems, confidential data is never exposed. In this work, we develop new techniques for accelerated encrypted execution and demonstrate the significant performance advantages of our approach. Our current focus is the Fully Homomorphic Encryption over the Torus (CGGI) scheme, a current state-of-the-art method for evaluating arbitrary functions in the encrypted domain. CGGI represents a computation as a graph of homomorphic logic gates, and each individual bit of the plaintext is transformed into a polynomial in the encrypted domain. Arithmetic on such data becomes very expensive: operations on bits become operations on entire polynomials. Therefore, evaluating even relatively simple nonlinear functions, such as a sigmoid, can take thousands of seconds on a single CPU thread. Using our novel framework for end-to-end accelerated encrypted execution, called ArctyrEX, developers with no knowledge of complex FHE libraries can simply describe their computation as a C program that is evaluated over 40x faster on an NVIDIA DGX A100, and 6x faster with a single A100, relative to a 256-thread CPU baseline.
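The gate-graph representation CGGI uses can be illustrated on plaintext bits (a deliberate simplification with names of our own choosing: under CGGI, every wire value below would be a ciphertext polynomial and every gate a costly homomorphic operation, typically followed by bootstrapping):

```python
def evaluate_circuit(gates, inputs):
    """Evaluate a topologically ordered list of logic gates.

    gates:  list of (output_wire, op, input_a, input_b) tuples;
            NOT gates ignore input_b (pass None).
    inputs: dict mapping input wire names to bits.

    This plaintext walk mirrors the structure of CGGI evaluation:
    the circuit is fixed, and acceleration comes from executing many
    independent (homomorphic) gates of the graph in parallel.
    """
    wires = dict(inputs)
    ops = {
        "AND": lambda a, b: a & b,
        "XOR": lambda a, b: a ^ b,
        "NOT": lambda a, _b: 1 - a,
    }
    for out, op, a, b in gates:
        wires[out] = ops[op](wires[a], wires.get(b, 0))
    return wires
```

A half adder, for instance, is the two-gate graph `[("s", "XOR", "x", "y"), ("c", "AND", "x", "y")]`.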
Data Visualization for Benchmarking Neural Networks in Different Hardware Platforms
The computational complexity of Convolutional Neural Networks has increased enormously; hence, numerous algorithmic optimization techniques have been widely proposed.
However, in a design space so complex, it is challenging to determine which optimization
will benefit from which type of hardware platform. This is why QuTiBench, a benchmarking
methodology, was recently proposed: it provides clarity into the design space. With
measurements resulting in more than nine thousand data points, it became difficult to
extract useful, rich information quickly and intuitively from the vast data collected.
This effort therefore describes the creation of a web portal where all the data is exposed
and can be adequately visualized. All the code developed in this project resides in a
public online GitHub repository, open to contributions.
Visualizations that grab the reader's interest and keep the eyes on the message are an
effective way to understand the data and spot trends. Thus, several types of plots were
used: rooflines, heatmaps, line plots, bar plots, and box-and-whisker plots.
Furthermore, as level-0 of QuTiBench performs a theoretical analysis of the data,
with no measurements required, its performance predictions were evaluated. We concluded
that the predictions successfully captured performance trends, although they are somewhat
optimistic, becoming inaccurate with increased pruning and quantization. The theoretical
analysis could be improved with greater awareness of which data is stored in on-chip
versus off-chip memory. Moreover, for FPGAs, performance predictions can be further
enhanced by taking the actual resource utilization and the achieved clock frequency of
the FPGA circuit into account. With these improvements to level-0 of QuTiBench, the
benchmarking methodology can become more accurate in future measurements, and thus more
reliable and useful to designers.
Moreover, additional measurements were taken: in particular, power, performance, and
accuracy were measured for Google's USB Accelerator, benchmarking EfficientNet S,
EfficientNet M, and EfficientNet L. In general, the performance measurements were
reproduced; however, it was not possible to reproduce the accuracy measurements.
Exponential integrators: tensor structured problems and applications
The solution of stiff systems of Ordinary Differential Equations (ODEs), which typically arise after spatial discretization of many important evolutionary Partial Differential Equations (PDEs), constitutes a topic of wide interest in numerical analysis. A prominent way to numerically integrate such systems involves using exponential integrators. In general, these schemes do not require the solution of (non)linear systems, but rather the action of the matrix exponential and of some specific exponential-like functions (known in the literature as phi-functions). In this PhD thesis we aim at presenting efficient tensor-based tools to approximate such actions, both from a theoretical and from a practical point of view, when the problem has an underlying Kronecker sum structure. Moreover, we investigate the application of exponential integrators to compute numerical solutions of important equations in various fields, such as plasma physics, mean-field optimal control, and computational chemistry. In all cases, we provide several numerical examples and perform extensive simulations, exploiting modern hardware architectures such as multi-core Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The results globally show the effectiveness and superiority of the different approaches proposed.
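As a concrete reminder of the objects involved, the first phi-function is phi_1(z) = (e^z - 1)/z, with phi_0 = exp and the recurrence phi_{k+1}(z) = (phi_k(z) - 1/k!)/z. Evaluating phi_1 through its Taylor series avoids the catastrophic cancellation of the closed form near z = 0 (a generic scalar sketch, not the thesis's tensor-structured methods):

```python
import math

def phi1(z, terms=30):
    """phi_1(z) = (e^z - 1)/z evaluated via its Taylor series
    sum_{k>=0} z^k / (k+1)!, which is stable near z = 0 where the
    closed form (exp(z) - 1)/z suffers cancellation."""
    return sum(z ** k / math.factorial(k + 1) for k in range(terms))
```

At z = 0 the series gives phi_1(0) = 1 exactly, whereas the closed form is 0/0; for moderate z it agrees with (e^z - 1)/z to machine precision.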