585 research outputs found

    Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures

    Full text link
    This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the memory bandwidth boost availed by commercial PIM solutions makes a case for PIM to accelerate FFT. To this end, we first deduce a mapping of FFT computation to a strawman PIM architecture representative of recent commercial designs. We observe that even with careful data mapping, PIM is not effective in accelerating FFT. To address this, we make a case for collaborative acceleration of FFT with PIM and GPU. Further, we propose software and hardware innovations which lower PIM operations necessary for a given FFT. Overall, our optimized PIM FFT mapping, termed Pimacolaba, delivers performance and data movement savings of up to 1.38×\times and 2.76×\times, respectively, over a range of FFT sizes

    Multi-GPU support on the marrow algorithmic skeleton framework

    Get PDF
    Dissertação para obtenção do Grau de Mestre em Engenharia InformáticaWith the proliferation of general purpose GPUs, workload parallelization and datatransfer optimization became an increasing concern. The natural evolution from using a single GPU, is multiplying the amount of available processors, presenting new challenges, as tuning the workload decompositions and load balancing, when dealing with heterogeneous systems. Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), because of their low-level and code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations such as transparent load balancing mechanism and reduced explicit code overhead. Algorithmic Skeletons, previously used in cluster environments, have recently been adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by defining computation patterns of commonly used algorithms. The Marrow algorithmic skeleton library is one of these, taking advantage of the abstractions to automate the orchestration needed for an efficient GPU execution. This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine. We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single-GPU.projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200

    A Build System for Benchmarking and Comparison of Competing System Implementations

    Get PDF
    When developing a hardware or software system, the problem at hand may lend itself to multiple solutions. During the implementation process for such systems, it can be helpful to prototype multiple versions that use distinct paradigms, and determine the efficiency of each according to some metric, such as execution time. This paper presents a portable, lightweight build system designed for easy benchmarking and verification of competing implementations of an algorithm. Also presented is a sample project that uses this system to compare the performance and correctness of CPU, GPU, and FPGA implementations of a signal recovery algorith

    Accelerated Encrypted Execution of General-Purpose Applications

    Get PDF
    Fully Homomorphic Encryption (FHE) is a cryptographic method that guarantees the privacy and security of user data during computation. FHE algorithms can perform unlimited arithmetic computations directly on encrypted data without decrypting it. Thus, even when processed by untrusted systems, confidential data is never exposed. In this work, we develop new techniques for accelerated encrypted execution and demonstrate the significant performance advantages of our approach. Our current focus is the Fully Homomorphic Encryption over the Torus (CGGI) scheme, which is a current state-of-the-art method for evaluating arbitrary functions in the encrypted domain. CGGI represents a computation as a graph of homomorphic logic gates and each individual bit of the plaintext is transformed into a polynomial in the encrypted domain. Arithmetic on such data becomes very expensive: operations on bits become operations on entire polynomials. Therefore, evaluating even relatively simple nonlinear functions, such as a sigmoid, can take thousands of seconds on a single CPU thread. Using our novel framework for end-to-end accelerated encrypted execution called ArctyrEX, developers with no knowledge of complex FHE libraries can simply describe their computation as a C program that is evaluated over 40x faster on an NVIDIA DGX A100 and 6x faster with a single A100 relative to a 256-threaded CPU baseline

    Data Visualization for Benchmarking Neural Networks in Different Hardware Platforms

    Get PDF
    The computational complexity of Convolutional Neural Networks has increased enor mously; hence numerous algorithmic optimization techniques have been widely proposed. However, in a space design so complex, it is challenging to choose which optimization will benefit from which type of hardware platform. This is why QuTiBench - a benchmarking methodology - was recently proposed, and it provides clarity into the design space. With measurements resulting in more than nine thousand data points, it became difficult to get useful and rich information quickly and intuitively from the vast data collected. Thereby this effort describes the creation of a web portal where all data is exposed and can be adequately visualized. All the code developed in this project resides in an online public GitHub repository, allowing contributions. Using visualizations which grab our interest and keep our eyes on the message is the perfect way to understand the data and spot trends. Thus, several types of plots were used: rooflines, heatmaps, line plots, bar plots and Box and Whisker Plots. Furthermore, as level-0 of QuTiBench performs a theoretical analysis of the data, with no measurements required, performance predictions were evaluated. We concluded that predictions successfully predicted performance trends. Although being somewhat optimistic because predictions become inaccurate with the increased pruning and quan tization. The theoretical analysis could be improved by the increased awareness of what data is stored in the on and off-chip memory. Moreover, for the FPGAs, performance predictions can be further enhanced by taking the actual resource utilization and the achieved clock frequency of the FPGA circuit into account. With these improvements to level-0 of QuTiBench, this benchmarking methodology can become more accurate on the next measurements, becoming more reliable and useful to designers. Moreover, more measurements were taken, in particular, power, performance and accuracy measurements were taken for Google’s USB Accelerator benchmarking Efficient Net S, EfficientNet M and EfficientNet L. In general, performance measurements were reproduced; however, it was not possible to reproduce accuracy measurements

    Exponential integrators: tensor structured problems and applications

    Get PDF
    The solution of stiff systems of Ordinary Differential Equations (ODEs), that typically arise after spatial discretization of many important evolutionary Partial Differential Equations (PDEs), constitutes a topic of wide interest in numerical analysis. A prominent way to numerically integrate such systems involves using exponential integrators. In general, these kinds of schemes do not require the solution of (non)linear systems but rather the action of the matrix exponential and of some specific exponential-like functions (known in the literature as phi-functions). In this PhD thesis we aim at presenting efficient tensor-based tools to approximate such actions, both from a theoretical and from a practical point of view, when the problem has an underlying Kronecker sum structure. Moreover, we investigate the application of exponential integrators to compute numerical solutions of important equations in various fields, such as plasma physics, mean-field optimal control and computational chemistry. In any case, we provide several numerical examples and we perform extensive simulations, eventually exploiting modern hardware architectures such as multi-core Central Processing Units (CPUs) and Graphic Processing Units (GPUs). The results globally show the effectiveness and the superiority of the different approaches proposed
    • …
    corecore